consisting of an interpretation context p and a temporal interval I. Context intervals represent an interpretation context over a time interval; within the scope of that interval, it can change the interpretation of one or more parameters. P is a proposition, and one has to clarify what it means that p is interpreted over an interval I. The GFO-framework provides concepts and methods to make all these notions clear.

6.5 Event Proposition

An event proposition represents, according to AbstrOnt, the occurrence of an external volitional action or process such as "administering a drug". Events have a series of event attributes a(i), and each attribute a(i) must be mapped to an attribute value. There exists an is-a hierarchy of event schemas (or event types). Event schemas have a list of attributes a(i) where each attribute has a domain of possible values v(i). An event proposition is an event schema in which each attribute a(i) is mapped to some value v(i) belonging to the set V(i). A part-of relation is defined over a set of event schemas. If the pair (e(i), e(j)) belongs to the part-of relation, then the event e(i) can be a subevent of the event schema e(j) (example: a clinical protocol event can have several parts, all of them medication events). An event interval is a structure <e, I> consisting of an event proposition e and a time interval I. Intuitively, e is an event proposition (with a list of attribute-value pairs), and I is a time interval over which the event proposition e is interpreted. The time interval represents the duration of the event.

In the GFO-framework we get the following interpretation. A volitional action or process proc is an individual which belongs to a certain situoid S. As part of reality the entity proc may be rather complex; certainly, proc has a temporal extension and participants which are involved in proc, e.g. physicians, patients, and other entities. For the purpose of modeling we need only selected, relevant information about proc. This selected information is presented by a list of attributes a(i) and values v(i), and this list, denoted by – say – expr(proc), is considered as an event proposition. In GFO, expr(proc) is not a proposition (because it has no truth-value), but a symbolic representation of partial information about proc. An event schema can be understood, in the framework of GFO, as a category (the GFO term for class/universal) whose instances are event expressions of the kind expr(proc). Event schemas themselves are, hence, also presented by certain symbolic expressions schemexpr. Attributes are interpreted in GFO as property-universals whose instances have no independent existence but are associated to – are dependent on – individual entities such as objects or processes. GFO provides a framework for analyzing all possible properties of processes. In many
cases so-called properties of processes are not genuine process-properties but only properties of certain presentials which are derived from processes by projecting them onto time boundaries. Since GFO provides several formal languages to represent information about real world entities, there might be information, specified in one of these languages, which cannot be represented as a list of attribute-value pairs. This implies that GFO provides means for expressing information which exceeds the possibilities of the AbstrOnt-framework. Furthermore, the part-of relation mentioned above is not clearly described. Part-of is defined for event schemas, i.e., in the GFO-framework, between categories (universals/classes). But what, exactly, does it mean that a schema expression schemexpr1 is a part of the schema expression schemexpr2? In GFO there is the following interpretation: every instance e1 of schemexpr1, which refers to a real world event event1, is a part of an instance e2 of schemexpr2 referring to a real world event event2. The final understanding of this part-of relation should be reduced to the understanding of what it means that event1 is a part of event2. The GFO-framework allows one to analyze and describe several kinds of part-of relations, among them – the most natural – that event1 is a temporal part of event2.

6.6 Event Interval

An event interval, according to AbstrOnt, is a structure <e, I> consisting of an event proposition e and a time interval I. Intuitively, e is an event proposition (with a list of attribute-value pairs) and I is a time interval over which the event proposition e is interpreted. The time interval represents the duration of the event. In GFO this can be interpreted as follows, which shows that in the above description there is a hidden subtle ambiguity. Let e be the expression expr(proc) which refers to a real world process proc. If e is interpreted over the interval I, then the natural meaning is that the associated process proc has the uniquely determined temporal extension I (as a chronoid). But then it is not reasonable to say that proc is interpreted over the duration of I. The other meaning is that proc has a temporal extension (a framing chronoid I) which has a certain duration d. But in this case the expression e does not refer to a unique single process proc but to any process satisfying the attribute-value pairs and having a temporal extension of duration d. In this case the expression e does not represent a unique process but exhibits a GFO-category (a universal/class) whose instances are individual real world processes. This consideration uncovers that we may find the same ambiguity in the notion of an event proposition as introduced above. The GFO-framework is sufficiently expressive to find the correct disambiguations of the above notions. The remaining notions of AbstrOnt, like parameter schema, parameter interval, abstraction functions, abstraction, abstraction goal and others are analysed in the extended paper.
7 Conclusion

The current paper presents some results of a case study which was carried out to apply and test the GFO-framework on existing ontologies, in particular the ontology AbstrOnt [2]. We hope that our preliminary results already
a) indicate that GFO is an expressive ontology which allows for the reconstruction of other ontologies like AbstrOnt, b) show that several concepts and notions in AbstrOnt are ambiguous and not sufficiently founded, and c) demonstrate that GFO can be used to disambiguate several definitions in AbstrOnt and to elaborate a better foundation for existing ontologies. The future research of the Onto-Med group includes the reconstruction of other ontologies of the medical and biomedical domain within the framework of GFO.
Acknowledgements We would like to thank the anonymous referees of this paper for their many useful comments that helped to improve this paper’s presentation. Many thanks to the members of the Onto-Med research group for fruitful discussions. In particular we thank Frank Loebe for his effort in the editorial work of this paper.
References 1. Onto-Med Research Group: http://www.onto-med.de 2. Shahar, Y. 1994. A Knowledge-Based Method for Temporal Abstraction of Clinical Data [PhD Thesis]. Stanford University. 3. Gruber, T. R. 1993. A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition 5(2):199-220. 4. Musen, M. 1992. Dimensions of Knowledge Sharing and Reuse. Computer and Biomedical Research 25(5):435-467. 5. Fensel, D., Hendler, J., Lieberman, H., Wahlster, W. (eds.) 2003. Spinning the Semantic Web. Cambridge: MIT Press. 6. Guarino, N. 1998. Formal Ontology and Information Systems. In: Guarino, N. (ed.) 1st International Conference (FOIS-98). Trento, Italy, June 6-8. p. 3-15. Amsterdam: IOS Press. 7. Guarino, N., Welty, C. 2002. Evaluating Ontological Decisions with OntoClean. Communications of the ACM 45(2):61-65. 8. Poli, R. 2002. Ontological Methodology. International Journal of Human-Computer Studies 56(6):639–664. 9. Sowa, J. F. 2000. Knowledge Representation - Logical, Philosophical and Computational Foundations. Pacific Grove, California: Brooks/Cole. 10. Gracia, J. J. E. 1999. Metaphysics and Its Tasks: The Search for the Categorial Foundation of Knowledge. SUNY Series in Philosophy. Albany: State University of New York Press. 11. Heller, B., Herre, H. 2004. Ontological Categories in GOL. Axiomathes 14(1):57-76. 12. Galton, A. 2004. Fields and Objects in Space, Time, and Space-time. Spatial Cognition and Computation 4(1):39-68. 13. Seibt, J. 2000. The Dynamic Constitution of Things. In: Faye, J. et al. (eds.) Facts and Events. Poznan Studies in Philosophy of Science vol. 72, p. 241-278. 14. Brentano, F. 1976. Philosophische Untersuchungen zu Raum, Zeit und Kontinuum. Hamburg: Felix-Meiner Verlag.
15. Barwise, J., Perry, J. 1983. Situations and Attitudes. Bradford Books. Cambridge (Massachusetts): MIT Press. 16. Devlin, K. 1991. Logic and Information. Cambridge (UK): Cambridge University Press. 17. Heller, B., Herre, H., Lippoldt, K. 2004. The Theory of Top-Level Ontological Mappings and Its Application to Clinical Trial Protocols. In: Günter, A. et al. (eds.) Engineering Knowledge in the Age of the Semantic Web, Proceedings of the 14. International Conference, EKAW 2004, Whittlebury Hall, UK, Oct 2004. LNAI vol. 3257, p. 1-14. 18. Herre, H., Heller, B. 2005. Ontology of Time in GFO, Onto-Med Report Nr. 8 Onto-Med Research Group, University of Leipzig. 19. Scheidler, A. 2004. A Proof of the Interpretability of the GOL Theory of Time in the Interval-based Theory of Allen-Hayes [BA-Thesis]. Onto-Med Report Nr. 9, 2005. OntoMed Research Group, University of Leipzig. 20. Allen, J., Hayes, P. J. 1989. Moments and Points in an Interval-Based Temporal Logic. Computational Intelligence 5:225-238. 21. Loebe, F. 2003. An Analysis of Roles: Towards Ontology-Based Modelling [Master's Thesis]. Onto-Med Report No. 6. Aug 2003. Onto-Med Research Group, University of Leipzig. 22. Loux, M. 1998. Metaphysics: A Contemporary Introduction. New York: Routledge. 23. Barwise, J. 1989. The Situation in Logic. CSLI Lecture Notes, Vol. 17. Stanford (California): CSLI Publications. 24. Hoehndorf, R. 2005. Situoid Theory: An Ontological Approach to Situation Theory [Master’s Thesis]. University of Leipzig.
The Use of Verbal Classification for Determining the Course of Medical Treatment by Medicinal Herbs

Leonas Ustinovichius1, Robert Balcevich2, Dmitry Kochin3, and Ieva Sliesoraityte4

1 Vilnius Gediminas Technical University, LT-10223, Sauletekio al. 11, Vilnius, Lithuania, [email protected]
2 Public company "Natura Sanat", LT-01114, Olandu 19-2, Vilnius, Lithuania, [email protected]
3 Institute for System Analysis, 117312, prosp. 60-letija Octjabrja, 9, Moscow, Russia, [email protected]
4 Vilnius University, Department of Medicine, LT-2009, Ciurlionio 21, Vilnius, Lithuania, [email protected]
Abstract. About 44 treatment schemes including 40 medicines prepared from Andean medicinal herbs have been developed to treat a great variety of diseases, such as cancer and other serious illnesses. The main problem is to choose a course of medical treatment for a given disease from this variety, depending on particular criteria characterizing a patient. A classification approach may be used for this purpose. Classification is a very important aspect of decision making. It means assigning objects to particular classes. Classified objects are described by various criteria that can be qualitatively or quantitatively evaluated. In a multicriteria environment it is hardly possible to achieve this without resorting to special techniques. The paper presents a feasibility study of using verbal classification for determining a course of medical treatment, depending on a particular disease and the patient's personality.
1 Introduction

There are few regions in the world where biological diversity and indigenous knowledge of the use of medicinal plants are higher than in the Peruvian Amazon [1], [2]. Although phytochemical and pharmacological information available on the region's genetic resources is limited, there is little doubt about its tremendous potential. To uncover that potential, and to preserve the region's ecosystems and the societies that have lived in them for many generations, will be, no doubt, one of the most important challenges for mankind in the century to come [3], [4], [5], [6]. Andean Medicine Centre is a company specialising in popularisation and propagation of knowledge about Andean phytotherapy and healing plants of South American origin. The company is located in London and has its agencies and representatives in numerous countries over the world, e.g. Germany, Poland, Italy, Lithuania, Latvia,
Czech Republic, Sweden, Peru, Bulgaria and Slovakia. Andean Medicine Centre is offering a wide range of Amazon and Andean herbal formulas and cosmetic products. The "List of treatments" prepared by a leading Peruvian phytotherapist, Professor Manuel Fernandez Ibarguen, and recommended by the consultative medical team of Andean Medicine Centre is at the same time the reference book both for the AMC consultants and the patients. The major goal of the authors was to correlate ethnopharmacological data on medicinal properties of plants gathered from local inhabitants with knowledge on the ethnomedical, pharmacological, and phytochemical properties of the same plants gathered through research studies.

There is a set of 44 different treatments composed of about 40 preparations recommended for a number of diseases, including such malignant diseases as cancer [7], [8], [9]. Both the duration of the treatment and the number of preparations applicable depend on a given disease, as well as on the stage of its development. As a rule, the initial stage of all treatments is the purifying treatment, having a diuretic effect. There are some contraindications for using "Purification", like pregnancy and breastfeeding, that apply to the administration of all the preparations. Some of the preparations, like Vilcacora, have even more contraindications (e.g. it cannot be administered during chemo- and radiotherapy, as well as to patients after organ transplantation, to those treated with insulin and so on) [10], [11], [12], [13], [14]. If the above-mentioned Vilcacora cannot be used, it is substituted by other preparations (most often Tahuari or Graviola). The other most common reason for deviating from the AMC course of treatment is the funds the patient can spare on his/her treatment. In the case when these funds are insufficient, the consultant is obliged to suggest the use of one or two of the most important preparations, like Vilcacora.

In choosing a particular course of treatment it is of paramount importance to base oneself on a set of criteria characterizing a patient and his/her personality. A method of classification may be used for this purpose. Classification is a very important aspect of decision making. It means assigning projects to particular classes. Very often it is stated that classes in decision making are determined by particular parameters, i.e. the efficiency of technical and technological decisions, the credit value of the project, etc. Classified projects are described by assessing various efficiency criteria that can be both qualitatively and quantitatively expressed. In fact, many different methods for solving multicriteria classification problems are widely known. ORCLASS, an ordinal classification method, was one of the first methods designed to solve these kinds of problems [15]. Later, more recent methods, such as DIFCLASS [16], CLARA [17] and CYCLE [18], appeared. The present paper describes a feasibility study of using verbal classification for determining a course of medical treatment, depending on a particular disease and the patient's personality.
2 Formal Problem Statement

Given:
1. G – a feature corresponding to the target criterion (e.g. treatment effectiveness).
2. K = {K_1, K_2, ..., K_N} – a set of criteria used to assess each alternative (course of treatment).
3. S_q = {k_1^q, ..., k_{w_q}^q}, for q = 1, ..., N – a set of verbal estimates on the scale of criterion K_q, where w_q is the number of estimates for criterion K_q; the estimates in S_q are ordered by increasing intensity of the feature G.
4. Y = S_1 × ... × S_N – the space of alternative features to be classified. Each alternative is described by a set of estimates obtained by using the criteria K_1, ..., K_N and can be presented as a vector y ∈ Y, where y = (y_1, y_2, ..., y_N) and y_q is the index of an estimate from the set S_q.
5. C = {C_1, ..., C_M} – a set of decision classes, ordered by increasing intensity of the feature G.

A binary relation of strict dominance is introduced:

P = {(x, y) ∈ Y × Y | x_q ≥ y_q for all q = 1, ..., N, and ∃ q_0 : x_{q_0} > y_{q_0}} .   (1)

One can see that this relation is anti-reflexive, anti-symmetric and transitive. It may also be useful to consider a reflexive, anti-symmetric, transitive binary relation of weak dominance Q:

Q = {(x, y) ∈ Y × Y | x_q ≥ y_q for all q = 1, ..., N} .   (2)

Goal: on the basis of the DM's preferences, to construct a mapping F: Y → {Y_i}, i = 1, ..., M, where Y_i is the set of vector estimates belonging to class C_i, satisfying the condition of consistency:

∀ x, y ∈ Y : x ∈ Y_i, y ∈ Y_j, (x, y) ∈ P ⇒ i ≥ j .   (3)
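To make the dominance relations and the consistency condition concrete, the following is a small illustrative sketch in Python. It is not code from the paper; the function names and the toy data are invented for the example.

```python
# Illustrative sketch (not from the paper): the strict/weak dominance
# relations P and Q and the consistency condition (3) for an ordinal
# classification, with alternatives represented as tuples of estimate indices.

def weakly_dominates(x, y):
    """Relation Q: x_q >= y_q for every criterion q."""
    return all(xq >= yq for xq, yq in zip(x, y))

def strictly_dominates(x, y):
    """Relation P: weak dominance plus strict inequality on some criterion."""
    return weakly_dominates(x, y) and any(xq > yq for xq, yq in zip(x, y))

def is_consistent(classification):
    """Condition (3): if x strictly dominates y, then class(x) >= class(y).
    `classification` maps vectors (tuples) to class indices 1..M."""
    items = list(classification.items())
    for x, cx in items:
        for y, cy in items:
            if strictly_dominates(x, y) and cx < cy:
                return False
    return True

# Example on a tiny 2-criterion space with 3-point verbal scales:
classes = {(3, 3): 3, (2, 2): 2, (1, 1): 1, (3, 1): 2}
print(is_consistent(classes))            # True
classes[(1, 2)] = 3                      # a vector dominated by (2, 2)...
print(is_consistent(classes))            # False: violates condition (3)
```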
2.1 Analytical Verbal Decision Methods for Classification of Alternatives
In this chapter some of the most frequently used verbal ordinal classification methods are considered. All these methods belong to the Verbal Decision Analysis group and have the following common features [15]:
1. The attribute scales are based on verbal descriptions that are not changed in the process of solution; verbal evaluations are not converted into a numerical form or score.
2. An interactive classification procedure is performed in steps, where the DM is offered an object of analysis (a course of treatment). A project is presented as a small set of rankings. The DM is familiar with this type of description, therefore he/she can make the classification based on his/her expertise and intuition.
3. When the DM has decided to refer a project to a particular class, the decisions are ranked on the dominance basis. This provides the information about other classes of projects related with it by the relationship of dominance. Thus, an indirect classification of all the projects can be made based on a single decision of the DM. A set of projects dominating over a considered project is referred to as a domination cone.
4. A great number of projects have been classified many times. This ensures error-free classification. If the DM makes an error, violating this principle, he/she is shown the conflicting decision on the screen and is prompted to adjust it.
5. In general, a comprehensive classification may be obtained for various numbers of DM decisions and phases in an interactive operation. The efficiency of a multicriteria classification technique is determined based on the number of questions to the DM needed to make the classification. This approach is justified because it takes into consideration the cost of the DM's time and the need for minimizing classification expenses.
Let us consider several most commonly used methods in more detail.

ORCLASS [15]. This method (Ordinal CLASSification) allows us to build a consistent classification, to check the information and to obtain general decision rules. The method relies on the notion of the most informative alternative, allowing a great number of other alternatives to be implicitly assigned to various classes. ORCLASS takes into account the possibilities and limitations of the human information processing system. Method assessment: the main disadvantage of the method is low effectiveness due to the great number of questions to the DM needed for building a comprehensive classification.

CLARA [17]. This method (CLAssification of Real Alternatives) is based on ORCLASS, but is designed to classify a given subset rather than a complete set of alternatives (the Y space). Another common application of CLARA is classification of the full set with a large number of exclusions, i.e. alternatives with impossible combinations of estimates. In both cases CLARA demonstrates high effectiveness.

DIFCLASS [16]. This method was the first to use dynamic construction of chains covering the Y space for selecting questions to the DM. However, the area of DIFCLASS application is restricted to tasks with binary criteria scales and two decision classes.

CYCLE [18, 19]. The CYCLE (Chain Interactive Classification) algorithm overcomes the DIFCLASS restrictions, generalizing the idea of dynamic chain construction to ordinal classification tasks with arbitrary criteria scales and any number of decision classes. A chain here means an ordered sequence of vectors x1, ..., xd, where (xi+1, xi) ∈ P and the vectors xi+1 and xi differ in one of the components. Method assessment: as comparisons demonstrate, the idea of dynamic chain construction allows us to get an algorithm close to optimal in the minimum number of questions to the DM necessary to build a complete classification. The application of ordinal classification demonstrates that problem formalization, as well as the introduction of classes and criteria structuring, allows classification problems to be solved by highly effective methods. The method can be successfully applied to classification of investment projects when the decision classes and the criteria used are thoroughly revised.
3 Verbal Analysis of a Course of Treatment with Medicinal Herbal Products

Classification of projects is one of the multiple criteria problems within the framework of a decision making system [20], [21], [22], [23], [24], [25]. Such problems can be solved by the CYCLE technique developed at the Institute for System Analysis of the Russian Academy of Sciences [18]. This technique allows the classification to be developed in a series of successive steps, checking the conflicting information and arriving at a general method of solution. The method described takes into account the possibilities and limitations of the human data processing system [26].

Method CYCLE

Let us consider the metric ρ(x, y) in the discrete space Y defined as:

ρ(x, y) = Σ_{q=1}^{N} |x_q − y_q| .   (4)

Let us denote by ρ(0, y), i.e. the sum of the vector's components, the index of the vector y ∈ Y (written as ‖y‖). For vectors x, y ∈ Y such that (x, y) ∈ P, let us consider the set:

Λ(x, y) = {v ∈ Y | (x, v) ∈ Q, (v, y) ∈ Q} ,   (5)

that is, the set of vectors weakly dominating y and weakly dominated by x. Having denoted y′ = (1, ..., 1), y′′ = (w_1, ..., w_N), we can see that Λ(y′′, y′) matches the entire space Y. We also introduce the set

L(x, y) = {v ∈ Λ(x, y) | ‖v‖ = ⌊(‖x‖ + ‖y‖)/2⌋} ,   (6)

that is, the vectors from the set Λ(x, y) equidistant from x and y (here and further, division is done without remainder). We will need the numerical functions C_U(x) and C_L(x) defined on Y, which are respectively equal to the highest and lowest class number allowable for x, that is, a class for x not violating the condition of consistency (3). Let us consider a vector x to be classified and belonging to class C_k if the following condition is valid for x:

C_U(x) = C_L(x) = k .   (7)
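The constructions above (the metric ρ, the index ‖y‖, the set Λ(x, y) and the middle layer L(x, y)) can be sketched in a few lines of Python. This is only an illustration under the assumption that vectors are tuples of estimate indices; the function names are not from the paper.

```python
# Illustrative sketch of the constructions just introduced: the metric rho,
# the index ||y||, the set Lambda(x, y) and the "middle layer" L(x, y).
# Vectors are tuples of estimate indices; scales w = (w_1, ..., w_N).
from itertools import product

def rho(x, y):
    return sum(abs(xq - yq) for xq, yq in zip(x, y))

def index(y):                       # ||y||, i.e. rho(0, y) for positive indices
    return sum(y)

def weakly_dominates(x, y):         # relation Q
    return all(xq >= yq for xq, yq in zip(x, y))

def lambda_set(x, y, w):
    """Vectors weakly dominated by x and weakly dominating y."""
    space = product(*(range(1, wq + 1) for wq in w))
    return [v for v in space
            if weakly_dominates(x, v) and weakly_dominates(v, y)]

def middle_layer(x, y, w):
    """L(x, y): vectors of Lambda(x, y) whose index is the (integer) midpoint."""
    mid = (index(x) + index(y)) // 2
    return [v for v in lambda_set(x, y, w) if index(v) == mid]

w = (3, 3)                          # two criteria with 3 verbal estimates each
top, bottom = (3, 3), (1, 1)
print(len(lambda_set(top, bottom, w)))   # 9: the whole space Y
print(middle_layer(top, bottom, w))      # the vectors with index 4
```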
Let us define the procedure S(x) (spreading by dominance). It is assumed that the class of x is known: x ∈ Y_k (that is, C_U(x) = C_L(x) = k). Therefore, for all y ∈ Y such that (x, y) ∈ P and C_U(y) > k, the function C_U(y) is redefined so that C_U(y) = k. Similarly, for all z ∈ Y such that (z, x) ∈ P and C_L(z) < k, the function C_L(z) is redefined so that C_L(z) = k.
Basic mechanism of the CYCLE algorithm. Let us denote by D(a, b) a procedure of classification on the set Λ(a, b) using the idea of dynamic construction of chains linking the vectors a and b. It is assumed that (a, b) ∈ P and the classes of the vectors a and b are known: a ∈ Y_k, b ∈ Y_l. The algorithm is as follows:
1. For each vector x ∈ L(a, b) the steps 2-4 are made.
2. If the class of x is unknown (C_L(x) < C_U(x)), then x is presented to the DM for classification. Suppose that x ∈ Y_r. The spreading by dominance S(x) is done. The condition of consistency (3) is checked.
3. If r > l, then the procedure D(x, b) is performed.
4. If r < k, then the procedure D(a, x) is performed.
In classifying the vector x at the second step, the DM can make a mistake, and a pair of vectors x, y ∈ Y violating the consistency condition (3) appears. The procedure R of resolving contradictions consists in the following. Let us denote the set of vectors explicitly classified by the DM as E. While E contains a pair of vectors violating (3), such a pair is presented to the DM with a proposition to change the class of one or both vectors. Then, the functions C_U and C_L are reset to their initial values and spreading by dominance S(v) is done for each v ∈ E. Generally speaking, the parameters of the algorithm, including the number of questions to the DM, depend on the choice of the vector x in the first step. The following heuristic is proposed: among all not yet classified vectors from the set L(a, b), the object which explicitly dominates (or is dominated by) a maximum number of unclassified vectors is chosen. That is, one chooses the vector

x* = arg max_{x ∈ L(a,b)} |{y ∈ Y | ((x, y) ∈ P or (y, x) ∈ P), ρ(x, y) = 1, C_L(y) < C_U(y)}| .   (8)
At a very high level the CYCLE algorithm is as follows:
1. For each v ∈ Y the possible classes are initialized: C_L(v) = 1, C_U(v) = M.
2. The DM is presented the vectors y′ and y′′, and spreading by dominance S(y′) and S(y′′) is performed.
3. If the classes of y′ and y′′ differ, then the procedure D(y′′, y′) is performed.
Algorithm features.
Statement 1. At the end of the CYCLE algorithm the space Y will be fully classified, that is, ∀y ∈ Y: C_L(y) = C_U(y).
Lemma 1. For any x, y ∈ Y such that (y, x) ∈ P, and for any chain ℜ = x, ..., y, the cardinality of the set ℜ ∩ L(y, x) equals 1.
Statement 2. A classification made with the help of the CYCLE algorithm is consistent, implying that the condition of consistency (3) is satisfied.
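A compact, self-contained sketch of the chain mechanism described above is given below. It replaces the interactive DM with a callable, omits the contradiction-resolution procedure R and the selection heuristic (8), and all identifiers are illustrative rather than taken from the original implementation.

```python
# Simplified sketch of the chain-based classification mechanism described
# above (procedure D(a, b) with spreading by dominance). The interactive DM
# is replaced by a callable `ask_dm`; procedure R and heuristic (8) are omitted.
from itertools import product

W = (3, 3)                                   # criteria scales
M = 3                                        # number of classes
Y = list(product(*(range(1, w + 1) for w in W)))

def dominates(x, y):                         # weak dominance Q
    return all(a >= b for a, b in zip(x, y))

C_L = {v: 1 for v in Y}                      # lowest admissible class
C_U = {v: M for v in Y}                      # highest admissible class

def spread(x, k):                            # procedure S(x): x belongs to class k
    C_L[x] = C_U[x] = k
    for y in Y:
        if dominates(x, y) and y != x and C_U[y] > k:
            C_U[y] = k                       # nothing dominated by x exceeds k
        if dominates(y, x) and y != x and C_L[y] < k:
            C_L[y] = k                       # nothing dominating x is below k

def middle_layer(a, b):
    mid = (sum(a) + sum(b)) // 2
    return [v for v in Y if dominates(a, v) and dominates(v, b) and sum(v) == mid]

def classify(a, b, ask_dm):                  # procedure D(a, b)
    for x in middle_layer(a, b):
        if C_L[x] < C_U[x]:                  # class of x still unknown
            r = ask_dm(x)
            spread(x, r)
            if r > C_L[b]:                   # refine the lower part of the chain
                classify(x, b, ask_dm)
            if r < C_U[a]:                   # refine the upper part of the chain
                classify(a, x, ask_dm)

# A toy "expert": the class grows with the sum of the verbal estimates.
ask_dm = lambda v: min(M, 1 + (sum(v) - len(v)) // 2)
top, bottom = (3, 3), (1, 1)
spread(bottom, ask_dm(bottom))
spread(top, ask_dm(top))
classify(top, bottom, ask_dm)
print({v: (C_L[v], C_U[v]) for v in Y})      # every vector ends up fully classified
```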
4 A Classification of Chosen Medical Preparations and Description of the Main Criteria Used

A method of verbal classification may be used to analyse the treatment of 40 diseases with the suggested medicines (Fig. 1). A list of herbal medicinal products used for treating a considered disease is analysed below.
Fig. 1. General algorithm for choosing medical preparations
After a series of iterations carried out under methodological control of consultants from Public company "Natura Sanat", the following final decision classes were chosen (Fig. 2). All classes are hierarchically structured. They include a list of medicines to be used for treating a particular disease, with the following recommendations given for each medicine usage, depending on a particular patient:
1. HIGHLY RECOMMENDED. This medicine is highly needed for treating the patient.
2. RECOMMENDED. This medicine may be used for treating the patient.
3. NOT RECOMMENDED. This medicine should not be used for treating the patient.
Let us consider the main verbal classification principle as applied to tumour prophylaxis. The hierarchical structure of classes depends on the following criteria: patient's health condition; patient's financial condition; stage of the disease; contraindications. A more detailed description of the above groups is given below.
Fig. 2. The circuit of a verbal estimation of a set of medicines for tumour prophylaxis
1. The criterion "Patient's health condition" includes the following definitions of the patient's condition: good; satisfactory; bad. They depend on some subjective and objective patient characteristics.
2. The criterion "Patient's financial condition" defines the financial state of a patient as: good; satisfactory; bad.
3. The criterion "Stage of the disease" defines the development of the disease.
4. The criterion "Contraindications" defines possible complications which may restrict the use of some medicinal products by a patient.
It is now necessary to classify the medicines chosen by their multicriteria description. The results obtained should be thoroughly validated. It should be noted that for some diseases the criteria may have a two-level hierarchical structure. In this case, a classification is first built at the second level of the efficiency criteria group. The role of classes is determined by general marks made at the first level of the hierarchy. When the classification is made, these general marks are filled with concrete meaning. Then, the classification is made at the second level of the hierarchy. As a result, we get decision rules for determining various combinations of medicines for a particular patient. The DM can determine the effectiveness of a course of treatment on the basis of the available information. It should be noted, that only the criteria of the first hierarchical level may be used. Having difficulties in assessment, the DM can use more detailed data from the second level. The possibility of using the information from the second hierarchical level exists even for some first level criteria.
5 Conclusions

A course of treatment with medicinal herbal products was defined by a verbal analysis based on the classification decision support method CYCLE, allowing risk evaluation according to the specified classes by the suggested criteria of determining medicine usage. As shown by the comparative analysis, the idea of the DM dynamic chain construction allows us to get a nearly optimal algorithm by asking the minimum number of questions needed to build a comprehensive classification. The method was validated by solving actual problems of selecting the best alternatives of the available courses of treatment for particular patients.
References 1. Richardson, M.A.: Research on Complementary/Alternative Medicine Therapies in Oncology: Promising but Challenging. J. Clin. Oncol. 17 (11 suppl.) (1999) 38–43 2. Wagner H. Phytomedicine Research in Germany.Environ. Health Perspect.1999,107:779– 781 3. Matthews, H., Lucier, G., Fisher, K.: Medicinal Herbs in the United States: Research Needs. Environ. Health Perspect. 107 (1999) 733–778 4. Cassileth, B.: Complementary and Alternative Cancer Medicine. J. Clin. Oncol. 17 (11 suppl.) (1999) 44–52. 5. Artiges, A.: Pharmacopoeial Standards for Herbal Medicinal Products in Europe. Eur. Phytojournal 1 (2000) 28–29. 6. Keller, K.: EMEA Ad Hoc Working Group on Herbal Medicinal Products. Eur. Phytojournal 2000, 1: 9–16 7. Blumenthal, M (ed.): WHO Medicinal Plants Monographs, Vol. 38. HerbalGram (1997) 8. British Herbal Pharmacopoeia, 1989, Bournamouth, British Herbal Medicine Association, 255 9. Blumenthal, M (ed.): American Herbal Pharmacopoeia, Vol. 40. HerbalGram (1997)
10. Desmarchelier, C., Witting, Schaus F., Coussio, J., Cicca, G.: Effects of Sangre de Drago from Croton lechleri Muell.-Arg. On the Production of Active Oxygen Radicals. J. Ethnopharmacol. 58 (1997) 103-108 11. Ho, K.Y., Huang, J.S., Tsai, C.C., Lin, T.C., Hsu, Y.F., Lin, C.C.: Antioxidant Activity of Tannin Components from Vaccinium vitis-idaea L. J. Pharm. Pharmacol. 51 (1999) 10751078 12. Arnone, A., Nasini, G., Vajna de Pava, O. Merlini, L.: Constituents of Dragon's Blood. 5. Dracoflavans B1, B2, C1, C2, D1, and D2, new A-type deoxyproanthocyanidins. J. Nat. Prod. 60 (1997) 971-976 13. Quiros, C., Epperson, A., Hu, J., Holle, M.: Physiological Studies and Determination of Chromosome Number in Maca, Lepidium meyenii. Economic Botany 50 (1996) 216-223 14. Johns, T.: The Anu and the Maca. Journal of Ethnobiology 1 (1981) 208-211 15. Larichev, О., Mechitov, А., Moshovich, Е., Furems, Е.: Revealing of Expert Knowledge, Publishing House Nauka, Moscow (1989) (in Russian) 16. Larichev, О., Bolotov, А.: System DIFKLASS: Construction of Full and Consistent Bases of Expert Knowledge in Problems of Differential Classification. The Scientific and Technical Information, a series 2, Informacionie Procesi i Sistemi (Information Processes and Systems), 9 (1996) (in Russian). 17. Larichev, O., Kochin, D., Kortnev, A.: Decision Support System for Classification of a Finite Set of Multicriteria Alternatives. Decision Support Systems. 33 (2002) 13-21 18. Aksanov, A., Borisenkov, P., Larichev, O., Nariznij, E., Rozejnzon, G.: Method of Multicriteria Classification CYCLE and its Application for the Analysis of Credit Risk. Ekonomika i Matematiceskije Metodi(Economy and Mathematical Methods).37(2001)1421(in Russian) 19. Larichev, O., Leonov, A.: Method CYCLE of Serial Classification of Multicriteria Alternatives. The Report of the Russian Academy of Sciences, December (2000) (in Russian) 20. Larichev, O., Moshkovich, H.: An Approach to Ordinal Classification Problems. International Transaction in Operational Research. 1 (1994) 375-386 21. Larichev, O., Moshkovich, H.: Qualitative Methods of Decision Making. Nauka Fizmatlit, Moscow. (1996) (in Russian) 22. de Montgolfier, J., Bertier, P.: Approche Multicritere des Problemes de Decision. Editions Hommes et Techniques, Paris (1978) (in French) 23. Greco, S., Matarazzo, B., Slowinski, R.: Rough Sets Methodology for Sorting Problems in Presence of Multiple Attributes and Criteria. European Journal of Operational Research 138 (2002) 247-259 24. Peldschus, F., Messing, D., Zavadskas, E.K., Ustinovičius L.: LEVI 3.0 – Multiple Criteria Evaluation Program. Journal of Civil Engineering and Management, Vol 8. Technika, Vilnius (2002) 184-191 25. Ustinovičius, L., Jakučionis, S.: Multi-Criteria Analysis of the Variants of the Old Town Building Renovation in the Marketing. Statyba (Journal of Civil Engineering and Management), Vol. 6, Technika, Vilnius (2000) 212 – 222 26. Solso, P.: Cognitive Psychology. Trivola, Moscow (1996) (in Russian).
Interactive Knowledge Validation in CBR for Decision Support in Medicine

Monica H. Ou1, Geoff A.W. West1, Mihai Lazarescu1, and Chris Clay2

1 Department of Computing, Curtin University of Technology, GPO Box U1987, Perth 6845, Western Australia, Australia, {ou, geoff, lazaresc}@cs.curtin.edu.au
2 Royal Perth Hospital, Perth, Western Australia, Australia, [email protected]
Abstract. In most case-based reasoning (CBR) systems there has been little research done on validating new knowledge, specifically on how previous knowledge differs from current knowledge by means of conceptual change. This paper proposes a technique that enables the domain expert who is non-expert in artificial intelligence (AI) to interactively supervise the knowledge validation process in a CBR system. The technique is based on formal concept analysis which involves a graphical representation and comparison of the concepts, and a summary description highlighting the conceptual differences. We propose a dissimilarity metric for measuring the degree of variation between the previous and current concepts when a new case is added to the knowledge base. The developed technique has been evaluated by a dermatology consultant, and has shown to be useful for discovering ambiguous cases and keeping the database consistent.
1 Introduction
Case-based reasoning (CBR) is an artificial intelligence (AI) technique that attempts to solve new problems by using the solutions applied to similar cases in the past. Reasoning by using past cases is a popular approach that has been applied to various domains, with most of the research having been focused on the classification aspect of the system. In medical applications, CBR has been used with considerable success for patient diagnosis, such as in PROTOS and CASEY [9]. However, relatively little effort has been put into investigating how new knowledge can be validated. We have developed a web-based CBR system that provides decision support to general practitioners (GPs) for diagnosing patients with dermatological problems. This paper mainly concentrates on developing a tool to automatically assist a dermatology consultant to train and validate the knowledge in the CBR system. Note that the consultants and GPs are non-computing experts. Knowledge validation continues to be problematic in knowledge-based and case-based systems due to the modelling nature of the task. In dealing with the problem it is often desired to design a CBR system that disallows automatic updates, especially in medical applications where lives could be threatened by incorrect decisions. Normally CBR systems are allowed to learn by themselves in which the user enters the new
case, compares it with those in the knowledge base and, once satisfied, adds the case to the database. In our method, the consultant needs to interactively supervise the CBR system in which the valid cases are determined by the validation tool and chosen by the consultant. The inconsistent or ambiguous cases can then be visualised and handled (modified or rejected) by the consultant. The reason for human supervision is to ensure that the decisions and learning are correct, and to prevent contradictory cases from being involved in the classification process. Adding contradictory cases into the knowledge base will negatively affect classification accuracy and data interpretation. Therefore, it is vital to constantly check and maintain the quality of the data in the database as new cases get added. The main aspects that need to be addressed are: 1. How to enable non-computing experts to easily and efficiently interact with the CBR system. 2. How to provide a simple but effective mechanism for incrementally validating the knowledge base. 3. How to provide a way of measuring the conceptual variation between the previous and new knowledge. This paper proposes a new approach for validating the consistency of the new acquired knowledge against the past knowledge using a decision tree classifier [6], Formal Concept Analysis (FCA), and a dissimilarity metric. FCA is a mathematical theory which formulates the conceptual structure and displays relations between concepts in the form of concept lattices which comprise attributes and objects [3, 4]. A concept in FCA is seen as a set of objects that share a certain set of attributes. A lattice is a directed acyclic graph in which each node can have more than one parent, and the convention is that the superconcept always appears above all of its subconcepts. In addition, we derive a dissimilarity metric for quantifying the level of conceptual changes between the previous and current knowledge.
2 Related Work
FCA has been used as a method for knowledge representation and retrieval [1, 7], conceptual clustering [3], and as a support technique for CBR [2]. In [8] ripple-down rules (RDR) are combined with FCA to uncover an abstraction hierarchy of concepts and the relationship between them. We use FCA and a dissimilarity measure approach to assist the knowledge validation process and highlight the conceptual differences between previous and current knowledge. Conceptual graph comparison has been widely studied in the area of information retrieval [5, 11, 12]. The comparison is categorised into two main parts: syntactic and semantic. Syntactic comparison uses a graphical representation, and determines their similarity by the amount of common structures shared [5, 11]. Semantic comparison measures whether the two concepts are similar to each other based on whether one concept representation can generalise the other [11, 12]. However, these techniques are not suitable for users that are non-experts in AI due to their complexity.
3 Interactive Knowledge Validation Approach in CBR
3.1 An Overview of the Knowledge Validation Process
Consider GPs using the CBR system to help in diagnosis. They generate new cases using the decision support system which are stored in the database and marked as "unchecked". This means the cases need to be checked by the consultant before deciding whether or not to add them to the knowledge base. FCA is used for checking the validity of the new cases to minimise the validation time, as the consultant is not required to manually check the rules generated by the decision tree. This is important because manual rule inspection by the human user quickly becomes unmanageable as the rule database grows in size. In most cases, if the consultant disagrees with the conclusion, the correct conclusion needs to be specified and some features in the case selected to justify the new conclusion, or the instances are stored in a repository for later use. Figure 1 presents the process of knowledge validation. It involves using J48 [10], the Weka1 implementation of the C4.5 decision tree algorithm [6] for inducing the rules, and the Galicia2 implementation for generating lattices. The attribute-value pairs of the rules are automatically extracted and represented as a context table which shows relations between the features and the diagnoses. The context table gets converted to a lattice for easy visualisation via the hierarchical structure of the lattices. As each new case gets added, the context table gets updated and a new lattice is generated. If adding a checked case will drastically change the lattice, the consultant is alerted and asked to confirm that the new case is truly valid given its effect on the lattice. There are three different options available to assist the consultant in performing knowledge validation:
1. A graphical representation of the lattices that enables the consultant to graphically visualise the conceptual differences.
2. A summary description highlighting the conceptual differences.
3. A measure which determines the degree of variation between lattices.
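As a rough illustration of the step in which decision-tree rules are turned into a context table, the following sketch flattens a few hand-written rules (each a set of attribute-value conditions with a diagnosis as conclusion) into a binary table of the kind shown later in Tables 1 and 2. The rule format and names are assumptions of this sketch, not the system's actual data structures.

```python
# Illustrative sketch (not the actual system code): turning classification
# rules, each a set of attribute-value conditions with a diagnosis as the
# conclusion, into a binary context table relating diagnoses to characteristics.

rules = [
    ({"age<=29": True, "gender=M": True}, "Eczema"),
    ({"gender=F": True}, "Psoriasis"),
    ({"age>29": True, "gender=M": True}, "PityriasisRubraPilaris"),
]

def build_context(rules):
    attributes = sorted({a for conds, _ in rules for a in conds})
    context = {}
    for conds, diagnosis in rules:
        row = context.setdefault(diagnosis, {a: False for a in attributes})
        for a in conds:
            row[a] = True
    return attributes, context

attributes, context = build_context(rules)
print("Diagnosis".ljust(24) + " ".join(a.ljust(10) for a in attributes))
for diagnosis, row in context.items():
    marks = " ".join(("X" if row[a] else "").ljust(10) for a in attributes)
    print(diagnosis.ljust(24) + marks)
```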
3.2 Dermatology Dataset
The dataset we use in our experiments consists of patient records for the diagnosis of dermatological problems. The dataset is provided by a consultant dermatologist. It contains patient details, symptoms and, importantly, the consultant's diagnoses. Each patient is given an identification number, and episode numbers are used for multiple consultations for the same patient. Currently, the data has 17 general attributes and consists of cases describing 32 different diagnoses. Data collection is a continuous process in which new cases get added to the database. Of interest to the consultant is how each new case will affect the knowledge base. New cases are collected in two ways: 1) the diagnosed cases provided by the GP, and 2) cases provided by the consultant. Before the consultant validates the new cases, they are marked as "unchecked" and combined with cases already in the database (previously "checked") for training and generating lattices.
1 www.cs.waikato.ac.nz/ml/weka [Last accessed: October 22, 2004].
2 www.iro.umontreal.ca/~galicia/index.html [Last accessed: September 10, 2004].
Fig. 1. Knowledge base validation
“checked”) for training and generating lattices. If the lattices do not show any ambiguity this means the new cases are in fact valid and they will be updated to “checked” by the dermatologist. 3.3
3.3 Formal Context and Concept Lattices
FCA enables a lattice to be built automatically from a context table. The table consists of attribute-value pairs that are generated by the decision tree algorithm that automatically partitions the continuous attribute values into discrete values. This is because context tables need to use discrete or binary formats to represent the relations between attributes and objects. The formal context (C) is defined as follows: Definition: A formal context C = (D, A, I), where D is a set of diagnoses, A is a set of characteristics, and I is a binary relation which indicates diagnosis d has characteristic a. Tables 1 and 2 show simple examples of formal contexts for our system using only a small number of cases for simplicity. Table 1 is derived from previously checked data which consists of 6 cases, and Table 2 is generated when a new unchecked case is added i.e. consists of 7 cases. The objects on each row represent the diagnosis D, each column represents a set of characteristics A, and the relation I is marked by “X”. Note, only attributes that are relevant to a context are shown (i.e. allows discrimination), and that Table 2 shows that new attributes became relevant while others dropped. For the new unchecked case, the context table may or may not be affected to reflect the changes in characteristics used for describing the diagnoses. As can be seen from the tables, some characteristics have been introduced while some have been removed when a new
Table 1. Context table of previous data (consists of 6 cases)

Diagnoses (Concept Types) | age≤29 | age>29 | gender=M | gender=F
Eczema                    | X      |        | X        |
Psoriasis                 |        |        |          | X
PityriasisRubraPilaris    |        | X      | X        |
Table 2. Context table of current data (consists of 7 cases)

Diagnoses (Concept Types) | gender=M | gender=F | lesion status=1 | lesion status=2 | lesion status=3 | lesion status=4
Eczema                    | X        |          |                 | X               | X               | X
Psoriasis                 |          | X        |                 |                 |                 |
PityriasisRubraPilaris    | X        |          | X               |                 |                 |
case is added due to the decision tree partitioning mechanism. In Table 1, age≤29, age>29, gender=M, and gender=F are considered to be important features for classifying Eczema, Psoriasis, and PityriasisRubraPilaris. However, in Table 2, age≤29 and age>29 are no longer important in disambiguating the diagnoses but lesion status=1, lesion status=2, lesion status=3, and lesion status=4 are, and the characteristics gender=M and gender=F remain unchanged. Conceptual changes are determined by comparing the relationships between the characteristics and the diagnoses of the current and previous context tables. However, the comparison can often be done more effectively using a lattice representation that formulates the hierarchical structure of the concept grouping. Figures 2 and 3 are built from Tables 1 and 2, respectively.
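The derivation operators of FCA and the enumeration of the formal concepts of a small context can be sketched as follows, using the Table 1 data; this is a brute-force illustration and is not tied to the Galicia implementation used by the system.

```python
# Illustrative sketch of FCA on a small context (the data of Table 1):
# the derivation operators and a brute-force enumeration of all formal concepts.
from itertools import chain, combinations

context = {
    "Eczema":                 {"age<=29", "gender=M"},
    "Psoriasis":              {"gender=F"},
    "PityriasisRubraPilaris": {"age>29", "gender=M"},
}
objects = set(context)
attributes = set().union(*context.values())

def common_attributes(objs):          # A': attributes shared by all objects in objs
    return set.intersection(*(context[o] for o in objs)) if objs else set(attributes)

def common_objects(attrs):            # B': objects having all attributes in attrs
    return {o for o in objects if attrs <= context[o]}

def concepts():
    """All pairs (extent, intent) such that extent' = intent and intent' = extent."""
    found = []
    for objs in chain.from_iterable(combinations(objects, r)
                                    for r in range(len(objects) + 1)):
        extent = set(objs)
        intent = common_attributes(extent)
        if common_objects(intent) == extent:
            found.append((extent, intent))
    return found

for extent, intent in concepts():     # yields the six concepts of Figure 2
    print(sorted(extent), sorted(intent))
```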
Fig. 2. Previous graph Gt−1 (6 cases). Concept nodes: 0: I={}, E={Eczema, Psoriasis, PityriasisRubraPilaris}; 1: I={gender=F}, E={Psoriasis}; 2: I={gender=M}, E={Eczema, PityriasisRubraPilaris}; 3: I={age<=29, gender=M}, E={Eczema}; 4: I={age>29, gender=M}, E={PityriasisRubraPilaris}; 5: I={age<=29, age>29, gender=F, gender=M}, E={}

Fig. 3. Current graph Gt (7 cases). Concept nodes: 0: I={}, E={Eczema, Psoriasis, PityriasisRubraPilaris}; 1: I={gender=F}, E={Psoriasis}; 2: I={gender=M}, E={Eczema, PityriasisRubraPilaris}; 3: I={gender=M, lesion_status=2, lesion_status=3, lesion_status=4}, E={Eczema}; 4: I={gender=M, lesion_status=1}, E={PityriasisRubraPilaris}; 5: I={gender=F, gender=M, lesion_status=1, lesion_status=2, lesion_status=3, lesion_status=4}, E={}
Given Table 1, the system generates a lattice shown in Figure 2. When a new case is added, a new lattice is generated. If we compare the previous and current lattices, we expect the lattices to be the same or with minor variations if changes do occur. If the concept changes significantly, the consultant needs to check if the new case is in
fact valid. For each invalid case, the consultant is required to change the diagnosis or the characteristics to satisfy the new case or store it in a repository. The validation task is vitally important to prevent contradictory cases from being used in the classification process as these could lead to inaccurate diagnoses.
4 Conceptual Dissimilarity Measure

We derive a dissimilarity metric for determining the level of changes between the two lattices. Emphasis is put on illustrating the changes in the characteristics of the diagnoses. The measure is particularly effective when the lattices become too large to be manually visualised by the consultant. Therefore, it is desirable to have a measure that indicates the level of changes between the current and previous lattices, and to use it as a quick representation of the changes.

4.1 Highlighting Conceptual Changes Between Concept Lattices
We propose an algorithm for determining the conceptual differences between the two lattices: Gt−1 and Gt , where t − 1 and t represent lattices derived from previous and current cases, respectively. The algorithm determines which concepts are missing from Gt−1 and which have been added to Gt . Missing concepts are determined by those present in Gt−1 but not in Gt . Whereas, added concepts are those not present in Gt−1 but present in Gt . To highlight the conceptual changes, the system displays the information such as the characteristics and diagnoses that have been ignored or added to the classification process. Based on this information the consultant does not need to manually analyse the lattice which could be time consuming and error-prone. Note in the rest of this paper the terms lattice and graph are the same. To highlight changes, we use the concepts extent and intent. A is the extent of the concept (A, B) which comprises all the attributes (characteristics) that are valid for all the objects (diagnoses); B is the intent of the concept (A, B) which covers all the objects that are belonging to that particular concept [4]. We define the number of concepts that have the same intents in Gt−1 and Gt as |B(t − 1)concept | ∩ |B(t)concept |. The number of missing concepts n(Gt−1 ) and added concepts n(Gt ) are defined as: n(Gt−1 ) = |B(t − 1)concept | \ |B(t − 1)concept | ∩ |B(t)concept |
(1)
n(Gt ) = |B(t)concept | \ |B(t − 1)concept | ∩ |B(t)concept |
(2)
where B(t − 1) represents B in the previous graph Gt−1 , and B(t) represents B in the current graph Gt . However, if the results for both n(Gt−1 ) and n(Gt ) are equal to zero, this means the two graphs contain exactly the same concepts and implies that the new added cases are consistent with the previous cases in the database. In the worse case, the values of n(Gt−1 ) and n(Gt ) can be equal to the number of concepts in Gt−1 and Gt , respectively. This implies a radical change between the two graphs and it means that the new added cases may be contradictory and need to be revised by the consultant. The values of n(Gt−1 ) and n(Gt ) provide a rough estimate of the changes in concepts between the two lattices. The locations at which the changes occur are highlighted in different colour for quick reference to the nodes in the lattice, and provide a useful
indication of the number of concepts being affected. The closer the changes are to the root of the lattice, the more concepts get affected. The missing and added nodes in Algorithm 1 are used as an entry point for the matching to prevent unnecessary processing of every node in the lattice. Algorithm 2 selects the lowest child node in the lattice that contains a certain extent element. The reason is that the lowest child node provides the most detailed description compared to those that are higher up in the hierarchy. In a lattice, lower nodes convey more meaning than those above them.

Algorithm 1: Conceptual differences between concept lattices
input: missing/added node i, Gt−1, and Gt.

foreach i in Gt−1 do
    Find the parent node p of the node i
    foreach p do
        extentOfParent ← getExtent(Gt−1, p)
        Iterate to get each extent element e in extentOfParent
        while there are more e do
            node ← getLowestChild(p, extentOfParent)
            if node is equal to null then
                node ← node i
                Get and store the intent of node as intentG1
            else
                Get and store the intent of node as intentG1
            Find matched node m in Gt based on the intent of p
            if m is not equal to null then
                extentOfMatchedParent ← getExtent(Gt, m)
                node ← getLowestChild(m, extentOfMatchedParent)
                if node is equal to null then
                    node ← matched node m
                    Get and store the intent of node as intentG2
                else
                    Get and store the intent of node as intentG2
                Highlight intent element differences between intentG1 and intentG2
            else
                Get top node tn
                extentOfTopNode ← getExtent(Gt, tn)
                node ← getLowestChild(tn, extentOfTopNode)
                if node is equal to null then
                    node ← top node tn
                    Get and store the intent of node as intentG2
                else
                    Get and store the intent of node as intentG2
                Highlight intent element differences between intentG1 and intentG2
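The set comparison behind Eqs. (1) and (2), which supplies the missing and added nodes used as entry points for Algorithm 1, can be illustrated as follows; concepts are represented here simply by their intents, which is an assumption of the sketch.

```python
# Illustrative sketch of Eqs. (1) and (2): concepts are compared by their
# intents, represented here simply as frozensets of characteristics.

def missing_and_added(intents_prev, intents_curr):
    shared = intents_prev & intents_curr
    missing = intents_prev - shared        # the n(G_{t-1}) missing concepts
    added = intents_curr - shared          # the n(G_t) added concepts
    return missing, added

# Intents of the non-trivial concepts of Figures 2 and 3:
G_prev = {frozenset({"gender=F"}),
          frozenset({"gender=M"}),
          frozenset({"age<=29", "gender=M"}),
          frozenset({"age>29", "gender=M"})}
G_curr = {frozenset({"gender=F"}),
          frozenset({"gender=M"}),
          frozenset({"gender=M", "lesion_status=2", "lesion_status=3", "lesion_status=4"}),
          frozenset({"gender=M", "lesion_status=1"})}

missing, added = missing_and_added(G_prev, G_curr)
print(len(missing), len(added))            # 2 missing concepts, 2 added concepts
for m in missing:
    print("missing:", sorted(m))
for a in added:
    print("added:  ", sorted(a))
```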
Algorithm 2: getLowestChild
input: Node n and extent ext.
return: lowest node.

Node lowest = null
Get children node of n
foreach children of node n do
    Get the children that contain extent element ext
    if lowest child with extent element ext is found then
        return lowest
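A runnable version of the idea behind Algorithm 2 might look as follows; it assumes a simple tree-like Node class with explicit children, whereas a concept lattice node may in general have several parents.

```python
# Illustrative sketch of the getLowestChild idea (Algorithm 2): descend from a
# node to the deepest descendant whose extent still contains the given element.
# The Node class and its fields are assumptions of this sketch.

class Node:
    def __init__(self, intent, extent, children=()):
        self.intent = set(intent)
        self.extent = set(extent)
        self.children = list(children)

def get_lowest_child(node, element):
    """Return the deepest descendant of `node` whose extent contains `element`,
    or None if no child contains it."""
    lowest = None
    for child in node.children:
        if element in child.extent:
            deeper = get_lowest_child(child, element)
            lowest = deeper if deeper is not None else child
    return lowest

# Part of the current lattice (Figure 3): node 2 and its children 3 and 4.
node3 = Node({"gender=M", "lesion_status=2", "lesion_status=3", "lesion_status=4"}, {"Eczema"})
node4 = Node({"gender=M", "lesion_status=1"}, {"PityriasisRubraPilaris"})
node2 = Node({"gender=M"}, {"Eczema", "PityriasisRubraPilaris"}, [node3, node4])

lowest = get_lowest_child(node2, "Eczema")
print(sorted(lowest.intent))   # the most detailed intent describing Eczema in G_t
```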
For illustration purposes, consider the two simple examples in Figures 2 and 3. In Figure 3, suppose nodes 3 and 4 are the added nodes. If node 3 is selected for processing, then the parent node would be 2. At the parent node, the extent element Eczema is selected and stored. Iterate down to the lowest child node that contains Eczema which is node 3 with the intent elements of gender=M, lesion status=2, lesion status=3 and
lesion status=4. Then perform the matching between the two graphs. The results show that node 2 of Gt−1 matches node 2 of Gt. Based on the stored extent element Eczema in Gt−1, iterate down to the lowest child node that contains Eczema which is node 3 with the intent elements of age≤29 and gender=M. Compare and highlight the differences between the current and previous intent elements. In this case, the characteristics lesion status=2, lesion status=3 and lesion status=4 have been introduced in Gt to give the diagnosis Eczema a more detailed description, but the characteristic age≤29 has become insignificant. Repeat the same steps for node 4. The results show that in the current graph the characteristic lesion status=1 has become important for classifying PityriasisRubraPilaris, whereas the characteristic age>29 in Gt−1 has become insignificant. The characteristic for describing the diagnosis Psoriasis remains unchanged in both Gt−1 and Gt.

4.2 Dissimilarity Metric
The measure of conceptual variation between graphs Gt−1 and Gt when a new case is added is based on the following factors:
1. The number of diagnoses that have been affected.
2. The level at which the concept variations occur in the lattice.
3. The ratio of the characteristics of each diagnosis that have been affected.

Let c(G_j) be the number of diagnoses affected after node i is removed/added to G_j, where i = 1, 2, 3, ..., n, and j = t − 1, t; n is the total number of nodes removed/added from the graphs, which is calculated using Equation 1 or 2 depending on j; C(G_j) is the total number of diagnoses in G_j; a(G_ji) is the number of features that exist in node i of Gt−1 but not in Gt, and vice versa; A(G_ji) is the total number of features in node i; md(G_j) is the maximum depth of G_j; and d(G_ji) is the depth of node i from the top of the graph to the current position of the node. The variation measure v(G_j) for each graph is:

v(G_j) = v_1(G_j)  if md(G_j) = 1 and d(G_ji) = 0,
         v_2(G_j)  if md(G_j) = 1 and d(G_ji) ≠ 0,   (3)
         v_3(G_j)  if md(G_j) > 1,

where v(G_j) = {v_1(G_j) | v_2(G_j) | v_3(G_j)}, and v_1(G_j), v_2(G_j) and v_3(G_j) are defined as:

v_1(G_j) = Σ_{i=1}^{n} (c(G_j)/C(G_j)) · (a(G_ji)/A(G_ji)) · (md(G_j) − d(G_ji)) / (md(G_j) · (n + 1))

v_2(G_j) = Σ_{i=1}^{n} (c(G_j)/C(G_j)) · (a(G_ji)/A(G_ji)) · 1/(n + 1)

v_3(G_j) = Σ_{i=1}^{n} (c(G_j)/C(G_j)) · (a(G_ji)/A(G_ji)) · (md(G_j) − d(G_ji)) / (md(G_j) · n)

The total dissimilarity measure d(G) between Gt−1 and Gt is then defined as:

d(G) = v(Gt−1) + v(Gt) .   (4)
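The following sketch evaluates the v_3 branch of the measure (the one that applies when md(G_j) > 1), summing over the n affected nodes, and reproduces the worked example given below (v(Gt−1) = 0.111, v(Gt) = 0.139, d(G) = 0.25). The function name and argument layout are illustrative.

```python
# Illustrative sketch of the dissimilarity measure for the md(Gj) > 1 case
# (the v3 branch), summing over the n affected nodes; it reproduces the worked
# example in the text: v(G_{t-1}) = 0.111, v(G_t) = 0.139, d(G) = 0.25.

def v3(c, C, ratios_and_depths, md):
    """ratios_and_depths: list of (a_i/A_i, d_i) for the n affected nodes."""
    n = len(ratios_and_depths)
    return sum((c / C) * ratio * (md - d) / (md * n)
               for ratio, d in ratios_and_depths)

v_prev = v3(c=2, C=3, ratios_and_depths=[(0.5, 2), (0.5, 2)], md=3)
v_curr = v3(c=2, C=3, ratios_and_depths=[(0.75, 2), (0.5, 2)], md=3)
d_G = v_prev + v_curr
print(round(v_prev, 3), round(v_curr, 3), round(d_G, 2))   # 0.111 0.139 0.25
```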
The dissimilarity measure between the two graphs falls into the range of [0, 1]. The lower the value of d(G) the higher the consistency since there is less variation between the two graphs. If d(G) = 0, the new cases are consistent with the previous cases. In the worse case, the value of d(G) is close to or equal to 1, implying the new added cases contradict the previous cases. This might be because there are not enough cases to disambiguate some diagnoses, or the diagnoses are classified using different sets of characteristics. We use the examples in Figures 2 and 3 to illustrate how the dissimilarity measure is calculated. Based on Gt−1 , v(Gt−1 ) = 0.111 with c(Gt−1 ) = 2; C(Gt−1 ) = 3; a(Gt−1(1) )/A(Gt−1(1) ) = 0.5 and a(Gt−1(2) )/A(Gt−1(2) ) = 0.5; md(Gt−1 ) = 3; and d(Gt−1(1) ) = 2 and d(Gt−1(2) ) = 2. Graph Gt has v(Gt ) = 0.139 with c(Gt ) = 2; C(Gt ) = 3; a(Gt(1) )/A(Gt(1) ) = 0.75 and a(Gt(2) )/A(Gt(2) ) = 0.5; md(Gt ) = 3; and d(Gt(1) ) = 2 and d(Gt(2) ) = 2. Thus, applying Equation 4 we get d(G) = 0.25 which implies significant change according to the classifications that are defined in Table 3. 0.5
Table 3. Dissimilarity values categorization

Rank  Category            Dissimilarity values
1     No change           0.00 - 0.10
2     Slight change       0.10 - 0.15
3     Significant change  0.15 - 0.50
4     Major change        0.50 - 0.90
5     Radical change      0.90 - 1.00

Fig. 4. Dissimilarity values vs. number of new cases (dissimilarity value on the y-axis, 0-0.5; number of cases on the x-axis, 0-50; the curves obtained after removing and after adding the anomalous case are marked)
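The thresholds of Table 3 amount to a simple lookup. A minimal sketch (purely illustrative; the band boundaries are those of the table):

```python
def categorize(d_G):
    """Map a dissimilarity value d(G) in [0, 1] to the qualitative ranks of Table 3."""
    bands = [(0.10, "No change"), (0.15, "Slight change"),
             (0.50, "Significant change"), (0.90, "Major change"),
             (1.00, "Radical change")]
    for upper, label in bands:
        if d_G <= upper:
            return label
    return "Radical change"

print(categorize(0.25))   # -> "Significant change", as in the example above
```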
4.3 Results
Discussion with the consultant revealed the effectiveness of the proposed techniques. First, the consultant recommended the grouping of the dissimilarity values into 5 different categories shown in Table 3. The categories provide a qualitative measure of the changes ranked from “No change” to “Radical change”. The value of less than 0.10 is considered as “No change”, since the variations between the two graphs are not significant enough to have a major effect on the results. The evaluation process involved analysing how the quality of the existing cases in the database is affected by the update process. It is important to note that the decision tree classifier built from the cases gives 100% correct classification and hence the lattices reflect perfect classification. First, we considered the case where no consultant interaction occurred and allowed the CBR to incrementally add new cases. Figure 4
shows the dissimilarity values as each new case is added. As can be seen, there are significant and minor variations between the previous and current concepts as we incrementally update the database (indicated by the variations in the dissimilarity values). This is expected, since the decision tree classifier repartitions the attributes to get the best classification, causing the context tables and hence the lattices to change. In general, the dissimilarity values decrease as the number of cases used for training increases. The decreasing trend shows that the consistency of the CBR system increases, which means the classifier is becoming more stable and generalising better. This is shown by the linear regression line, which indicates a steady decrease as new cases are added. To check the consistency of the results, the cases are randomly added to the lattice one by one. After a number of trials, the results are similar to those shown in Figure 4, with decreasing regression lines. Rerunning the updating process with the consultant present reveals that the high dissimilarity values in Figure 4 closely match the consultant's perceptions of the new cases as they are added. For the two large peaks at case 4 and case 12 in Figure 4, the consultant examined the inconsistency, observed some anomalies in case 12, and modified the features accordingly, leading to lower ambiguity (represented by the dotted line). Many of the anomalies arise from the GPs' and consultants' interpretation of the symptoms, which can be quite subjective; checking the cases reduces these anomalies³. Case 4, however, is considered to be valid and no modification has been made. The increase in the dissimilarity value at case 4 is due to the repartitioning of existing attributes plus some new ones selected for describing the recurring diagnosis Psoriasis (i.e., one that already existed in the database). To further illustrate this, we randomly chose case number 30, which has a dissimilarity value of zero, and modified its characteristics to determine whether the value increases, indicating that the modified case is no longer valid. The result shows a significant increase in the dissimilarity value (represented by the dashed line), which suggests that the case is no longer valid or consistent with the other cases in the database; this also leads to a slight decrease in the accuracy of the decision tree classifier.
5 Conclusions
This paper presents a technique that enables the domain expert to interactively supervise and validate new knowledge. The technique uses concept lattices to graphically present the conceptual differences, a summary description highlighting the conceptual variations, and a dissimilarity metric to quantify the level of variation between the current and previous cases in the database. The trials involved having the consultant interactively revise the anomalous cases by modifying the characteristics and the diagnoses that are likely to cause the ambiguity. The results show that the dissimilarities correspond to a number of actual processes: first, cases where the lattices change significantly even though the decision tree classifier still gives 100% correct classification as each new diagnosis is added; second, cases where there is a problem in the data for a new case, requiring consultant interaction.
³ Note this can also deal with other factors such as errors in data entry.
Acknowledgements. The research reported in this paper has been funded in full by a grant from the AHMAC/SCRIF initiative administered by the NHMRC in Australia.
Adaptation and Medical Case-Based Reasoning Focusing on Endocrine Therapy Support

Rainer Schmidt¹ and Olga Vorobieva¹,²

¹ Institute for Medical Informatics and Biometry, University of Rostock, D-18055 Rostock, Germany
[email protected]
² Setchenow Institute of Evolutionary Physiology and Biochemistry, St. Petersburg, Russia
Abstract. So far, Case-Based Reasoning has not become as successful in medicine as in some other application domains. One reason, probably the main one, is the adaptation problem. In Case-Based Reasoning the adaptation task is still domain dependent and usually requires specific adaptation rules. Furthermore, in medicine adaptation is often more difficult than in other domains, because usually more, and more complex, features have to be considered. We have developed some programs for endocrine therapy support, especially for hypothyroidism. In this paper, we do not present them in detail, but focus on adaptation. We do not only summarise experiences with adaptation in medicine, but also elaborate typical medical adaptation problems and indicate possibilities for solving them.
1 Introduction

Case-Based Reasoning (CBR) has become a successful technique for knowledge-based systems in many domains, while in medicine more problems arise in applying this method. A CBR system has to solve two main tasks [1]. The first one is retrieval, which is the search for, or the calculation of, similar cases. For this task much research has been undertaken. The basic retrieval algorithms for indexing [2], Nearest Neighbor matching [3], pre-classification [4] etc. were developed some years ago and have been improved in recent years. So it has become correspondingly easy to find sophisticated CBR retrieval algorithms adequate for nearly every sort of application problem. The second task, adaptation, means modifying the solution of a former similar case to fit the current problem. If there are no important differences between the current and the similar case, a solution transfer is sufficient. Sometimes just a few substitutions are required, but usually adaptation is a complicated process. While in the early 1990s the focus of the CBR community lay on retrieval, in the late 1990s CBR researchers investigated various aspects of adaptation. Though theories and models for adaptation [e.g. 5, 6] have been developed, adaptation is still domain dependent. Usually, specific adaptation rules have to be generated for each application. Since adaptation is even more difficult in medicine, we want to elaborate typical medical adaptation problems and show possibilities for solving them.
2 Medical Case-Based Reasoning Systems

Though CBR has so far not been as successful in medicine as in some other domains, several medical systems have already been developed which at least apply parts of the Case-Based Reasoning method. Here we do not want to give a review of all these systems (for that see [7, 8, 9]), but we intend to show the main developments concerning adaptation in this area.

2.1 Avoiding the Adaptation Problem

Some systems avoid the adaptation problem, because they do not apply the complete CBR method, but only a part of it, namely the retrieval. These systems can be divided into two groups, retrieval-only systems and multi-modal reasoning systems. Retrieval-only systems are mainly used for image interpretation, which is mainly a classification task [10]. However, retrieval-only systems are not only used for image interpretation, but for other visualisation tasks too, e.g. for the development of kidney function courses [11] and for hepatic surgery [12]. Multi-modal reasoning systems apply parts of different reasoning methods. From CBR they usually incorporate the retrieval step, often to calculate or support evidence [13]; e.g. in CARE-PARTNER [14] CBR retrieval is used to search for similar cases to support evidence for a rule-based program.

2.2 Solving the Adaptation Problem

So far, only a few medical systems have been developed that apply the complete CBR method. In these systems three main techniques are used for adaptation: adaptation operators or rules, constraints, and compositional adaptation. Furthermore, abstracting from specific single cases to more general prototypical cases sometimes supports the adaptation.

Adaptation rules and operators. One of the earliest medical expert systems that use CBR techniques is CASEY [16]. It deals with heart failure diagnosis. The most interesting aspect of CASEY is the ambitious attempt to solve the adaptation task. Since the creation of a complete rule base for adaptation was too time consuming, a few general operators are used to solve the adaptation task. Since many features have to be considered in the heart failure domain, and since consequently many differences between cases can occur, not all differences between former similar cases and a query case can be handled by the developed general adaptation operators. So, if no similar case can be found or if adaptation fails, CASEY uses a rule-based domain theory. However, the development of complete adaptation rule bases never became a successful technique for solving the adaptation problem in medical CBR systems, because the bottleneck of rule-based medical expert systems, knowledge acquisition, occurs again.

Constraints. A more successful technique is the application of constraints. In ICONS, an antibiotics therapy adviser [17], adaptation is rather easy: it is just a reduction of a list of solutions (recommended therapies for a similar, retrieved case) by constraints (contraindications of the query patient).
Another example of applying constraints is a diagnostic program concerning dysmorphic syndromes [18]. The retrieval provides a list of prototypes sorted according to their similarity with respect to the current patient. Each prototype defines a diagnosis (dysmorphic syndrome) and represents the typical features for this diagnosis. The provided list of prototypes is checked by a set of explicit constraints. These constraints state that some features of the patient either contradict or support specific prototypes (diagnoses). A typical task for applying constraints is menu planning [19], where different requirements have to be satisfied: special diets and individual factors, not only personal preferences, but also contraindications and demands based on various complications.

Compositional adaptation. A further successful adaptation technique is compositional adaptation [20]. In TA3-IVF [21], a system to modify in vitro fertilisation treatment plans, relevant similar cases are retrieved and compositional adaptation is used to compute weighted averages for the solution attributes. In TeCoMed [22], an early warning system concerning threatening influenza waves, compositional adaptation is applied to the most similar former courses to decide whether a warning against an influenza wave is appropriate.

Abstraction. Since one reason for the adaptation problem is the extreme specificity of single cases, the generalisation from single cases into abstracted prototypes [18] or classes [23] may support the adaptation. The idea of generating more abstract cases is typical for the medical domain, because (proto-)typical cases directly correspond to (proto-)typical diagnoses or therapies. While in GS.52 all prototypes are organised on the same level, in MNAOMIA [23] a hierarchy of classes, cases, and concepts with a few layers is used. MEDIC [24], a diagnostic reasoner in the domain of pulmonology, consists of a multi-layered hierarchy of schemata, scenes, and memory organisation packets of individual cases.
3 Adaptation Problems in Endocrinology Diseases Therapy Support

We have developed an endocrine therapy support system for a children's hospital. The architecture is shown in Figure 1; for technical details see [25]. Here, we focus on adaptation problems within it to illustrate general adaptation problems in medicine. All body functions are regulated by the endocrine system. The endocrine glands produce hormones and secrete them into the blood. Hypothyroidism means that a patient's thyroid gland does not naturally produce enough thyroid hormone. If hypothyroidism is undertreated, it may lead to obesity, bradycardia and other heart diseases, memory loss and many other diseases [26]. Furthermore, in children it causes mental and physical retardation. If hypothyroidism is of autoimmune nature, it sometimes occurs in combination with further autoimmune diseases such as diabetes. The diagnosis of hypothyroidism can be established by blood tests. The therapy is inevitable: thyroid hormone replacement by levothyroxine. The problem is to determine the therapeutic dose, because the thyroxine demand of a patient follows general schemas only very roughly, and so the therapy must be individualised [27]. If the dose is too low, hypothyroidism is undertreated. If the dose is too high, the thyroid hormone concentration is also too high, which leads to hyperactive thyroid effects [26, 27].
There are two different tasks of determining an appropriate dose. The first one aims to determine the initial dose, while later on the dose has to be updated continuously during a patient's lifetime. Precise determination of the initial dose is most important for newborn babies with congenital hypothyroidism, because for them every week of proper therapy counts.
Fig. 1. Architecture of our endocrine therapy support system
3.1 Computing an Initial Dose

For the determination of an initial dose (Fig. 2), a couple of prototypes, called guidelines, exist, which have been defined by commissions of experts. The assignment of a patient to a fitting guideline is obvious because of the way the
Fig. 2. Determination of an initial dose (hypothyroidism patient: dose? → guideline → adaptation by calculation → range of dose → retrieval of similar former cases → check for best therapy results → best similar cases → compositional adaptation → optimized dose)
guidelines have been defined. With the help of these guidelines a range of good doses can be calculated. To compute an optimal dose, we retrieve similar cases with initial doses within the calculated range. Since there are only a few attributes and since our case base is rather small, we use Tversky's sequential measure of dissimilarity [28]. On the basis of those retrieved cases that had the best therapy results, an average initial therapy is calculated. Best therapy results can be determined by the values of another blood test after two weeks of treatment with the initial dose. The opposite idea, to consider cases with bad therapy results, does not work here, because bad results may be caused by various reasons. So, to compute an optimal dose recommendation, we apply two forms of adaptation. First, a calculation of ranges according to guidelines and the patient's attribute values. Secondly, we use compositional adaptation. That means we take only similar cases with the best therapy results into account and calculate the average dose for these cases, which is then adapted to the query patient by another calculation.

Example for computing an initial dose. The query patient is a newborn baby that is 20 days old, weighs 4 kg and is diagnosed with hypothyroidism. The guideline for babies of about 3 weeks of age and normal weight recommends a levothyroxine therapy
with a daily dose between 12 and 15 µg/kg. So, because the baby weighs 4 kg, a range of 48-60 µg is calculated. The retrieval provides similar cases whose doses lie within the calculated range. These cases are restricted to those where, after two weeks of treatment, less than 10 µU/ml thyroid stimulating hormone could be observed. Since these remaining similar cases are all treated alike, an average dose per kg is computed, which is subsequently multiplied by the query patient's weight to deliver the optimal daily dose.

3.2 Updating the Dose in a Patient's Lifetime

For monitoring the patient, three laboratory blood tests have to be made. Usually the results of these tests correspond to each other. Otherwise, this indicates a more complicated thyroid condition and additional tests are necessary. If the tests show that the patient's thyroid hormone level is normal, the current levothyroxine dose is adequate. If the tests indicate that the thyroid hormone level is too low or too high, the current dose has to be increased or decreased, respectively, by 25 or 50 µg [26, 27]. So, for monitoring, adaptation is just a simple calculation according to well-known standards.

Example. Figure 3 shows an example of a case study. We compared the decisions of an experienced doctor with the recommendations of our system. The decisions are based on the basic laboratory tests and on lists of observed symptoms. Intervals between two visits are approximately six months.
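As a purely illustrative sketch of the two dose-determination steps of Sect. 3.1 and Sect. 3.2 (the guideline values and the TSH threshold come from the example above; the field names, the case records and the fallback rule are our assumptions, not the authors' implementation):

```python
def initial_dose(weight_kg, per_kg_range, similar_cases, tsh_threshold=10.0):
    """Sect. 3.1: guideline range plus compositional adaptation over retrieved cases."""
    low, high = per_kg_range[0] * weight_kg, per_kg_range[1] * weight_kg
    # keep only retrieved cases whose dose lies in the range and whose therapy result was good
    good = [c for c in similar_cases
            if low <= c["dose_ug"] <= high and c["tsh_after_2_weeks"] < tsh_threshold]
    if not good:
        return (low + high) / 2.0        # fallback: middle of the guideline range (assumption)
    avg_per_kg = sum(c["dose_ug"] / c["weight_kg"] for c in good) / len(good)
    return avg_per_kg * weight_kg

def updated_dose(current_dose_ug, hormone_level, step_ug=25):
    """Sect. 3.2: raise or lower the dose by 25 (or 50) ug depending on the test results."""
    if hormone_level == "too low":
        return current_dose_ug + step_ug
    if hormone_level == "too high":
        return current_dose_ug - step_ug
    return current_dose_ug

# Newborn example of Sect. 3.1: 4 kg, guideline 12-15 ug/kg -> search doses within 48-60 ug
cases = [{"dose_ug": 50, "weight_kg": 3.8, "tsh_after_2_weeks": 6.0},
         {"dose_ug": 58, "weight_kg": 4.2, "tsh_after_2_weeks": 4.5}]
print(initial_dose(4.0, (12, 15), cases))
```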
Fig. 3. Dose updates recommended by our program compared with the doctor's decision (levothyroxine in µg/day, 0-120, over the visits V1-V22). V1 means the first visit, V2 the second visit, etc.
In this example there are three deviations; usually there are fewer. At the second visit (V2), according to the laboratory results the levothyroxine should be increased. Our program recommended too high an increase: the applied adaptation rule was not precise enough, so we modified it. At visit 10 (V10) the doctor decided to try to decrease the dose. The doctor's reasons were not included in our knowledge base, and since his attempt was not successful, we did not alter any adaptation rule. At visit 21 (V21) the doctor increased the dose because of some minor symptoms of hypothyroidism, which were not included in our program's list of hypothyroidism symptoms. Since the doctor's decision was probably right (visit 22), we added these symptoms to the list of hypothyroidism symptoms of our program.

3.3 Additional Diseases or Complications

It often occurs that patients do not only have hypothyroidism, but additionally suffer from further chronic diseases or complications. So, the levothyroxine therapy has to be checked for contraindications, adverse effects and interactions with additionally existing therapies. Since no alternative is available to replace levothyroxine, the additionally existing therapies have to be modified, substituted, or compensated if necessary (Fig. 4) [26, 27].
Fig. 4. Levothyroxine therapy and additionally existing therapies (hypothyroidism and a further complication lead to a levothyroxine therapy and a further therapy; the checks and reactions are: contraindications? → modifications; adverse effects? → substitution or compensation; interactions? → adaptation rules)
In our support program we perform three tests. The first one checks whether another existing therapy is contraindicated with respect to hypothyroidism. This holds only for very few therapies, namely for specific diets like soybean infant formula, but these are typical for newborn babies. Such diets have to be modified. Since no exact knowledge is available on how to do this, our program just issues a warning saying that a modification is necessary.
The second test considers adverse effects. There are two ways to deal with them: a further existing therapy either has to be substituted or it has to be compensated by another drug. Since such knowledge is available, we have implemented corresponding rules for substitutional and compensational adaptation, respectively. The third test checks for interactions between both therapies. Here we have implemented some adaptation rules, which mainly attempt to avoid the interactions. For example, if a patient has heartburn problems that are treated with an antacid, a rule for this situation exists stating that levothyroxine should be administered at least 4 hours before or after the antacid. However, if no adaptation rule can solve such an interaction problem, the same substitution rules as for adverse effects are applied.
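The three checks lend themselves to a simple rule encoding. The following Python fragment is only a schematic illustration; the data structures and all rule contents except the antacid rule mentioned in the text are made up for the example:

```python
CONTRAINDICATED = {"soybean infant formula"}                      # first test
ADVERSE_EFFECTS = {"hypothetical_drug_A"}                          # second test (made-up entry)
INTERACTION_RULES = {                                              # third test
    "antacid": "administer levothyroxine at least 4 hours before or after the antacid",
}

def check_additional_therapy(therapy):
    """Return the adaptation actions for a therapy given in addition to levothyroxine."""
    actions = []
    if therapy in CONTRAINDICATED:
        actions.append(f"warning: {therapy} has to be modified")
    if therapy in ADVERSE_EFFECTS:
        actions.append(f"substitute or compensate {therapy}")
    if therapy in INTERACTION_RULES:
        actions.append(INTERACTION_RULES[therapy])
    return actions or ["no adaptation necessary"]

print(check_additional_therapy("antacid"))
```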
4 Conclusion: Adaptation Techniques for Medical Therapy Problems

Our intention is to undertake first steps in the direction of developing a methodology for the adaptation problem in medical CBR systems. However, we have to admit that we are just at the beginning. Furthermore, our experience concerning the adaptation problem is mainly based on therapeutic tasks. Indeed, in contrast to the ideas of many computer scientists, doctors are much more interested in therapeutic than in diagnostic support programs. So, in this paper, we firstly reviewed how adaptation is handled in medical CBR systems and secondly enriched these experiences by additional examples from the endocrinology domain, where we have recently developed support programs. For medical therapy systems that intend to apply the whole CBR cycle, we can at present summarise the following useful adaptation techniques (Fig. 5). However, most of them are promising only for specific tasks.

Abstraction from single cases to more general prototypes seems to be a promising implicit support. However, if the prototypes correspond to guidelines, they may even explicitly solve some adaptation steps (see section 3.1).

Compositional adaptation at first glance does not seem to be appropriate in medicine, because it was originally developed for configuration [20]. However, it has been successfully applied for calculating therapy doses (e.g. in TA3-IVF [21], and see section 3.1).

Constraints are a promising adaptation technique too, but only for a specific situation (see section 2.2), namely for a set of solutions that can be reduced by checking, e.g., contraindications (in ICONS) or contradictions (in GS.52).

Adaptation rules. The only technique that seems to be general enough to solve many medical adaptation problems is the application of adaptation rules or operators. The technique is general, but unfortunately the content of such rules has to be domain specific. Especially for complex medical tasks the generation of adaptation rules is often too time consuming and sometimes even impossible.
Fig. 5. A task oriented adaptation model (therapeutic adaptation tasks: initial therapy, dose calculation, monitoring/dose update, additional therapies; associated techniques: constraints, prototype/guideline, compositional adaptation, calculation according to guidelines, substitution, compensation, modification, elimination)
However, for therapeutic tasks some typical forms of adaptation rules can be made out, namely for substitutional and compensational adaptation (e.g. section 3.3.), and for calculating doses (e.g. section 3.2.). So, a next step might be an attempt to generate general adaptation operators for these typical forms of adaptation rules.
References

1. Aamodt, A., Plaza, E.: Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Communications 7 (1) (1994) 39-59
2. Stottler, R. et al.: Rapid retrieval algorithms for case-based reasoning. In: Proc of 11th Int Joint Conference on Artificial Intelligence, Morgan Kaufmann Publishers, San Mateo (1989) 233-237
3. Broder, A.: Strategies for efficient incremental nearest neighbor search. Pattern Recognition 23 (1990) 171-178
4. Quinlan, J.: C4.5, Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo (1993)
5. Bergmann, R., Wilke, W.: Towards a new formal model of transformational adaptation in case-based reasoning. In: Gierl, L., Lenz, M. (eds.): Proceedings of the German Workshop on CBR. University of Rostock (1998) 43-52
6. Fuchs, B., Mille, A.: A knowledge-level task model of adaptation in case-based reasoning. In: Althoff, K.-D. et al. (eds.): Case-Based Reasoning Research and Development, Proc. of 3rd Int Conference. Springer, Berlin (1999) 118-131
7. Gierl, L., Bull, M., Schmidt, R.: CBR in Medicine. In: Lenz, M. et al.: Case-Based Reasoning Technology, From Foundations to Applications. Springer, Berlin (1998) 273-297
8. Schmidt, R., Montani, S., Bellazzi, E., Portinale, L., Gierl, L.: Case-Based Reasoning for Medical Knowledge-based Systems. Int J Med Inform 64 (2-3) (2001) 355-367
9. Nilsson, M., Sollenborn, M.: Advancements and trends in medical case-based reasoning: An overview of systems and system developments. In: Proc of FLAIRS, AAAI Press (2004) 178-183
10. Perner, P.: Why Case-Based Reasoning is Attractive for Image Interpretation. In: Aha, D., Watson, I. (eds.): Case-Based Reasoning Research and Development, Proc 4th Int Conference. Springer, Berlin (2001) 27-43
11. Schmidt, R. et al.: Medical multiparametric time course prognoses applied to kidney function assessments. Int J Med Inform 53 (2-3) (1999) 253-264
12. Dugas, M.: Clinical applications of Intranet-Technology. In: Dudeck, J. et al. (eds.): New Technologies in Hospital Information Systems. IOS Press, Amsterdam (1997) 115-118
13. Montani, S. et al.: Diabetic patient's management exploiting Case-based Reasoning techniques. Computer Methods and Programs in Biomedicine 62 (2000) 205-218
14. Bichindaritz, I. et al.: Case-based reasoning in CARE-PARTNER: Gathering evidence for evidence-based medical practice. In: Smyth, B., Cunningham, P. (eds.): Advances in Case-Based Reasoning, Proc of 4th European Workshop. Springer, Berlin (1998) 334-345
15. Koton, P.: Reasoning about evidence in causal explanations. In: Kolodner, J. (ed.): First Workshop on CBR. Morgan Kaufmann Publishers, San Mateo (1988) 260-270
16. Schmidt, R., Gierl, L.: Case-based Reasoning for Antibiotics Therapy Advice: An Investigation of Retrieval Algorithms and Prototypes. Artificial Intelligence in Medicine 23 (2) (2001) 171-186
17. Gierl, L., Stengel-Rutkowski, S.: Integrating consultation and semi-automatic knowledge acquisition in a prototype-based architecture: Experiences with Dysmorphic Syndromes. Artificial Intelligence in Medicine 6 (1994) 29-49
18. Petot, G.J., Marling, C., Sterling, L.: An artificial intelligence system for computer-assisted menu planning. Journal of American Diet Assoc 98 (9) (1998) 1009-1014
19. Wilke, W., Smyth, B., Cunningham, P.: Using Configuration Techniques for Adaptation. In: Lenz, M., Bartsch-Spörl, B., Burkhard, H.-D., Wess, S. (eds.): Case-Based Reasoning Technology, From Foundations to Applications. Springer, Berlin (1998) 139-168
20. Jurisica, I. et al.: Case-based reasoning in IVF: prediction and knowledge mining. Artificial Intelligence in Medicine 12 (1998) 1-24
21. Schmidt, R., Gierl, L.: Prognostic Model for Early Warning of Threatening Influenza Waves. In: Minor, M., Staab, S. (eds.): Proc 1st German Workshop on Experience Management. Köllen, Bonn (2002) 39-46
22. Bichindaritz, I.: From cases to classes: Focusing on abstraction in case-based reasoning. In: Burkhard, H.-D., Lenz, M. (eds.): Proc of 4th German Workshop on CBR. Humboldt University Berlin (1996) 62-69
23. Turner, R.: Organizing and using schematic knowledge for medical diagnosis. In: Kolodner, J. (ed.): First Workshop on CBR. Morgan Kaufmann Publishers, San Mateo (1988) 435-446
24. Schmidt, R., Vorobieva, O., Gierl, L.: Adaptation problems in therapeutic case-based reasoning systems. In: Palade, L., Howlett, R.J., Jain, L. (eds.): Knowledge-Based Intelligent Information and Engineering Systems, Springer, Berlin (2003) 992-999
25. Hampel, R.: Diagnostik und Therapie von Schilddrüsenfunktionsstörungen. UNI-MED Verlag, Bremen (2000)
26. DeGroot, L.J.: Thyroid Physiology and Hypothyroidism. In: Besser, G.M., Turner, M. (eds.): Clinical endocrinology. Wolfe, London (1994) (Chapter 15)
27. Tversky, A.: Features of similarity. Psychological Review 84 (1977) 327-352
Transcranial Magnetic Stimulation (TMS) to Evaluate and Classify Mental Diseases Using Neural Networks

Alberto Faro¹, Daniela Giordano¹, Manuela Pennisi², Giacomo Scarciofalo¹, Concetto Spampinato¹, and Francesco Tramontana¹

¹ Dipartimento di Ingegneria Informatica e Telecomunicazioni, Università di Catania, viale A. Doria 6, 95125 Catania, Italy
{afaro, dgiordan, gscarcio, cspampin, ftramont}@diit.unict.it
² Dipartimento di Neuroscienze, Università di Catania, Policlinico Città Universitaria, via S. Sofia 78, 95123 Catania, Italy
[email protected]
Abstract. The paper proposes a methodology based on a neural network to process the signals related to the hand movements in response to Transcranial Magnetic Stimulation (TMS), in order to diagnose the pathology and evaluate the treatment of patients affected by dementia diseases. First a time-frequency analysis of such signals is carried out to identify the main signal variables that characterize the dementia diseases. Then these variables are processed by a neural network in order to classify the responses into four classes: healthy subjects and people affected by Subcortical Ischemic Vascular Dementia (SIVD) and/or Alzheimer. A comparison between the proposed method and a fuzzy approach previously developed by the authors is presented.
1 Introduction

Early diagnosis and effective therapy are decisive for a successful treatment of patients affected by highly critical diseases. However, current medical practice lacks quantitative methods able to diagnose, with a given confidence, the type and degree of mental diseases such as Alzheimer and Subcortical Ischemic Vascular Dementia (SIVD). The paper aims at overcoming this limit by a two-step methodology. First, the relevant variables that may reveal the type and the degree of the mental diseases are identified. The second step is to find a non-linear model based on the mentioned variables that allows us to identify the type of disease by a neural network and to express a quantitative evaluation of the patient status in terms of fuzzy quantifiers (e.g., SIVD at an initial stage, very advanced Alzheimer disease, and so on). Several variables and experiments may be used for discovering the type and the degree of the mental disease, e.g., the average values of EEG signals or appropriate tests related to cognitive and physical abilities. In this paper Transcranial Magnetic Stimulation (TMS) is taken into account since it may be used for both diagnosis and therapy [1]. Sect. 2 briefly recalls the main features of TMS and how it is currently administered for the treatment of patients affected by mental diseases [2].
In Sect. 3 a time-frequency analysis is carried out in order to identify which variables related to the responses of the patients to TMS are useful to characterize the mental diseases. In Sect. 4 all these variables are given as input to a supervised neural network in order to classify people into four classes, i.e., healthy people and people affected by Alzheimer and/or SIVD. In that section a comparison is also carried out between this method and a fuzzy approach previously developed by the authors [3], to point out the advantages of integrating the two approaches into a two-step methodology: the first step deals with disease identification by a neural network, the second step aims at evaluating the degree and the effect of the TMS-based therapy by fuzzy logic.
2 Transcranial Magnetic Stimulation (TMS) TMS produces a modification of the neuronal activity related to a defined brain area stimulated by the variable magnetic field generated by a coil [4]. Such magnetic field induces an electrical flow in the brain that goes towards the limbs and determines some involuntary movements of the legs and of the hands [5]. The movements of the limbs in response to TMS are measured by an ElectroMyoGraph (EMG). TMS is painless and its positive effect consists of an increase in the physical and cognitive performances of the brain activity of the people affected by dementia. Since after a certain period of time such neural activity returns to the initial conditions, it is useful to have a method to identify when TMS should be administered anew to reactivate the neural activity. To gain some insight about the patient conditions from the muscles response to TMS, the tests are conducted according to a protocol consisting of three phases: “Bi-Stim-Before”, “R-Stim” and “Bi-Stim-After”. In phase “Bi-Stim-Before”, two stimulations are given to the subject; the first acts as a conditioning stimulus, the second acts as the testing stimulus. The muscular responses to be compared with the ones obtained after the repetitive stimulation are taken and stored by the EMG. In phase “R-Stim”, a repetitive stimulation is administered to the subject for a certain period of time (e.g., one repetitive stimulation session per day for 15 days). The patient is stimulated for about 30 minutes by a magnetic field at a stimulation frequency between 1 Hz and 30 Hz depending on the pathology. In phase “Bi-Stim-After”, the patient is stimulated as in phase “Bi-Stim-Before”. By comparing the signals obtained from the hand movements before and after a repetitive stimulation, it is possible to evaluate the subject’s status. In healthy subjects the signals before and after R-stim are more or less the same, whereas in the people affected by dementia such signals differ. Data on the muscular responses are collected during the mentioned Bi-Stim phase varying the delay between the first and the second stimulus. Such delay is called Inter-Stimulus time Interval (ISI). Six ISIs have been tested (i.e., 0,1,2,5,7,10 ms). For each ISI nine tracks have been stored, for a total of 54 stimulations per subject. After the R-Stim phase, a second bi-stimulation phase (i.e., Bi-Stim-After phase) is carried out to evaluate if the subject is healthy and if the R-Stim phase has produced the envisaged positive effect for patients affected by dementia. To give a quantitative basis to such a comparison, it is necessary to identify the values of the muscular response that may be useful to characterize the subject status. To this aim an analysis of
the signals related to the mentioned muscular movements in the time-frequency domain and using the Wavelet transform is carried out in the next section.
3 Variables Characterizing the Mental Diseases

The variables related to the signals detected by the EMG in the mentioned TMS tests are assumed to help characterize the mental diseases if they show different values before and after the repetitive stimulation for the patients affected by dementia, since they remain unchanged for healthy people. To diagnose the type and the degree of the dementia it is useful to have more than one relevant variable, since in the absence of an explicit model the only way to obtain a characterization of the dementia is to put dementia type and degree into correspondence with a particular combination of the mentioned variables and related values. In the following we discuss which variables may be used for this aim.

To find the variables relevant to characterize the dementia diseases we have taken into account several variables defined on the EMG tracks, for example the following ones: latency L, amplitude A, max and min module of the Fast Fourier Transform (FFT), max and min module of the Hilbert transform. The relevance check for each of these variables was performed by first finding the average track for a given value of ISI and then by verifying, ISI by ISI, whether the variable values after R-Stim differ from the ones assumed by the same variable before R-Stim. This check was performed for each variable by a simple inspection of a graphic consisting of two curves: one formed by the values of the variable in the 9 stimulations before R-Stim and the other formed by the values of the variable in the 9 stimulations after R-Stim. Interestingly, this allowed us not only to identify the variables relevant for this study on the mental diseases, but also to find that the stimulations for ISI > 5 are not useful for treating the patients. Table 1 points out that this effect is valid for all the variables taken into account; thus we limit our study to EMG tracks with ISI ≤ 5. In the table the symbols ↑, ↓ and ≈ indicate that the variables after R-Stim increase, decrease, or do not change, respectively.

Table 1. Changes in the variables after the repetitive stimulation at different ISIs

        A   L   FFT Mod. Max   FFT Mod. Min   HT Mod. Max   HT Mod. Min
ISI 1   ↑   ↓        ↑              ↓              ↓             ↑
ISI 2   ↑   ↓        ↑              ↓              ↓             ↑
ISI 5   ↑   ↓        ↑              ↓              ↓             ↑
ISI 7   ≈   ≈        ↑              ≈              ≈             ↓
ISI 10  ↑   ≈        ≈              ↓              ≈             ≈
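The before/after comparison underlying Table 1 can be expressed as a small helper. The tolerance threshold below is an arbitrary illustration (the authors relied on visual inspection of the two curves):

```python
import numpy as np

def trend(values_before, values_after, tol=0.05):
    """Classify how a variable changes after R-Stim: '↑', '↓' or '≈' (Table 1 style)."""
    b, a = np.mean(values_before), np.mean(values_after)
    if abs(a - b) <= tol * max(abs(b), 1e-9):   # relative tolerance, purely illustrative
        return "≈"
    return "↑" if a > b else "↓"

# e.g. amplitude A at a given ISI, from the 9 tracks before and after R-Stim (dummy values)
print(trend([1.1, 1.0, 1.2, 1.1, 1.0, 1.2, 1.1, 1.0, 1.1],
            [1.4, 1.5, 1.3, 1.4, 1.6, 1.5, 1.4, 1.3, 1.5]))   # -> '↑'
```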
For a better characterization of the EMG tracks it has been decided to take into account also the Continuous Wavelet Transform (CWT) of the signals detected before and after R-stim. What makes CWT interesting is that it is able to point out details of the signals that are difficult to evaluate by using the classical time-frequency analysis,
such as rapid variability or slow evolution. A comparison between the CWT of the signals detected before and after R-Stim is then able to point out regularities or differences that cannot be detected by other methods. In this way it is possible to diagnose difficult dementia cases that cannot be detected by the Fourier or Hilbert transform. To this aim, after having computed the CWT for the signals related to the various ISIs, a cross-correlation has been carried out between the set of values related to the phase before and the one related to the phase after R-Stim. We have found that for a patient affected by dementia the maximum of such cross-correlation occurs at the highest ISI, while the cross-correlation at the lowest ISI is very low (near to zero). Healthy people have shown a high cross-correlation for any ISI. This confirms that stimulations at the highest ISI are not useful for the patient's treatment and that the CWT is an effective variable to be taken into account to diagnose mental diseases.
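A minimal sketch of this comparison, using PyWavelets for the CWT, is given below. The wavelet choice, the scales and the zero-lag correlation are our simplifications (the authors take the maximum of the cross-correlation), and the library is assumed to be available:

```python
import numpy as np
import pywt  # PyWavelets, assumed to be available

def cwt_similarity(track_before, track_after, scales=np.arange(1, 64), wavelet="morl"):
    """Normalized (zero-lag) correlation between the CWT coefficient maps of the
    averaged EMG tracks recorded before and after R-Stim for one ISI."""
    coeffs_before, _ = pywt.cwt(track_before, scales, wavelet)
    coeffs_after, _ = pywt.cwt(track_after, scales, wavelet)
    a = (coeffs_before - coeffs_before.mean()) / coeffs_before.std()
    b = (coeffs_after - coeffs_after.mean()) / coeffs_after.std()
    return float(np.mean(a * b))   # close to 1 for healthy subjects, low at low ISI for dementia

# usage with dummy signals:
# sim = cwt_similarity(np.sin(np.linspace(0, 10, 256)), 0.9 * np.sin(np.linspace(0, 10, 256)))
```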
4 Classifying Mental Diseases: Neural Networks Versus Fuzzy Logic

The choice of the variables to be given as inputs to a neural network is particularly important, since considering variables that are not relevant for the study may cause high imprecision in classifying the cases. Thus the preprocessing phase discussed in the previous section is necessary for a successful classification of the patients. The variables that have a low cross-correlation between the values taken before and after the repetitive stimulation for the patients affected by dementia, while showing a high cross-correlation for healthy people, are the following: latency, amplitude, FFT max module, FFT min module, min module of the Hilbert transform, and max cross-correlation of the wavelet transform. The differences between the values of these variables, with respect to the same ISI, before and after the repetitive stimulation are given as inputs to a multilayer neural network consisting of six input and two output neurons. A hidden layer of 20 neurons proved sufficient to make the neural network generalize from the learned examples. The output neurons classify the cases into four classes: None, Alzheimer, SIVD and Mixed Dementia. Table 2 shows the success rate obtained in diagnosing the pathology using the neural network, in comparison with the findings the authors obtained in another study using fuzzy logic, where the same input variables, except the one dealing with the wavelet transform, were taken into account.

Table 2. Success rate in the diagnosis of the pathology
Pathology        Using neural networks   Using fuzzy logic
None             90%                     90%
Alzheimer        92.7%                   92.5%
SIVD             95%                     92.5%
Mixed dementia   87%                     90%
The comparison shows that the neural network approach is more appropriate to classify crisp cases, i.e., healthy people and people affected by Alzheimer or SIVD,
whereas the fuzzy-logic-based approach is more appropriate for discovering mixed dementia. These results refer to twenty cases. Validation of the success rate will be done with the larger number of cases that is currently being collected. Two other important user requirements have to be taken into account, i.e., the simplicity of obtaining the right information and the possibility of gaining a useful quantitative evaluation for both the diagnosis and the therapy. The first requirement can be easily satisfied by a neural network with a crisp codification of the classes to recognize, whereas the fuzzy approach can be suitably used to express quantitative/qualitative rules that are enough to support medical decisions, for example “the patient is affected by an early dementia that may be treated by frequent sessions of TMS”. For this reason it may be convenient to integrate neural network and fuzzy logic in a unique methodology as follows: first the patient’s condition is evaluated with the help of a neural network, then the fuzzy approach is applied to provide hints about the diagnosis and therapy.
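For illustration only, the 6-20-2 network described above could be prototyped as follows with scikit-learn; note that MLPClassifier predicts the four labels directly rather than through the two-output binary coding used by the authors, and the data below are random placeholders:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))      # per subject: the six before/after differences listed above (dummy values)
y = rng.choice(["None", "Alzheimer", "SIVD", "Mixed dementia"], size=20)

clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:3]))
```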
5 Concluding Remarks

The paper has shown that the responses to transcranial magnetic stimulation (TMS) can be used effectively to classify mental diseases by using both neural networks and fuzzy classifiers. The neural network classification performs better in classifying crisp cases (i.e., no mental disease, SIVD, Alzheimer), whereas fuzzy logic is more suitable when one is interested in having some measurements about the mental disease, e.g., to identify the disease's stages or whether the disease is due to both SIVD and Alzheimer. An integration of the two classification methods could be very effective. A way to increase the success rate is to increase the number of inputs by taking into account variables related to other practices, e.g., EEG signals, genetic information and so on. Adding information related to the wavelet transform to the fuzzy classifier could also increase its success rate. This is left for further study.
References

1. Handbook of TMS. Edited by Pascual-Leone, A. et al. Arnold, London (2001)
2. Iramina, K., Maeno, T., Kowatari, Y., Ueno, S.: Effects of Transcranial Magnetic Stimulation on EEG Activity, IEEE Transactions on Magnetics, Vol. 38, N. 5 (2002)
3. Faro, A., Giordano, D., Pennisi, M., Scarciofalo, G., Spampinato, C., Tramontana, F.: A Fuzzy System to Analyze SIVD Diseases using TMS, Int. J. Signal Processing N. 2 (2005)
4. Pepin, J.L., Bogacz, D., de Pasqua, V., Delwaide, P.J.: Motor Cortex Inhibition is not Impaired in Patients with Alzheimer's Disease: Evidence from Paired Transcranial Magnetic Stimulation, J. Neurological Sciences, Vol. 170, N. 2 (1999)
5. Ruohonen, J., Ollikainen, M., Nikouline, V., Virtanen, J., Ilmoniemi, R.J.: Coil Design for Real and Sham Transcranial Magnetic Stimulation, IEEE Transactions on Biomedical Engineering, Vol. 47, N. 2 (2000)
Towards Information Visualization and Clustering Techniques for MRI Data Sets

Umberto Castellani¹, Carlo Combi¹, Pasquina Marzola², Vittorio Murino¹, Andrea Sbarbati², and Marco Zampieri¹

¹ Dipartimento di Informatica - Università di Verona
² Dipartimento di Scienze Morfologiche Biomediche - Università di Verona

Abstract. The paper deals with the integrated use of Information Visualization techniques and clustering algorithms to analyze Magnetic Resonance Imaging (MRI) data sets. The paper also describes the criteria we followed in designing and implementing the prototype, according to the above approach. Finally, some preliminary results are given for the considered medical application.
1 Introduction
The research interest in visualization and analysis of multidimensional data has grown rapidly during the last decade [2]. Focusing on medical applications, several activities require the visual investigation and analysis of images, video, and graphs [1]. In this paper we aim at integrating the use of Information Visualization (IV) techniques and data mining algorithms for the analysis of Magnetic Resonance Imaging (MRI) data sets. Preclinical and clinical evaluation of the efficacy of antiangiogenic compounds poses new problems to researchers because the traditional criterion for assessing the response of a tumor to a treatment, based on the measurement of tumor size reduction, is no longer valid [3]. Dynamic Contrast Enhanced MRI (DCE-MRI) techniques play a relevant role in this field [3]. DCE-MRI with macromolecular contrast agents has been used to measure characteristics of tumor microvessels such as transendothelial permeability (kPS) and fractional plasma volume (fPV), which are accepted surrogate markers of tumor angiogenesis [3]. In order to improve the analysis of this kind of data, an efficient and effective visual application has been developed with respect to the main IV criteria and methodologies. Moreover, a cluster analysis in the kPS and fPV parameter space has been introduced by adopting a Bayesian approach based on the combined application of the K-Means algorithm and the Bayesian Information Criterion (BIC) [4]. The main contribution consists of merging the cluster analysis and the linked brushing visualization method by switching between the MRI volumetric space and the parameter space.
2 Information Visualization
Graphical tools for visualizing data or knowledge are very useful for human perception [2]. In order to design and develop an effective IV application, the
visualization process needs to be carefully defined: (i) raw data are transformed into intermediate representations (data transformation); (ii) transformed data need to be represented by effective visual structures (this phase is called visual mapping); visual structures are characterized by the spatial substrate, the graphical objects or marks, and the connection and enclosure [2]; finally, (iii) the last step of the visualization process consists of view transformations: the visual structures are interactively modified in order to improve the perception of the analyzed data. An effective example of view transformation is linked brushing, i.e., the cursor passing over one location creates visual effects on other markers.
3 The DCE-MRI Data Set
The aim of DCE-MRI experiments is to determine non-invasively the fractional plasma volume (fPV) of tumor tissue and the endothelial permeability (kPS) of tumor vasculature, accepted surrogate markers of angiogenesis. HT-20 human colon carcinoma fragments were implanted subcutaneously in the flank of 10 nude mice [3]. The dynamic evolution of the signal intensity in MR images is analyzed using a two-compartment tissue model in which the contrast agent can freely diffuse between plasma and the interstitial space. kPS and fPV values are obtained pixel by pixel by fitting the theoretical expression to the experimental data. In this work the experiments were performed in order to assess the antiangiogenic effect of a specific drug. Data analysis was adapted from [3] for the special case of a macromolecular contrast agent.
4 Cluster Analysis
In order to reduce manual interventions in identifying different parts of the considered parametric maps, the classic K-Means algorithm [4] has been combined with the Bayesian Information Criterion, aiming at automatically estimating the number of classes K and the derived clusters. The main idea consists of computing different clusterizations by running K-Means with different values of K, and using the BIC criterion to evaluate each of them. The implemented procedure is a simplified version of the X-Means algorithm proposed by Pelleg and Moore in [4]: assume we are given the data set D and a family of alternative clusterizations Mj, or models. Different models correspond to solutions of K-Means with different values of K. For example, M2 denotes the clusterization obtained from 2-Means (i.e., K-Means with K = 2). Furthermore, K-Means can be modelled by spherical Gaussians [4]. The a posteriori probability Pr(Mj | D) is used to score the models. In order to approximate the posteriors, up to normalization, the Kass and Wasserman formula [4] is used:

\[ BIC(M_j) = \hat{l}_j(D) - \frac{p_j}{2}\,\log R \tag{1} \]

where \(\hat{l}_j(D)\) is the log-likelihood of the data according to the j-th model, taken at the maximum likelihood point, and p_j is the number of independent parameters
in Mj, and R = |D|. The maximum likelihood estimate (MLE) is applied for the variance, under the identical spherical Gaussian assumption [4]. In order to automatically estimate the value of K, the K-Means algorithm is performed several times while incrementing K. At each computation, the BIC evaluation of the obtained clusterization is calculated. As usual, a Gaussian behavior is assumed for the BIC evaluation, so that the maximum is obtained when the BIC value ceases to increase.
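A possible sketch of this selection loop with scikit-learn is given below. The BIC implementation is a simplified spherical-Gaussian score in the spirit of Eq. (1), not necessarily the exact formula used by the authors, and the parameter count p follows one common convention:

```python
import numpy as np
from sklearn.cluster import KMeans

def bic_score(X, km):
    """Simplified spherical-Gaussian BIC: log-likelihood minus (p/2) log R, cf. Eq. (1)."""
    R, M = X.shape
    K = km.n_clusters
    labels, centers = km.labels_, km.cluster_centers_
    sq = np.sum((X - centers[labels]) ** 2, axis=1)
    var = sq.sum() / max(R - K, 1)                      # pooled MLE of the spherical variance
    counts = np.bincount(labels, minlength=K)
    log_lik = (np.log(counts[labels] / R).sum()
               - 0.5 * R * M * np.log(2 * np.pi * var)
               - sq.sum() / (2 * var))
    p = K * M + K                                       # cluster centres plus mixing weights
    return log_lik - 0.5 * p * np.log(R)

def select_k(X, k_max=10):
    """Increase K until the BIC ceases to increase, as described above."""
    best = None
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        bic = bic_score(X, km)
        if best is not None and bic <= best[1]:
            break
        best = (k, bic, km)
    return best[0], best[2]
```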
5 The Proposed Application: Visual MRI
The proposed application, named Visual MRI, allows the visualization of different representations of the MRI data set with respect to the selected spatial substrate. Four spatial substrates have been defined: the physical space, the kPS and fPV spaces, and the joined parameter space.

– physical space: The physical space is a volumetric space. As usual for MRI data sets, data are visualized through a 2D image (the MRI image) that is obtained by selecting a slice of the whole volume.
– kPS and fPV spaces: The considered parameters kPS and fPV, previously described, are derived from the MRI slices by applying the pharmacokinetics model introduced in [3]. Two parameter maps are derived that represent the kPS and fPV spaces, respectively.
– joined parameter space: In order to obtain a joined representation of the kPS and fPV maps, a further parameter space is defined. Two quantitative axes have been introduced (kPS for the horizontal and fPV for the vertical axis). Then, the physical image has been scanned and for each pixel x_i the two values kPS_i and fPV_i are observed. Thus, after the projection of all the 2D points
Fig. 1. Selected region by ROI operator (left) and kPS (middle), f PV (right) maps
Fig. 2. Parameter space before and after cluster detection
Fig. 3. Tumoral regions detected by the reverse linked brushing
Then, the selected region is mapped into the joined parameter space and the most significant clusters are collected by carrying out the automatic cluster detection operator. Figure 2 shows the parameter space before and after cluster detection, respectively. This phase realizes a straight linked brushing (from the physical space to the joined parameter space). The selected clusters are then reprojected into the physical space, highlighting some slice regions of the cancer area. This phase realizes a reverse linked brushing (from the joined parameter space to the physical space). Figure 3 highlights the tumoral regions inferred by the proposed approach: in the present experimental model, cluster analysis provides a subdivision of the tumor tissue into three clusters. The first cluster (left part of Figure 3) contains in general a limited number of pixels characterized by the highest values of fPV. The second cluster (middle part of Figure 3) contains an intermediate number of pixels located in the external part of the tumor. The third
cluster (right part of Figure 3) contains most pixels of the image and covers the whole interior of the tumor; pixels belonging to this last cluster are characterized by low values of fPV. It is worth noting that from the medical point of view this is an interesting result, since the regions detected by reverse linked brushing correspond to the typical subdivision of the tumoral area observed in this kind of tumor by histology [3].
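The projection and reprojection steps (straight and reverse linked brushing) can be sketched in a few lines of Python; the array names, the fixed number of clusters, and the use of scikit-learn are illustrative assumptions rather than the actual implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def linked_brushing(kps_map, fpv_map, roi_mask, n_clusters=3):
    """Straight brushing: project ROI pixels into the joined (kPS, fPV) space;
    reverse brushing: map each detected cluster back onto the image plane."""
    pts = np.column_stack([kps_map[roi_mask], fpv_map[roi_mask]])        # straight brushing
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(pts)
    region_masks = []
    for k in range(n_clusters):                                          # reverse brushing
        mask = np.zeros(kps_map.shape, dtype=bool)
        mask[roi_mask] = (labels == k)
        region_masks.append(mask)
    return pts, labels, region_masks
```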
6 Conclusions
In this paper a new application for MRI data analysis has been proposed, aiming at improving the support of medical researchers in the context of cancer therapy: preliminary results have shown the effectiveness of the Visual MRI application for the comprehension of dependencies between the analyzed parameters and the known tumoral regions. The information visualization criteria have been carefully considered and implemented. Furthermore, data mining techniques have been introduced by defining a simplified version of the X-Means algorithm for cluster analysis.
References

1. J. Bemmel and M. A. Musen. Handbook of medical informatics. Springer, 2002.
2. S.K. Card, J.D. Mackinglay, and B. Shneiderman. Readings in Information Visualization - Using Vision to Think. Morgan Kaufmann, San Francisco, 1999.
3. Marzola P., Degrassi A., Calderan L., Farace P., Crescimanno C., Nicolato E., Giusti A., Pesenti E., Terron A., Sbarbati A., Abrams T., Murray L., and Osculati F. In vivo assessment of antiangiogenic activity of su6668 in an experimental colon carcinoma model. Clin. Cancer Res., 2(10):739–50, 2004.
4. Dan Pelleg and Andrew Moore. X-means: Extending K-means with efficient estimation of the number of clusters. In Proc. 17th International Conf. on Machine Learning, pages 727–734. Morgan Kaufmann, San Francisco, CA, 2000.
Electrocardiographic Imaging: Towards Automated Interpretation of Activation Maps
Liliana Ironi and Stefania Tentoni
IMATI - CNR, via Ferrata 1, 27100 Pavia, Italy
Abstract. In present clinical practice, information about the heart electrical activity is routinely gathered through ECG’s, which record electrical potential from just nine sites on the body surface. However, thanks to the latest technological advances, body surface potential maps are becoming available, as well as epicardial maps obtained noninvasively from body surface data through mathematical model-based reconstruction methods. Such maps can capture a number of electrical conduction pathologies that can be missed by ECG’s analysis. But, their interpretation requires skills that are possessed by very few experts. The Spatial Aggregation (SA) approach can play a crucial role in the identification of patterns and salient features in the map, and in the long-term goal of delivering an automated map interpretation tool to be used in a clinical context. In this paper, the focus is on epicardial activation isochrone maps. The salient features that characterize the heart electrical activity, and visually correspond to specific geometric patterns, are defined, extracted from the epicardial electrical data, and finally made available in an interpretable form within a SA-based framework. Keywords: imaging, qualitative reasoning, spatial reasoning, electrocardiology.
1 Introduction
During the last decade, noninvasive functional imaging techniques, such as computerized tomography and magnetic resonance, have increasingly replaced pure anatomical imaging for medical diagnosis, as they are capable of providing both very detailed anatomical images and spatio-temporal measures of physiological parameters that characterize the activity of different organ areas. In Electrocardiology, unfortunately, similar directly applicable techniques are not yet available. Nevertheless, the heart electrical function may be noninvasively evaluated by reconstructing spatio-temporal information of the epicardial activity from body surface mapping. The research effort currently devoted to the development of novel methods for electrocardiographic imaging [1] is strongly motivated by the intrinsic diagnostic limitations of traditional electrocardiograms (ECG’s), for which, however, an interpretative rationale is well-established. In fact, ECG’s provide only a low resolution projection on the chest surface of the heart electrical activity: on the one hand, due to the distance of the electrodes from the cardiac bioelectric sources, the informative content of the electrical signals recorded on the chest is necessarily weak. On the other hand, when probing is limited to a small number of sites on the chest, as in clinical ECG’s protocols, a few electrical conduction pathologies (arrhythmias, infarcts, Wolff-Parkinson-White syndrome, just to cite some) may remain undetected.
A higher resolution projection of the cardiac electrical activity is obtained by body surface mapping (BSM): electrical potential is simultaneously recorded from a few hundred sites on the entire chest surface over a complete heart beat [2]. However, most of the information useful for localizing anomalous conduction sites is obtained when mappings of the significant physical variable values are given, and visualized, as close as possible to the heart, where such phenomena originate, and where any necessary surgical intervention has to be extremely focused. Thanks to the latest advances in scientific computing, given as inputs (i) body surface potentials and (ii) the geometric relationship between the chest and the heart, epicardial electrical data may be noninvasively reconstructed by using mathematical models and numerical inverse procedures [3, 4]. Moreover, mathematical models are crucial to highlight, through numerical simulation, the links between the observable patterns (effects) and the underlying bioelectric phenomena (causes). Although still progressively being improved, the interpretative rationale for electrocardiac maps defined by expert electrocardiophysiologists with the helpful support of applied mathematicians is significant enough to be actually used. However, its introduction into clinical practice is not yet at hand because the ability to both extract salient visual features from electrocardiographic maps and relate them to the underlying complex physiological phenomena still belongs to very few experts [2]. Hence the need to bridge the gap between the established research outcomes and clinical practice. This paper describes a piece of work that fits into a long-term research project aimed at delivering an automated electrocardiac map interpretation tool to be used in a clinical context. To this end, Qualitative Reasoning (QR) methodologies and the Spatial Aggregation (SA) approach can play a crucial role in the identification of spatiotemporal patterns and salient features in the map. Recall that the application of QR methods is not new in Electrocardiology, as demonstrated by a number of automated interpretation tools for traditional ECG’s [5, 6, 7, 8]. SA is a computational framework specifically designed for reasoning about spatially distributed data [9, 10, 11], and provides a suitable ground to capture spatiotemporal adjacencies at multiple scales. Its hierarchical strategy in aggregating spatial objects to abstract a field at different levels emulates the way experts usually perform imagistic reasoning and reason about fields, that is (1) searching for regularities, and (2) abstracting structural information about the underlying physical processes. In outline, SA transforms a numeric input field into a multi-layered symbolic description of the structure and behavior of the physical variables associated with it. This results from successive transformations of lower-level objects into more and more abstract ones by exploiting distinctive qualitative equivalence properties shared by neighbor objects. This paper focuses on activation time maps at the epicardial level, as they are a synthetic representation of the spatio-temporal aspects of the propagation of the electrical excitation.
It describes how the most significant features that characterize such phenomena and reveal either their normality or abnormality, such as wavefront breakthrough and extinction regions, minimum and maximum propagation velocity, are defined within the SA conceptual framework, abstracted from the epicardial electrical data, and made readily available for reasoning tasks.
2 Describing Ventricular Excitation Through Features Abstracted from the Activation Map
Experimental and model-based studies recently carried out show that the spread of the excitation within the heart is not uniform: both anisotropic conductivity properties and the fiber structure of the tissue affect the wavefront propagation. To investigate this spatio-temporal process electrocardiologists use a well-established parameter that is also important for the diagnosis of cardiac rhythm, namely the activation time.
Definition 1. Let x be a point of the myocardium Ω ⊂ R3. The activation time τ(x) is the instant at which the excitation front reaches x, causing it to depolarize.
Definition 2. An activation map is a contour map of the activation time built on a reference surface, where each contour line aggregates all and only the points that depolarize at the same instant.
Since wavefront propagation is a 3D process, which is quite difficult to visualize within the volume Ω, activation maps are built on reference surfaces, usually the external/internal boundaries of Ω (epicardial/endocardial surfaces, respectively), but also transverse and longitudinal intramural sections. The activation time can be either experimentally measured by advanced optical techniques, or computed from the epicardial potential data when these are available over a whole beat. Activation maps contain a lot of information about the wavefront structure and propagation: subsequent isochrones represent the wavefront kinematics as a sequence of snapshots. The isochrone distributions are complex, with several distinct areas showing different propagation patterns that the expert analysis may reveal: the locations on the considered surface where the wavefront breaks through and vanishes, the local fiber direction, the wavefront propagation pathways, and the regions with high, low or null conductivity. Thus, such kinds of maps have a clear and strong diagnostic value: by comparison with a nominal activation map, anomalous conduction patterns and regions with altered conductivity can be easily detected and classified. Herein, we consider simulated data obtained by one of the sophisticated mathematical models of the ventricular excitation that take into account both fiber architecture
Fig. 2. (a) Anterior and posterior orthogonal projections of the activation map (solid thick lines) drawn on the external boundary of the 3D mesh (thin lines). (b) Cylindrical projection of the same isomap; maximum velocity propagation pathways (thick dashed lines) from wavefront breakthrough (A) to extinction (B) locations are sketched
and conduction anisotropy available in the literature [12, 13, 14, 15]. Figure 1 illustrates a 3D simplified ventricular geometry. Its discretization was carried out by 90 horizontal sections, 61 angular sectors on each section, and 6 radial subdivisions of each sector. A numerical simulation, based on an anisotropic bidomain model of the ventricle tissue, was carried out on this mesh [15], and for each node xi the activation time τi was computed to provide us with input data.
Fig. 1. Model ventricle 3D geometry: the mesh is shown on the most external and internal ventricle layers
2.1 The Feature Extraction Problem
Contour lines are the first result from processing the input activation time data. Figure 2 highlights the pieces of information that we want to extract from contour maps. Figure 2(a) shows the anterior and posterior orthogonal projections of the activation map on the external ventricle boundary. In order to have a unique global view with minimal spatial distortion, we consider a cylindrical projection of this map (Fig. 2(b)). Reasoning on panel (b), the expert would: i) identify the point A where the excitation starts from, ii) identify regions where contours are more scattered/dense as regions where propagation is faster/slower, iii) sketch the maximum velocity propagation pathway towards the site B where the excitation vanishes. Therefore, the feature extraction problem
is equivalent to the following one: given the activation time field, build an activation map, and search for those geometric patterns or spatial objects that characterize salient aspects of the wavefront propagation process. More precisely,
– given in INPUT:
• the discretized geometry data, i.e. the set of the surface mesh nodes Ωh = {xi}, i = 1..N,
• the activation data {τi}, i = 1..N, where τi = τ(xi),
• a time step ∆τ to uniformly scan the time range [0, T];
– provide as OUTPUT:
• the sequence of wavefront snapshots: Ik = {x | τ(x) = k∆τ}, k = 1, .., nτ,
• the wavefront breakthrough region: Rb = {x | τ(x) = min τ},
• the wavefront extinction region: Re = {x | τ(x) = max τ},
• the propagation velocity patterns.
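A direct reading of this specification can be sketched as follows; the array layout of the mesh nodes, the half-step tolerance used to collect nodes onto a level, and the function name are assumptions of the example rather than part of the original system.

import numpy as np

def wavefront_features(nodes, tau, dt):
    # nodes: (N, 3) surface mesh node coordinates; tau: (N,) activation times; dt: scanning step.
    T = tau.max()
    n_levels = int(np.floor(T / dt))
    # I_k = set of nodes whose activation time falls onto level k*dt (within half a step)
    snapshots = {k: nodes[np.abs(tau - k * dt) <= dt / 2.0] for k in range(1, n_levels + 1)}
    Rb = nodes[tau == tau.min()]   # breakthrough region: earliest activated nodes
    Re = nodes[tau == tau.max()]   # extinction region: last activated nodes
    return snapshots, Rb, Re

The propagation velocity patterns, the fourth output, are derived later (Section 4) from the gradient of τ.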
2.2 The Spatial Aggregation Framework
The problem above, i.e. the extraction of both wavefront structure and propagation from raw epicardial data, is solved through a sequence of intermediate representations that gradually identify the geometric patterns, the spatial relations between them, and the global dynamical behavior. The adopted ontological framework is the one underlying the Spatial Aggregation approach: geometric patterns, or spatial objects, are built up from a given input field by applying an iterative procedure that transforms lower-level abstract objects, called spatial aggregates, into ones at a higher abstraction level. Neighborhood relations play a crucial role in extracting the necessary structural and behavioral information for performing a specific task: on the one hand, intra-relations bind a set of contiguous spatial aggregates into a single object; on the other hand, inter-relations highlight the connectivity and interactions between the spatial objects aggregated at the previous level. The former kind of relation is called strong adjacency, and the latter one weak adjacency. In outline, the overall process iterates three main steps: aggregate, classify, and redescribe. The aggregate procedure makes the spatial contiguity between field primitive objects explicit by encoding it in a neighborhood graph (n-graph). Then, the application of a strong adjacency relation on contiguous elements represented by the n-graph defines equivalence classes characterized by a distinctive property. The equivalence classes are finally transformed into new higher-level spatial objects through the redescribe operator. The three steps are repeated until the desired structural and behavioral information is obtained. The hierarchical structure of the whole set of built spatial objects allows us to state a bi-directional mapping between higher- and lower-level aggregates, and thus it facilitates the identification of the piece of information relevant for a specific task.
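One iteration of the aggregate-classify-redescribe loop can be sketched generically; the graph library, the predicate signatures and the pairwise contiguity test are illustrative assumptions, not the SA implementation used by the authors.

import networkx as nx

def sa_iteration(objects, contiguous, strongly_adjacent, redescribe):
    # objects: list of the current spatial objects (hashable);
    # contiguous(a, b): neighborhood test; strongly_adjacent(a, b): equivalence-inducing predicate;
    # redescribe(class_): builds the higher-level object from one equivalence class.
    g = nx.Graph()                                     # aggregate: encode contiguity in an n-graph
    g.add_nodes_from(objects)
    g.add_edges_from((a, b) for i, a in enumerate(objects)
                     for b in objects[i + 1:] if contiguous(a, b))
    strong = nx.Graph()                                # classify: keep only strong adjacencies
    strong.add_nodes_from(objects)
    strong.add_edges_from((a, b) for a, b in g.edges if strongly_adjacent(a, b))
    # redescribe: each connected component of strong adjacencies becomes a higher-level object
    return [redescribe(list(c)) for c in nx.connected_components(strong)]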
3 Extracting Wavefront Structure from Epicardial Data
The isochrone shapes and distributions built on ventricular external and internal surfaces, and on intramural sections, define the wavefront structure. Then, its reconstruction turns into 2D contouring problems.
Fig. 3. Isopoints (dots) and their ngraph (thin solid lines)
Fig. 4. Isochrones abstracted as strong adjacencies between isopoints
3.1 From Epicardial Data to Activation Isochrone Maps
Proper definitions and algorithms to soundly tackle the contouring task for generic geometrical domains within the SA framework have been given and discussed in [16]. In outline, SA contouring is performed in four main steps:
1. Pre-processing of the activation data to generate the set of isopoints P for the required levels. The set P is built by comparison of the values at the mesh nodes with the required levels, and by linear interpolation of the mesh nodal values.
2. Definition of the spatial contiguities between isopoints, i.e. construction of the n-graph NP (Fig. 3). The resulting n-graph must ensure that the spatial contiguity of points in NP also respects their nearness in terms of the associated functional values: a Delaunay triangulation is accordingly adjusted to guarantee a proper representation.
3. Classification of the contiguous isopoints represented in NP. P is partitioned into equivalence classes that are built by applying a strong adjacency relation based on topological adjacency properties rather than on a metric distance.
4. Construction of the isocurves (Fig. 4). The equivalence classes defined at the previous step are redescribed as polylines, whose vertices are isopoints and whose edges are instances of the strong adjacency relation holding between them.
Let us observe that a single wavefront snapshot Ik may consist of several connected components (see, for example, the isocurve labelled 80 in Fig. 4). Thus, Ik = ∪_{ik=1,..,nk} Ik^{ik}, where nk is the number of connected components.
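Step 1 (isopoint generation) amounts to linear interpolation along the mesh edges that straddle a required level; the edge-array layout below is an assumption of this sketch.

import numpy as np

def isopoints_for_level(nodes, tau, edges, level):
    # nodes: (N, 3) coordinates; tau: (N,) activation times;
    # edges: (E, 2) node-index pairs of the surface mesh; level: required isochrone value.
    i, j = edges[:, 0], edges[:, 1]
    ti, tj = tau[i], tau[j]
    crossing = (np.minimum(ti, tj) <= level) & (level <= np.maximum(ti, tj)) & (ti != tj)
    w = (level - ti[crossing]) / (tj[crossing] - ti[crossing])   # interpolation weight along each edge
    return (1.0 - w)[:, None] * nodes[i[crossing]] + w[:, None] * nodes[j[crossing]]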
Fig. 5. (a) Spatial, and (b) symbolic representation of NI
3.2 Spatial Adjacency Relations Between Isochrones
To identify the salient features that characterize wavefront propagation, steps analogous to 2-4 given above are iterated until the relevant pieces of information are made available at the desired high-level as aggregate objects. Then, after isochrones Ik have been abstracted by exploiting both contiguity and strong adjacency between isopoints, the next step deals with the construction of a neighborhood graph, NI , that encodes curves contiguity. To this end, a straightforward strategy consists in exploiting weak adjacency relations between isochrone constituent isopoints.
Definition 3. Given the set of isopoints P and their n-graph NP, we say that x, y ∈ P are weakly adjacent if they are contiguous within NP but not strongly adjacent.
Definition 4 (n-graph of isochrones). Two isochrones I and I′, with respective time labels τ and τ′, are contiguous if: 1) |τ − τ′| = ∆τ, and 2) there exists at least one pair of weakly adjacent isopoints x, y, where x ∈ I and y ∈ I′.
Spatial contiguity between isocurves is then represented by NI that encodes any one of these weak connections. Figure 5 depicts a spatial (panel A) and a symbolic (panel B) representation of NI . In the latter representation, graph nodes represent isochrone connected components, and edges state neighborhood relations between them. Let us emphasize that the graph NI encodes both a spatial contiguity relation and a temporal order between isochrones.
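Assembling NI can be sketched directly from Definitions 3 and 4: an edge joins two isochrones whenever their time labels differ by ∆τ and at least one pair of their isopoints is weakly adjacent. The dictionary-based data structures are assumptions of the example.

def isochrone_n_graph(np_edges, strong_edges, point_to_curve, curve_time, dt):
    # np_edges, strong_edges: sets of frozenset({p, q}) isopoint pairs (NP and its strong subset);
    # point_to_curve: isopoint -> isochrone component; curve_time: isochrone -> time label; dt: step.
    weak = np_edges - strong_edges                        # Definition 3: contiguous but not strongly adjacent
    ni_edges = set()
    for edge in weak:
        p, q = tuple(edge)
        c1, c2 = point_to_curve[p], point_to_curve[q]
        if c1 != c2 and abs(abs(curve_time[c1] - curve_time[c2]) - dt) < 1e-9:   # Definition 4
            ni_edges.add(frozenset((c1, c2)))
    return ni_edges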
4 Extracting Wavefront Propagation from Isochrone Maps
The ordered time sequence of the wavefront snapshots, their velocity properties, as well as the breakthrough and extinction regions are the features that define the wavefront propagation as it is observed on the considered surface.
4.1 Breakthrough and Extinction Regions
The breakthrough and extinction regions where excitation arises and, respectively, vanishes are easily characterized as the subsets Rb and Re of Ωh which are earliest and last activated.
Let us define a quantity space of the time variable: Qτ = {τmin τmed τmax }, and a mapping q : [0, T ] → Qτ . Let us consider the lowest-level spatial objects defined by the original mesh nodes Ωh = {xi }. Then, the mesh itself defines the spatial contiguity between points, i.e. the related n-graph, and new spatial objects are abstracted by applying the following strong adjacency relation:
Definition 5. xi, xj ∈ Ωh are similarly-activated if they are contiguous within the mesh AND qτ(xi) = qτ(xj). As a consequence, Rb and Re are the redescribed objects that correspond to the earliest-activated (τmin) and last-activated (τmax) classes. In Figure 6 such regions are labelled A and B, respectively. In this case, B is one single point, located on the top edge of the surface boundary beyond the last contour, while A consists of a whole mesh element whose vertices coincide with the sites where the electrical stimulus was applied.
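The redescription of Rb and Re can be sketched as follows: mesh nodes are mapped onto the quantity space and contiguous, similarly-activated nodes are grouped; the tolerance used for the τmin/τmax bins and the adjacency data structure are assumptions of the example.

import numpy as np

def breakthrough_and_extinction(tau, mesh_adjacency, eps=1e-6):
    # tau: (N,) activation times; mesh_adjacency: dict node -> iterable of neighbouring nodes.
    q = np.ones(len(tau), dtype=int)                   # quantity space: 0 = tau_min, 1 = tau_med, 2 = tau_max
    q[np.abs(tau - tau.min()) <= eps] = 0
    q[np.abs(tau - tau.max()) <= eps] = 2
    seen, regions = set(), []
    for start in range(len(tau)):
        if start in seen:
            continue
        seen.add(start)
        comp, stack = [start], [start]
        while stack:                                   # group contiguous, similarly-activated nodes
            a = stack.pop()
            for b in mesh_adjacency.get(a, ()):
                if b not in seen and q[b] == q[a]:
                    seen.add(b)
                    comp.append(b)
                    stack.append(b)
        regions.append((int(q[start]), comp))
    Rb = [comp for val, comp in regions if val == 0]   # earliest-activated components (breakthrough)
    Re = [comp for val, comp in regions if val == 2]   # last-activated components (extinction)
    return Rb, Re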
4.2 Propagation Velocity Bands: A Qualitative Approach
As the spatio-temporal progression of the isochrones, encoded by NI, is available, we can identify another important feature of the excitation process, that is the wavefront velocity patterns. This is achieved in two main steps: (1) by identifying isochrone segments characterized by the same qualitative velocity value, and (2) by classifying them in accordance with their propagation direction. Let us remind that the wavefront motion is a 3D process; therefore the velocity considered here is the wavefront apparent surface velocity along the outward normal to the front. If x ∈ Ik, its velocity v(x) is directed as the outward normal to Ik, and has a magnitude v(x) = 1/|∇τ(x)|, where ∇τ(x) is the gradient of τ(x) and is numerically approximated.
Fig. 6. A, B respectively denote breakthrough and extinction regions. Wavefront fragments qualitatively characterized by νmax are represented by the set of velocity vectors plotted in their constituent isopoints. Maximum propagation velocity bands are filled in gray.
Algorithm 1. The wavefront fragment extraction algorithm. Let us define a quantity space Qv = {νmin, νmed, νmax} of qualitative values that the velocity magnitude may assume. For each given isochrone I:
1. Denote by v(I) the velocity magnitude range.
2. Define a mapping µI : v(I) → Qv.
3. Consider the isochrone constituent points, whose spatial contiguity is encoded by their strong-adjacency graph NP|I, and define the following relation between them: Definition. ∀x, y ∈ I contiguous within NP|I, we say that they have the same velocity if µI(v(x)) = µI(v(y)).
4. Apply the above relation, build velocity equivalence classes, and call each new object a front fragment.
5. Repeat points 1-4 for all isochrones.
6. Denote by W the set of all the generated front fragments.
A front fragment is spatially represented by the set of velocity vectors associated with its constituent points. Spatial contiguity of front fragments is inherited from NI and encoded in NW by making each fragment contiguous to all the fragments of contiguous isochrones. Since the constituent points of a front fragment w have the same qualitative velocity value, we can refer to such value as the front fragment velocity magnitude νw ∈ Qv.
Algorithm 2. The propagation velocity pattern extraction algorithm.
1. Define the propagation direction of a front fragment w ∈ W as the direction of the vector u(w) := Σ_{x∈w} v(x).
2. Define a similarly advancing relation in W × W to highlight velocity homogeneities: Definition. ∀w, w′ ∈ W contiguous within NW, we say that they are related if they are advancing in a similar direction and with the same qualitative velocity magnitude: u(w) · u(w′) > 0 AND νw = νw′.
3. Build equivalence classes and abstract them as new objects, called velocity bands. Figure 6 shows the front fragments qualitatively characterized by νmax , and the resulting νmax -bands as gray-filled regions.
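The quantitative ingredients of the two algorithms can be sketched compactly: the apparent surface velocity from the activation-time gradient, the per-isochrone mapping onto Qv, and the similarly-advancing test on fragments. The uniform three-way binning of each isochrone's velocity range is an assumption of the example; the paper does not prescribe how µI is chosen.

import numpy as np

def apparent_velocity(grad_tau):
    # v(x) = 1 / |grad tau(x)|: apparent surface velocity magnitude at each isopoint.
    return 1.0 / (np.linalg.norm(grad_tau, axis=1) + 1e-12)

def qualitative_velocity(v_on_isochrone):
    # Map the velocity range v(I) of one isochrone onto Qv = {nu_min, nu_med, nu_max} (coded 0, 1, 2).
    edges = np.linspace(v_on_isochrone.min(), v_on_isochrone.max(), 4)
    return np.clip(np.searchsorted(edges, v_on_isochrone, side="right") - 1, 0, 2)

def similarly_advancing(u_w1, u_w2, nu_w1, nu_w2):
    # Two contiguous front fragments belong to the same velocity band if they advance in a similar
    # direction (positive dot product of the summed velocity vectors u(w)) with equal qualitative value.
    return bool(np.dot(u_w1, u_w2) > 0) and nu_w1 == nu_w2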
5 Conclusions
This paper focuses on the extraction, from activation time data given at the surface mesh nodes, of spatial objects at different abstraction levels that correspond to salient features of wavefront structure and propagation, namely the activation map, front fragments, and propagation velocity bands. The work is done within the SA conceptual framework, and exploits both numerical and qualitative information to define a neat hierarchical network of spatial relations and functional similarities between objects. Such a network provides a robust and efficient way to qualitatively characterize spatio-temporal phenomena. In our specific case, it allows us to identify the locations where the wavefront breaks through or vanishes, its propagation patterns, and regions with qualitatively different electrical conductivity properties. Such pieces of information are essential in a clinical context to diagnose ventricular arrhythmias, as they can localize possible ectopic sites and highlight abnormal propagation of the excitation wavefront, such as slow conduction, conduction block, and reentry. With the aim of building a diagnostic tool, more
work also needs to be done to identify such phenomena, and to extract additional temporal information from sequences of isopotential maps built from epicardial potential data. Further work will design and implement methods to automatically interpret activation maps, and to compare the features extracted from different raw epicardial data sets, either simulated or measured.
References 1. Ramanathan, C., Ghanem, R., Jia, P., Ryu, K., Rudy, Y.: Noninvasive electrocardiographic imaging for cardiac electrophysiology and arrythmia. Nature Medicine (2004) 1–7 2. Taccardi, B., Punske, B., Lux, R., MacLeod, R., Ershler, P., Dustman, T., Vyhmeister, Y.: Useful lessons from body surface mapping. Journal of Cardiovascular Electrophysiology 9 (1998) 773–786 3. Colli Franzone, P., Guerri, L., Tentoni, S., Viganotti, C., Baruffi, S., Spaggiari, S., Taccardi, B.: A mathematical procedure for solving the inverse potential problem of electrocardiography. Analysis of the time-space accuracy from in vitro experimental data. Math.Biosci. 77 (1985) 353–396 4. Oster, H., Taccardi, B., Lux, R., Ershler, P., Rudy, Y.: Noninvasive electrocardiographic imaging: reconstruction of epicardial potentials, electrograms, and isochrones and localization of single and multiple electrocardiac events. Circulation 96 (1997) 1012–1024 5. Bratko, I., Mozetic, I., Lavrac, N.: Kardio: A Study in Deep and Qualitative Knowledge for Expert Systems. MIT Press, Cambridge, MA (1989) 6. Weng, F., Quiniou, R., Carrault, G., Cordier, M.O.: Learning structural knowledge from the ECG. In: ISMDA-2001. Volume 2199., Berlin, Springer (2001) 288–294 7. Kundu, M., Nasipuri, M., Basu, D.: A knowledge based approach to ECG interpretation using fuzzy logic. IEEE Trans. Systems, Man, and Cybernetics 28 (1998) 237–243 8. Watrous, R.: A patient-adaptive neural network ECG patient monitoring algorithm. Computers in Cardiology (1995) 229–232 9. Yip, K., Zhao, F.: Spatial aggregation: Theory and applications. Journal of Artificial Intelligence Research 5 (1996) 1–26 10. Bailey-Kellogg, C., Zhao, F., Yip, K.: Spatial aggregation: Language and applications. In: Proc. AAAI-96, Los Altos, Morgan Kaufmann (1996) 517–522 11. Huang, X., Zhao, F.: Relation-based aggregation: finding objects in large spatial datasets. Intelligent Data Analysis 4 (2000) 129–147 12. Henriquez, C.: Simulating the electrical behavior of cardiac tissue using the bidomain model. Crit. Rev. Biomed. Engr. 21 (1993) 1–77 13. Henriquez, C., Muzikant, A., Smoak, C.: Anisotropy, fiber curvature, and bath loading effects on activation in thin and thick cardiac tissue preparations: Simulations in a three-dimensional bidomain model. J. Cardiovasc. Electrophysiol. 7 (1996) 424–444 14. Roth, B.: How the anisotropy of the intracellular abd extracellular conductivities influences stimulation of cardiac muscle. J. Mathematical Biology 30 (1992) 633–646 15. Colli Franzone, P., Guerri, L., Pennacchio, M.: Spreading of excitation in 3-D models of the anisotropic cardiac tissue. II. Effect of geometry and fiber architecture of the ventricular wall. Mathematical Biosciences 147 (1998) 131–171 16. Ironi, L., Tentoni, S.: On the problem of adjacency relations in the spatial aggregation approach. In Salles, P., Bredeweg, B., eds.: Seventheen International Workshop on Qualitative Reasoning (QR2003). (2003) 111–118
Automatic Landmarking of Cephalograms by Cellular Neural Networks
Daniela Giordano1, Rosalia Leonardi2, Francesco Maiorana1, Gabriele Cristaldi1, and Maria Luisa Distefano2
1 Dipartimento di Ingegneria Informatica e Telecomunicazioni, Università di Catania, Viale A. Doria 6, 95125 Catania, Italy {dgiordan, fmaioran, gcristal}@diit.unict.it
2 Istituto di II Clinica Odontoiatrica, Policlinico Città Universitaria, Via S. Sofia 78, 95123 Catania, Italy {rleonard, mdistefa}@unict.it
Abstract. Cephalometric analysis is a time consuming measurement process by which experienced orthodontists identify, on lateral craniofacial X-rays, the landmarks that are needed for diagnosis, treatment planning and evaluation. High speed and accuracy in the detection of craniofacial landmarks are widely demanded. A prototype system based on CNNs (Cellular Neural Networks) is proposed as an efficient technique for landmark detection. The first stage of the system evaluation assessed the image output of the CNN, to verify that it included and properly highlighted the sought landmark. The second stage evaluated the performance of the developed algorithms for 8 landmarks. Compared with the other methods proposed in the literature, the findings are particularly remarkable with respect to the accuracy obtained. Another advantage of a CNN based system is that the method can either be implemented via software, or directly embedded in hardware, for real-time performance.
1 Introduction
Cephalograms are lateral skull (craniofacial) X-radiographs. Cephalometric radiography was introduced in research and clinical orthodontics because, with skull radiographs taken under standardized conditions, the spatial orientation of different anatomical structures inside the skull can be studied more thoroughly by means of linear and angular measurements [1]. Information derived from the morphology of the dental, skeletal and soft tissue of the cranium can be used by specialists for orthodontic planning and treatment. The standard practice is to locate certain characteristic anatomical points, called landmarks, on a lateral cephalogram. Orthodontic treatment effects can be evaluated from the changes between the pre- and post-treatment measurement values. With serial cephalometric radiographs, the growth and development of the craniofacial skeleton can also be predicted and studied. The major sources of error in cephalometric analysis include radiographic film magnification, and variation in tracing, measuring, recording and landmark identification [2], [3]. The inconsistency of landmark identification depends on the experience and training of the clinician.
Automated approaches to landmark identification are widely demanded, both to speed up this time consuming manual process, which can take up to thirty minutes, and to improve the measurement accuracy, since both inter-rater and intra-rater variability of measurement errors can be quite high, especially for those landmarks that are more difficult to locate. Since the first approaches to automated landmarking, the major issues that have been pointed out are: 1) the landmark detection rate, given the great variability in morphology and anatomical structures of the skull, and in the quality of the X-rays, and 2) the accuracy level that the proposed methods are able to achieve. Accuracy is a crucial issue since measurements are considered precise when errors are within 1 mm (the mean estimation error of expert landmark identification has been reported to be 1.26 mm [4]); measurements with errors within 2 mm are considered acceptable, and are used as a reference to evaluate the recognition success rate, although the former level of precision is deemed desirable [5]. As a consequence of these requirements, the methods proposed so far have been criticized as not applicable to standard clinical practice (e.g., [6], [7]). The novelty of the approach proposed in this paper is the application of CNNs (Cellular Neural Networks) [8], which is a new paradigm for image processing, together with landmark-specific algorithms that incorporate the knowledge for point identification. The method is suitable for hardware coding for fast, real-time performance. The remainder of the paper is organized as follows. Section 2 reviews previous approaches to automated landmark identification. Section 3 outlines the fundamentals of CNNs. Section 4 illustrates the developed prototype and the CNN templates that have been used. Section 5 reports the experimental evaluation of the method and discusses the results, by comparing the achieved accuracy with the other methods. In the conclusions, factors that can lead to further performance improvements are discussed.
2 Automated Cephalometric Landmarks Identification
This section briefly reviews the approaches used for landmark detection. Because not all the mentioned studies report an assessment of their method, a detailed comparison with respect to the accuracy and number of landmarks detected, where possible, is delayed to section 4, after our approach has been illustrated. The first attempts to develop an automatic landmark detection system were based on the use of filters to minimize noise and enhance the image, on the Mero-Vassay operator for edge detection, on line-following algorithms guided by prior knowledge, introduced in the system by means of simple ad hoc criteria, and on the resolution pyramid technique to reduce processing time [9], [10], [11]. A knowledge-based system, based on a blackboard architecture, was proposed in [12]; as in the previously mentioned works, due to the rigidity of the knowledge-based rules used, the results were highly dependent on the quality of the input images. A different approach to landmark detection is based on the use of pattern-matching techniques. Mathematical morphology was used in [13] to locate the landmarks directly on the original images. To make detection more robust and accurate, the system searched for several structures, assured to maintain an almost exact geometrical relation to the landmark. However, these structures can only be determined for a small number of landmarks. Spatial spectroscopy was used in [14]. The main disadvantage of the
system is that it applies only a pattern detection step thus false detections can arise far from the expected landmark location, dramatically reducing the system accuracy. Recently, to solve this problem, neural networks together with genetic algorithms have been used to search for sub-images containing the landmarks [15]. A model of the grey-levels around each point was generated and matched to a new image to locate the points of interest. Accuracy of this method was assessed in [6]. Fuzzy neural networks have been used in [16] and in [17] (in conjunction with template matching), whereas [18] have used Pulse Coupled Neural Networks. Active shape models (ASMs) were used in [7]. ASMs model the spatial relationships between the important structures, and match a deformable template to the structure in an image. Accuracy is not sufficient for completely automated landmarking, but ASMs could be used as a time-saving tool to provide a first landmarks location estimate. Similarly, in [19] small deformable models are applied to areas statistically identified to predict landmark position. The method proposed in this paper is based on CNNs (Cellular Neural Networks), which are described in the next section.
3 Cellular Neural Networks
Cellular Neural Networks (CNNs) afford a novel paradigm for image processing [8]. A CNN array is an analog dynamic processor in which the processing elements interact with a finite local neighborhood, and there are feedback, but not recurrent, connections. The basic unit of a CNN is called a cell (or neuron), and contains linear and non-linear circuitry. Whereas neighboring cells can directly interact, the other ones are indirectly influenced by the propagation effect of the temporal-continuous dynamics typical of CNNs. Every neuron implements the same processing functions and this property makes CNNs suitable for many image processing algorithms, by working on a local window around each pixel. A cell in a matrix of M × N is indicated by C(i,j). The neighborhood of radius r of the interacting cells is defined as follows:

Nr(i,j) = {C(k,h) : max(|k − i|, |h − j|) ≤ r, 1 ≤ k ≤ M; 1 ≤ h ≤ N}    (1)

CNN dynamics is determined by the following equations, where x is the state, y the output, uk,h the input and I the threshold (bias):

ẋi,j(t) = −xi,j(t) + Σ_{C(k,h)∈Nr(i,j)} A(i,j,k,h) yk,h(t) + Σ_{C(k,h)∈Nr(i,j)} B(i,j,k,h) uk,h(t) + I    (2)

yi,j(t) = (1/2) (|xi,j(t) + 1| − |xi,j(t) − 1|)    (3)
A is known as feedback template and B is known as control template. The parameters of the templates A and B correspond to the synaptic weights of a neural network structure. In equation (2) we must specify the initial condition xi,j (0) for all cells and the boundary condition for those cells that interact with other cells outside the CNN array. Initial state influences only transient solution, while the boundary conditions
influence the steady state solution. With CNNs several processing tasks can be accomplished, and they can be programmed via templates; libraries of known templates for typical image processing operations are available [20]. A key advantage is that the inherently parallel architecture of the CNN can be implemented on chips, known as CNN-UM (CNN Universal Machine) chips, that allow computation times three orders of magnitude faster than classical DSPs, and on which CNN-based medical imaging can be directly implemented [21], [22]. Various CNN models have been proposed for medical image processing. In [21] the 128×128 CNN-UM chip has been applied to process X-rays, CT images and MRI of the brain in a telemedicine scenario. Aizenberg et al. [22] discuss two classes of CNNs, in particular binary CNNs and multivalued CNNs, and their use for filtering and image enhancement. Whereas classical solutions for edge detection by means of operators such as Laplace, Prewitt and Sobel have the disadvantage of being imprecise when the brightness difference between neighboring objects is small, binary CNNs can obtain precise edge detection, whereas multivalued linear filtering can be used for global frequency correction [22].
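For space-invariant templates, equations (2) and (3) can be simulated with explicit Euler integration; the SciPy-based correlation, the zero-padded boundary (a Dirichlet-like contour condition) and the default parameter values below are assumptions of this sketch, not details of the simulator described in the next section.

import numpy as np
from scipy.ndimage import correlate

def cnn_output(x):
    # Equation (3): y = 0.5 * (|x + 1| - |x - 1|), the piecewise-linear output function.
    return 0.5 * (np.abs(x + 1.0) - np.abs(x - 1.0))

def simulate_cnn(u, A, B, bias=0.0, step=0.1, cycles=60, x0=None):
    # Explicit Euler integration of equation (2) with space-invariant templates A and B.
    # u: input image scaled to [-1, 1] (-1 = white, 1 = black); x0: initial state (zero by default).
    x = np.zeros_like(u, dtype=float) if x0 is None else x0.astype(float).copy()
    Bu = correlate(u.astype(float), B, mode="constant", cval=0.0)   # B*u term is constant over time
    for _ in range(cycles):
        Ay = correlate(cnn_output(x), A, mode="constant", cval=0.0)
        x += step * (-x + Ay + Bu + bias)                           # dx/dt = -x + A*y + B*u + I
    return cnn_output(x)

With the templates of Fig. 2 below, for instance, one would call simulate_cnn(image, A, B, bias=0.0, step=0.1, cycles=30) on the normalized cephalogram and inspect the transient output.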
4 Cephalometric Landmarks Identification by CNNs
The tool developed for automated landmarking is based on a software simulator of a CNN of 512 × 480 cells, which is thus able to map an image of 512 × 480 pixels with 256 grey levels, scaled to the interval [–1, 1], where –1 corresponds to white and 1 to black. Using a fixed resolution makes it possible to use positional considerations during the detection process. Compliant input images were obtained from digital or scanned X-rays by selecting an area proportional to the required dimension (e.g., 1946×1824), using a reduction algorithm with good colour balance (e.g., smart resize), and converting the image to grey scale. The system applies different types of CNNs to the scanned cephalogram: first to pre-process the image and eliminate noise, then it proceeds with a sequence of steps in which the landmark coordinates are identified, for each point, by appropriate CNN templates followed by the application of landmark-specific algorithms. The tool has been designed to detect 12 landmarks, which are essential to conduct a basic cephalometric analysis: Menton, B point, Pogonion, M point, Upper incisal, Lower incisal, Nasion, A Point, ANS, Orbitale, Gonion, Sella. The tool interface is shown in Fig. 1. The CNNs are simulated under the following conditions: 1) Initial state: every cell has a state variable equal to zero (xij = 0); 2) Contour condition: xij = 0 (Dirichlet contour). All feedback templates (A) used in this work are symmetrical, which ensures that operations reach a steady state, although in our approach we exploit the transient solutions provided by the CNNs. For these reasons the number of cycles and the integration step become important information for point identification. The tool may run in two modalities: one is entirely automated; in the other one each point can be identified by modifying the input parameters of the CNN (templates, number of cycles, bias and integration step). This latter modality is appropriate for tuning the system because, for example, X-rays with high luminosity could need fewer cycles or a milder sharpening template. If not mentioned otherwise, we use bias = 0, integration step = 0.1 and 60 cycles. In the following we exemplify the methodology.
Fig. 1. Tool interface: the loaded X-ray with the obtained landmarks highlighted in red, intermediate construction lines, the landmark coordinates, the CNN output, the landmarks included in the CNN outputs, the identified landmarks, and the controls for template, bias, integration step and network radius selection
In order to find the landmarks we used two methods. The first method tries to find the front bone profile and picks up the points of interest along this line: Menton, B Point, Pogonion and M Point. With this method, each landmark extraction depends on the previous one. For this reason accuracy becomes worse after each extraction: the best accuracy is obtained for Menton, the worst for M Point. The effect of template application is exemplified in Figures 2 and 3, where the templates used and the CNN outputs are reported. For the next four landmarks (Upper and Lower Incisal, A Point and Nasion) the method is different, since we try to find each one almost independently from the others. Since the variability of the incisor configuration is high, in this prototype we have worked with the configuration where the upper incisor protrudes with respect to the lower one. After noise removal two processes are possible: the first one uses the left control template, the second one uses the right control template reported in Fig. 4. In both cases we used bias = –0.1, integration step = 0.2 and 60 cycles. With these templates we try to light up both incisors.
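As a purely hypothetical illustration of how a landmark-specific rule of the first method might pick a point from the CNN output, the sketch below takes the lowest highlighted pixel of the binarized output as a menton candidate; the threshold and the "lowest point" criterion are assumptions made for the example, not the rules actually implemented in the prototype.

import numpy as np

def lowest_profile_point(cnn_out, threshold=0.0):
    # cnn_out: CNN output in [-1, 1] (1 = black); returns the (row, col) of the lowest active pixel.
    rows, cols = np.nonzero(cnn_out > threshold)
    if rows.size == 0:
        return None
    k = int(np.argmax(rows))          # largest row index = lowest point in image coordinates
    return int(rows[k]), int(cols[k])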
A = [ 0 0 0 0 0; 0 0 0 0 0; 0 0 1 0 0; 0 0 0 0 0; 0 0 0 0 0 ];   B = [ 1 1 1 1 1; 0 0 0 0 0; 0 0 0 0 0; 0 0 0 0 0; −1 −1 −1 −1 −1 ]
Fig. 2. Templates and CNN output for menton (n.cycles=30)
A = [ 0 0 0; 0 1 0; 0 0 0 ];   B = [ 1 1 0; 1 0 −1; 0 −1 −1 ]
Fig. 3. Templates and CNN output for chin curvature (n.cycles=80)
B (left control template) = [ 0 0 0 0 0; 0 0 0 0 0; 1 0 0 0 −1; 1 0 0 0 −1; 0 0 0 0 0 ];   B (right control template) = [ 0 0 0 0 0; 0 0 0 0 0; 2 0 0 0 −2; 2 0 0 0 −2; 0 0 0 0 0 ]
Fig. 4. Control Templates and CNN output for incisors (n.cycles=60)
The extraction algorithm used is more robust and more efficient if the upper incisor edge is higher than the lower incisor edge. In order to find Nasion we used four templates (Fig. 5). The first two are used for X-rays with good contrast: the first one when the nasion appears white and the second one when the nasion appears black. The second pair of templates was used for contour sharpening. In all cases we used bias = –0.1, integration step = 0.2 and 60 cycles. To extract A Point we started from the knowledge of the Upper Incisal, which is a landmark extracted quite accurately. In this case, template selection is guided by a local evaluation of the X-ray brightness. After simulation with the chosen template we used two algorithms to find the point; the decision about which point to choose in this case is left to the user and his or her knowledge of the morphology.
White template: B = [ 1 0 0 0 −1; 0 1 0 −1 0; 1 0 0 0 −1; 0 0 0 0 0; 0 0 0 0 0 ];   Sharp White template: B = [ −2.5 0 0 0 −2; 0 −2.5 0 −2 0; −2.5 0 0 0 −2; 0 0 0 0 0; 0 0 0 0 0 ]
Black template: B = [ 2 0 0 0 2.5; 0 2 0 2.5 0; 2 0 0 0 2.5; 0 0 0 0 0; 0 0 0 0 0 ];   Sharp Black template: B = [ −3 0 0 0 3; 0 −3 0 3 0; −3 0 0 0 3; 0 0 0 0 0; 0 0 0 0 0 ]
Fig. 5. Control Templates and CNN output for Nasion
5 Experimental Evaluation
Twelve landmarks were chosen for preliminary assessment of the prototype, and a set of 97 digital X-rays was landmarked by an expert orthodontist, who consulted with another independent expert on dubious cases. Assessment was carried out in two stages. The first stage assessed the image output of the CNN to verify that it included the sought landmark. This was done by visual inspection by the same expert who landmarked the X-rays. Over the 97 cases, 29 cases (30%) led to CNN outputs in which some edges were overly eroded. This implies that the number of processing cycles in these cases needs to be reduced. The second stage evaluated the performance of the developed algorithms for 8 landmarks, on a sample of 26 cases (50%) randomly selected from a new set obtained from the previous one after eliminating the cases that had not been taken into consideration by the algorithms (e.g., very young patients with deciduous teeth, cases with inverse bite or overbite). The coordinates of each point found by the program were compared to the expert landmarking, and the Euclidean distance of the found landmark from the reference one was computed. Following the standard, points with a distance of less than 2 mm were considered successfully located. Table 1 reports the findings, specifying for each point the percentage of cases located within 1 mm precision, and those falling in the range 1 to 2 mm. To facilitate comparison with other experimental findings, the distribution of the points in the surroundings of the valid interval, i.e. from 2 to 3 mm, and over, is also reported. To compute the mean error for some points the outliers were eliminated. The error distribution is not normal but highly skewed towards the lower side of the range; however, outliers that were within the range typically found in the literature were left, which leads to a conservative estimation. Thus the overall success rate of our method is estimated in two cases, i.e., a) the reduced sample, and b) the overall sample size. The first one is suitable if some way of automatically detecting outliers is devised, so that the program does not offer a solution instead of offering an imprecise one. Table 1 reports the results.
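The success criterion used in the evaluation can be stated compactly as below; the pixel-to-millimetre scale factor and the array layout are assumptions of this sketch.

import numpy as np

def landmark_accuracy(found_px, reference_px, mm_per_pixel):
    # found_px, reference_px: (M, 2) landmark coordinates in pixels over M cases.
    d = np.linalg.norm((found_px - reference_px) * mm_per_pixel, axis=1)   # Euclidean error in mm
    return {
        "mean_error_mm": float(d.mean()),
        "within_1mm": float((d <= 1.0).mean()),    # fraction of precisely located points
        "within_2mm": float((d <= 2.0).mean()),    # standard success rate criterion
    }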
The overall mean success rate is 85%. Best detection performances were on menton and on upper incisor, and lowest performances were A point and B point. Remarkably, 73% of the found landmarks were within 1 mm precision.

Table 1. Experimental results for 8 landmarks (the ">2;≤3 mm" and ">3 mm" columns give the imprecise cases)

Landmark        N.   Mean error (mm)   MD        SD     ≤1 mm   >1;≤2 mm   >2;≤3 mm   >3 mm   Success Rate   Success Rate (overall sample)
Upper incisor   25   .48               .25       .60    88%     8%         4%         -       96%            92%
Lower incisor   23   .92               .67       .94    66%     26%        4%         4%      92%            81%
Nasion          24   1.12              .76       1.11   70%     17%        -          13%     87%            81%
A Point         24   1.34              1.06      .82    58%     21%        17%        4%      79%            73%
Menton          26   .62               .33       .82    85%     7%         4%         4%      92%            92%
B Point         24   2.00              .42       3.3    71%     8%         -          21%     79%            73%
Pogonion        26   .87               4.0E-02   1.34   73%     8%         8%         11%     81%            81%
M Point         26   1.25              .33       1.68   69%     8%         8%         15%     77%            77%
6 Discussion
There are some methodological issues that must be taken into account in carrying out a comparison with the findings reported in the literature. There are differences in the sample, in the set of points chosen, and also in the way of reporting. However, some considerations can be made. With respect to the studies that report the error distribution, a major finding of this study is that we have improved detection performance. For example, the study of Liu et al. [6] found that the errors of their automatic landmarking system were statistically different from those made by the expert for identifying A point, B point, pogonion, menton, and the upper and lower incisal edges. In particular they report the following mean errors: A point 4.29±1.56; B point 3.69±1.55; Pogonion 2.53±1.12; Menton 1.90±0.57; Upper incisal 2.36±2.01; Lower incisal 2.86±1.24. With the exception of the B point, our findings are comparable to the distribution of the human mean error reported in the same study; moreover, there is a remarkable difference in the percentage of cases with mean error less than 1 mm, which in our case is 73% and in [6] is 8%. Our mean errors are also smaller compared to Rudolph et al. [14], who report the following mean errors: A point 2.33±2.63; B point 1.85±2.09; Pogonion 1.85±2.26; Menton 3.09±3.46; Upper incisal 2.02±1.99; Lower incisal 2.46±2.49. Table 2 summarizes the comparisons.
Table 2. Comparison of the accuracy of approaches to automated landmarking

Work and Ref.                      Sample size   N. landmarks and accuracy                                   Techniques
Parthasarathy et al. (1989) [10]   5             9 landmarks, 58% < 2mm (18% < 1mm), mean error: 2.06 mm     Resolution pyramid, knowledge-based line extractor
Tong et al. (1990) [11]            5             17 landmarks, 76% < 2mm, mean error: 1.33 mm                Resolution pyramid, edge enhancement, knowledge-based extraction
Cardillo et al. (1994) [13]        40            20 landmarks, 75% < 2mm, mean error: not reported           Pattern matching
Rudolph et al. (1998) [14]         14            15 landmarks, 13% < 2mm, mean error: 3.07 mm                Spatial spectroscopy, statistical pattern recognition
Liu et al. (1999) [6]              38            13 landmarks, 23% < 2mm (8% < 1mm), mean error: 2.86 mm     Multilayer perceptron, genetic algorithms
Hutton et al. (2000) [7]           63            16 landmarks, 35% < 2mm (13% < 1mm), mean error: 4.08 mm    Active shape models
El-Feghi et al. (2003) [16]        200           20 landmarks, 90% < 2mm, mean error: not reported           Fuzzy neural network
Innes et al. (2002) [18]           109           3 landmarks, 72% < 2mm, mean error: not reported            PCNN: pulse coupled neural networks
Our work                           26            8 landmarks, 85% < 2mm (73% < 1mm), mean error: 1.07 mm     Cellular Neural Networks, knowledge-based landmark extraction
7 Conclusions
There are basically two classes of methods to tackle cephalometric landmark detection: edge-based and region-based. Appraisal of previous works does not demonstrate a superiority of one over the other; rather, they have relative advantages for certain landmarks, and indeed integration has been suggested [6]; however, in both cases, precision is not suitable for their application to clinical practice. This paper has shown that a method based on CNNs and landmark-specific search algorithms affords a better, suitable precision. Furthermore, CNNs are versatile enough to be used also for the detection of landmarks that are not located on edges, but on regions, e.g., the Sella; thus the approach is unifying, and lends itself to implementation on a CNN-UM. Future research will be devoted to finding suitable templates and algorithms for the detection of other landmarks that are used in more sophisticated cephalometric analyses, and to making the overall approach more robust by prior classification of the X-rays based on the morphologies of key anatomical structures, i.e., nasion, maxilla and symphysis, and bite typologies, and on classification of the X-rays with respect to brightness. This information is useful both for a targeted selection and/or tuning of the CNN template(s) to apply before launching the landmark extraction algorithm, and for optimization of these extraction algorithms.
References 1. Broadbent, B.H.: A New X-Ray Technique and its Application to Orthodontia. Angle Orthod. 51 (1981), 93-114 2. Kenneth, H.S., Tsang, S., Cooke, M.S.: Comparison of Cephalometric Analysis Using a Non-Radiographic Sonic Digitizer. European Journal of Orthodontics 21 (1999) 1-13 3. YiJ.Chen, S.K.Chen ,H.F.Chang, Chen, K.C.: Comparison of Landmark Identification in Traditional vs Computer-Aided Digital Cephalometry. Angle orthod. 70 (2000) 387-392 4. Baumrind, S., Frantz, R.C.: The Reliability of Head Film Measurements Landmark Identification. American Journal Orthod. 60, 2 (1971) 111-127 5. Rakosi, T.: An Atlas and Manual of Cephalometric Radiography. Wolfe Medical Publications, London, 1982 6. Liu, J., Chen, Y., Cheng, K.: Accuracy of Computerized Automatic Identification of Cephalometric Landmarks. American Journal of Orthodontics and Dentofacial Orthopedics 118 (2000) 535-540 7. Hutton, T.J., Cunningham. S., Hammond, P.: An Evaluation of Active Shape Models for the Automatic Identification of Cephalometric Landmarks. European Journal of Orthodontics, 22 (2000), 499-508 8. Chua, L.O., Roska, T.: The CNN Paradigm. IEEE TCAS, I, 40 (1993), 147-156 9. Levy-Mandel, A.D., Venetsamopolus, A.N., Tsosos, J.K.: Knowledge Based Landmarking of Cephalograms. Computers and Biomedical Research 19, (1986) 282-309 10. Parthasaraty, S., Nugent, S.T., Gregson, P.G., Fay, D.F.: Automatic Landmarking of Cephalograms. Computers and Biomedical research, 22 (1989), 248-269 11. Tong, W., Nugent, S.T., Jensen, G.M., Fay, D.F.: An Algorithm for Locating Landmarks on Dental X-Rays. 11th IEEE Int. Conf. on Engineering in Medicine & Biology (1990) 12. Davis, D.N., Taylor, C.J.: A Blackboard Architecture for Automating Cephalometric Analysis. Journal of Medical Informatics, 16 (1991) 137-149 13. Cardillo, J., Sid-Ahmed, M.A.: An Image Processing System for Locating Craniofacial Landmarks. IEEE Trans. On Medical Imaging 13 (1994) 275-289 14. Rudolph, D.J., Sinclair, P.M., Coggins, J.M.: Automatic Computerized Radiographic Identification of Cephalometric Landmarks. American Journal of Orthodontics and Dentofacial Orthopedics 113 (1998) 173-179 15. Chen, Y., Cheng, K., Liu, J.: Improving Cephalogram Analysis through Feature Subimage Extraction. IEEE Engineering in Medicine and Biology (1999) 25-31 16. El-Feghi, I.; Sid-Ahmed, M.A.; Ahmadi, M.: Automatic Localization of Craniofacial Landmarks for Assisted Cephalometry. Circuits and Systems, 2003. ISCAS '03. Proc. International Symposium on, 3 (2003), 630-633 17. Sanei, S., Sanaei, P., Zahabsaniesi, M.: Cephalograms Analysis Applying Template Matching and Fuzzy Logic. Image and Vision Computing 18 (1999) 39-48 18. Innes, A., Ciesilski, V., Mamutil, J., Sabu, J. : Landmark Detection for Cephalometric Radiology Images Using Pulse Coupled Neural Networks. In H. Arabnia and Y.Mun (eds.) Proc. Int. Conf. on Artificial Intelligence, 2 (2002) CSREA Press 19. Romaniuk, M., Desvignes, M., Revenu, M., Deshayes, M.J. : Linear and Non-Linear Models for Statistical Localization of Landmarks. Proceedings IEEE 16th International Conference on Pattern Recognition 4 (2002) 393-396 20. Roska, T. et alii: CSL CNN Software Library. 1999 Budapest, Hungary 21. Szabo, T., Barsi, P., Szolgay, P. Application of Analogic CNN Algorithms in Telemedical Neuroradiology. Proc. 7th IEEE International Workshop on Cellular Neural Networks and their Applications, 2002. (CNNA 2002). 579 – 586 22. 
Aizemberg, I, Aizenberg, N., Hiltner, J., Moraga, C., Meyer zu Bexten, E.: Cellular Neural Networks and Computational Intelligence in Medical Image Processing. Image and Vision Computing 19 (2001) 177-183
Anatomical Sketch Understanding: Recognizing Explicit and Implicit Structure
Peter Haddawy1, Matthew Dailey2, Ploen Kaewruen1, and Natapope Sarakhette1
1 Asian Institute of Technology
2 Sirindhorn International Institute of Technology, Thammasat University
Abstract. Sketching is ubiquitous in medicine. Physicians commonly use sketches as part of their note taking in patient records and to help convey diagnoses and treatments to patients. Medical students frequently use sketches to help them think through clinical problems in individual and group problem solving. Applications ranging from automated patient records to medical education software could benefit greatly from the richer and more natural interfaces that would be enabled by the ability to understand sketches. In this paper we take the first steps toward developing a system that can understand anatomical sketches. Understanding an anatomical sketch requires the ability to recognize what anatomical structure has been sketched and from what view (e.g. parietal view of the brain), as well as to identify the anatomical parts and their locations in the sketch (e.g. parts of the brain), even if they have not been explicitly drawn. We present novel algorithms for sketch recognition and for part identification. We evaluate the accuracy of the recognition algorithm on sketches obtained from medical students. We evaluate the part identification algorithm by comparing its results to the judgment of an experienced physician.
1 Introduction
Sketching is ubiquitous in medicine. Physicians commonly use sketches as part of their note taking in patient records and to help convey diagnoses and treatments to patients. Medical students frequently use sketches to help them think through clinical problems and to facilitate communication with other students when participating in group problem solving. Applications ranging from automated patient records to medical education software could benefit greatly from the richer and more natural interfaces that would be enabled by the ability to understand sketches. Our particular interest in sketch understanding stems from our work on the COMET collaborative intelligent tutoring system for medical problem-based learning (PBL) [12]. COMET provides a collaborative environment in which students from disparate locations can work together to solve clinical reasoning problems. It generates tutorial hints by using models of individual and group problem solving. The system provides a multi-modal interface that integrates text and graphics so as to provide a rich communication channel
between the students and the system, as well as among students in the group. While COMET has already proven itself useful [13], it still does not support the full range of interaction that occurs in human-tutored PBL sessions. In particular, it does not support interaction through sketches. From observation of PBL sessions at Thammasat University Medical School we have found that students typically sketch anatomical structures on the white board while solving a problem. The sketches are used to help think through the problem and as an artifact to support communication among the students. Consider the following scenario: A group of students in a PBL session is given a problem concerning unconsciousness due to a car accident. One student sketches the brain. Thinking about direct impact to the head, another student annotates the sketch to indicate a contusion in the area where the frontal lobe should be, although the frontal lobe was not explicitly drawn. The tutor understands this annotation and encourages the students to also consider damage to the brain stem by pointing to that part of the sketch and saying “think about what is going on here as well”. Supporting this kind of interaction requires several capabilities. First is the ability to recognize what anatomical structure or structures have been sketched and from what perspective (e.g. parietal view of the brain). Next is the ability to identify anatomical parts of the sketched structure (e.g. frontal lobe of the brain), even if they have not been explicitly drawn. Finally is the ability to understand annotations on the sketch and to be able to effectively use the sketch as a medium of communication in a dialogue. In this paper we address the first two issues. We present a novel approach to sketch recognition that combines the use of shape context matching [3] together with continuous Naive Bayes classification. The approach is robust and is insensitive to scaling. Next we present an algorithm that uses shape context matching in yet another way to identify the parts of the anatomical structure. The algorithm works even if the proportions in the sketch are not anatomically correct and whether or not the anatomical parts have been explicitly drawn. We evaluate the sketch recognition algorithm on a collection of sketches by medical students of various views of the brain, heart, and lungs. Our algorithm achieves a recognition accuracy of 73.6%, far above the baseline random classification accuracy of 12.5%. We evaluate the part identification algorithm by comparing its results to those of an experienced physician. Location, orientation, size, and shape of the parts identified by the physician and the algorithm are in close agreement.
2 Related Work
The last few years have seen a tremendous increase in interest in sketch-based interfaces. Applications include computer-aided design, knowledge acquisition, and image retrieval. Researchers in this area emphasize that the informal nature of sketches is important because it communicates the fact that the ideas being represented are still rough and thus invites collaboration and modification.
Clean, precise-looking diagrams created by most graphics programs can produce an impression of more precision than was intended and can lead to a feeling of commitment to a sketch as originally drawn [9, 7]. We now discuss a few systems that are representative of the state-of-the-art. The Electronic Cocktail Napkin [7] is a general-purpose sketching program that provides trainable symbol recognition, parses configurations of symbols and spatial relations, and can match similar figures. It recognizes a symbol by comparing its features — pen path, number of strokes and corners, and aspect ratio — with a library of stored feature templates. Applications developed using the system include a visual bookmark system, an interface to simulation programs, and an HTML layout design tool. SILK [10] is a sketching tool for developing user interfaces. SILK recognizes seven basic widgets, as well as combinations of widgets. To recognize a widget, SILK first identifies primitive components using a statistical classifier learned from examples. SILK recognizes four single-stroke primitive components: rectangle, squiggly line, straight line, and ellipse. Once components are identified, they are passed to an algorithm that detects spatial relationships among primitive and widget components. These include containment, closeness, and sequence. SILK finally uses a set of rules to identify widgets from primitive components. In an evaluation with twelve users, SILK achieved a widget recognition accuracy of 69%. SILK supports use of five single-stroke gestures for editing sketches: cross, circle, squiggly line, spiral, and angle (for insertion). Designers can create storyboards by drawing arrows from any screen’s graphical objects, widgets, or background to another screen. SILK has a run mode in which it can simulate the functioning of the widgets and the transitions between screens. ASSIST [2] supports sketching and simulation of simple 2-dimensional mechanical systems. ASSIST recognizes the user’s sketch by identifying patterns that represent mechanical parts, leveraging off the fact that mechanical engineering has a fairly concrete visual vocabulary for representing components. ASSIST uses a three-stage procedure to choose the most likely interpretation for each stroke. First it matches the stroke to a set of templates to produce the set of possible interpretations, e.g. circle or rectangle. Next it ranks the interpretation using heuristics about drawing style and mechanical engineering. Finally, the system chooses the best consistent overall set of interpretations and displays this to the user. ASSIST supports editing of the sketch through the use of gestures. At any time during the design process, the user can run a simulation of the design being sketched. In an effort to attain immediate practical functionality as well as broad domain independence, Forbus and Usher [5] take a very different approach to sketching. Their sKEA system does not address the recognition issue, focusing rather on qualitative reasoning about the spatial relations among objects and on analogical comparison of sketches containing multiple objects. They avoid the recognition problem by requiring the user to indicate when he begins and finishes drawing a new object as well as the interpretation of the object. The interpretation is selected from a pull-down menu.
The work reported in this paper is the first application of sketch-based interfaces to intelligent tutoring that we know of, and also the first in a medical domain other than image retrieval [1]. The motivation behind the use of sketching in medical tutoring is similar to that previously mentioned, namely that sketching supports collaboration and encourages modification. But in addition, sketching in medical PBL is valuable because it gives students practice in recalling anatomical structure. A menu-based drawing interface would not provide such practice. The issues involved in recognizing anatomical sketches are significantly different from those of recognizing design diagrams. Most of the previous work in sketching starts by recognizing primitive components such as lines, circles, and corners. This works fine for domains such as mechanical engineering and user interface design, but anatomical sketches are rather amorphous complex structures which may be sketched with more or less detail. This complexity and lack of a well-defined set of primitive components demands a very different approach to object recognition. Fortunately, the anatomical recognition problem is eased by the fact that by convention 2-dimensional depictions of anatomical structures are only shown from eight standard views. We have five external views corresponding to the sides of a cube: anterior, posterior, superior, inferior, lateral (2 sides); and three internal views corresponding to the three cutting planes: sagittal, coronal, axial. This fact is exploited by our recognition algorithm, described next.
3 Recognizing Structure and Parts
We call our prototype system UNAS, for UNderstanding Anatomical Sketches. (The name also alludes to Unas, the last king of the 5th dynasty of ancient Egypt; the interpretation of the bas-relief scenes on the inside of his tomb remains a challenge to this day.) We divide the task of understanding a sketch into two subtasks: identifying what the sketch portrays, then identifying the relevant parts of the sketch. Without constraints, this problem would be extremely difficult, if not impossible. Fortunately, the fact that 2-dimensional anatomical sketches are always drawn from one of eight standard views allows us to cast the problem of identifying what a sketch portrays as a classification problem: given an image of a sketch I, find the class y = f(I) ∈ {1, . . . , K} to which the image belongs. The set of possible classes corresponds to the set of standard views of anatomical structures, e.g., “parietal view of the brain” and “internal view of the lungs.” With enough labeled examples {(I_1, y_1), . . . , (I_m, y_m)}, it is possible to construct a classifier ŷ = h(I) that predicts the unknown true class y = f(I) given a previously unseen I. Once we assume the class y that sketch I belongs to, we must then segment the sketch into regions corresponding to anatomical parts. Since every instance of a standard anatomical view contains the same parts, the task is well-defined: attach a label z ∈ {1, . . . , L_y} to every pixel in I. Here the set of possible labels corresponds to the set of anatomical parts normally visible in view y, e.g., “temporal lobe” and “parietal lobe.” In the preliminary experiments reported upon in this paper, we have made the following simplifying assumptions:

– Each image I contains exactly one anatomical structure, e.g. brain, lungs, heart.
– Sketches may not contain annotations or extraneous parts.
– Each sketch is complete (there are no major parts left out).

In future work, we plan to relax all of these assumptions. For the classification problem, we take the Bayesian maximum a posteriori (MAP) approach: measure a finite set of features x_1, . . . , x_n from I, then select the class

ŷ = \arg\max_y P(y \mid x)
where P(y | x) ∝ P(x | y)P(y). P(x | y) is the likelihood of feature vector x given class y, and P(y) is the prior probability of class y. We estimate the parameters of the statistical model P(x | y) from training data, and in the experiments reported in this paper, we assume uniform priors P(y). In some contexts, such as a PBL session, however, the priors could be chosen to reflect our prior knowledge that, for example, in a head injury case study, sketches of the brain are more likely than sketches of the lungs. The MAP classifier just described requires a set of features and a model for the data likelihood. In our scheme, feature x_i for sketch I is the dissimilarity between I and template image T_i according to Belongie et al.'s Shape Context measure [3]. Ideally, the set of templates T_i contains several examples of each class. Our model for the data likelihood is the well-known Naive Bayes model

P(x \mid y) = \prod_i P(x_i \mid y).
The model is “naive” in that it assumes the feature values x_i are statistically conditionally independent given y, even though they generally are not. Once our Naive Bayes classifier picks the best class ŷ for a given input sketch I, the next step is to segment the sketch into regions. Our system first warps the input sketch into correspondence with a pre-labeled canonical template T*_ŷ for class ŷ, assigns labels to sketch points using the labels in T*_ŷ, finds the boundary of each region, then labels each pixel in the sketch according to which region it falls into. We compute point correspondences between I and T*_ŷ using (once again) Belongie et al.'s Shape Context algorithm [3]. We then use the point correspondences to estimate a mapping between arbitrary points in the sketch and template using the Thin Plate Spline (TPS) model [4]. To identify the boundary of each region, we transfer the labeled points from T*_ŷ to I and then connect those points using a simple traveling salesperson algorithm [8]. In the rest of this section, we describe the sketch classification and segmentation algorithms in more detail.
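Since the per-feature likelihoods are modeled as Gaussians (Sect. 3.1), the classification step reduces to summing per-feature Gaussian log-likelihoods and taking the arg max. The following is a minimal sketch of that decision rule, assuming the dissimilarity features have already been computed; all names and array shapes are illustrative rather than taken from the UNAS implementation.

```python
import numpy as np

def map_classify(x, mu, sigma, log_prior=None):
    """Return the MAP class index for a vector x of template dissimilarities.

    mu, sigma: (K, n) arrays of per-class Gaussian parameters for the
    n dissimilarity features; log_prior: (K,) array, uniform if None.
    """
    K, _ = mu.shape
    if log_prior is None:
        log_prior = np.zeros(K)  # uniform priors P(y)
    # Naive Bayes: sum of per-feature Gaussian log-likelihoods for each class
    log_lik = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)
    return int(np.argmax(log_lik.sum(axis=1) + log_prior))
```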
3.1 Sketch Classification
As previously described, the basic features in our Naive Bayes classifier are dissimilarities between the input sketch image I and each of a set of template images T_i. The particular dissimilarity measure we use is Belongie et al.'s Shape Context (SC) measure [3]. SC represents a shape as a set of points sampled from the shape's contours. Each sample point is represented by a coarse histogram of the other points surrounding it. To determine the dissimilarity of two shapes, SC first finds a correspondence between the sampled points in the two shapes. Then the total dissimilarity between the shapes is simply the sum of the dissimilarities of the sample points. For each template image, we convert the raw grayscale or color image to a line drawing. In either case, we randomly sample N_s points from the resulting “edge” image. For each point p_i, we obtain the SC histogram by counting the number of pixels falling into N_b log-polar bins around p_i and then normalizing the bin counts (so the sum of the bin counts is 1). The bin widths are adjusted to be proportional to the mean squared distance between points, to make the resulting histograms invariant to the scale of the image. For a new sketch, we perform the same sampling and SC histogram computation steps, then find the optimal correspondence between the sketch sample points and the template's sample points. The dissimilarity between two normalized SC descriptors is simply the χ² test statistic. Given the (square) dissimilarity matrix for the sketch and template SC histograms, the optimal correspondence is the permutation of the sketch points minimizing the summed dissimilarity of the matched points. This corresponds to a weighted bipartite graph matching problem and is solved in O(N_s³) time using the Hungarian method [11]. Once we obtain the optimal assignment, the final dissimilarity x_i between sketch I and template T_i is the sum of the N_s individual point-matching costs. After computing the dissimilarities x_i between sketch I and templates T_i, UNAS forms the feature vector x = [x_1, . . . , x_n]^T, which is then input to the sketch classifier. UNAS assumes each of the probability densities P(x_i | y) used in the Naive Bayes classifier is a Gaussian with mean µ_{y,i} and standard deviation σ_{y,i}. The classifier's 2nK parameters µ_{y,i}, σ_{y,i} are estimated directly from a training set containing an equal number of example sketches from each class. Once UNAS obtains the MAP estimate ŷ for the class of I, the next step is to segment the sketch into regions corresponding to anatomical parts. We describe the details of the segmentation procedure next.
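As a concrete illustration, the sketch below computes simplified log-polar Shape Context histograms, the χ² cost matrix, and the optimal point assignment with the Hungarian method. It is a hedged sketch, not the UNAS implementation: the bin layout, the normalization by mean (rather than mean squared) distance, and all parameter values are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def shape_context(points, n_r=5, n_theta=12):
    """Log-polar histogram descriptor for each of the (n, 2) sampled points."""
    n = len(points)
    d = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    mean_d = d[d > 0].mean()                                  # scale normalization
    r_edges = np.logspace(np.log10(0.125), np.log10(2.0), n_r + 1) * mean_d
    theta = np.arctan2(points[:, None, 1] - points[None, :, 1],
                       points[:, None, 0] - points[None, :, 0])
    hist = np.zeros((n, n_r * n_theta))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            rb = np.searchsorted(r_edges, d[i, j]) - 1        # radial bin
            if 0 <= rb < n_r:
                tb = int((theta[i, j] + np.pi) / (2 * np.pi) * n_theta) % n_theta
                hist[i, rb * n_theta + tb] += 1
    return hist / np.maximum(hist.sum(axis=1, keepdims=True), 1)

def sc_dissimilarity(sketch_pts, template_pts):
    """Chi-square cost matrix, Hungarian assignment, and summed matching cost."""
    h1, h2 = shape_context(sketch_pts), shape_context(template_pts)
    cost = 0.5 * ((h1[:, None] - h2[None, :])**2 /
                  (h1[:, None] + h2[None, :] + 1e-9)).sum(axis=2)
    rows, cols = linear_sum_assignment(cost)                  # Hungarian method
    return cost[rows, cols].sum()
```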
3.2 Sketch Segmentation
As previously described, the first step in segmentation is to align sketch I with the canonical labeled template T*_ŷ for class ŷ. To align the sketch with the template, UNAS first uses Shape Context as described above to find a set of N_s point correspondences (x_i, y_i) ↔ (x'_i, y'_i) between the sketch and the template. These correspondences are then used to fit a thin plate spline (TPS) model [4] mapping T*_ŷ to I. TPS fits a smooth function f_x(x, y) mapping the template
points (x_i, y_i) to the x coordinates x'_i of the sketch points, and another smooth function f_y(x, y) mapping the template points to the y coordinates y'_i of the sketch points. The fitted functions f_x and f_y model the deformation of thin steel plates constrained to interpolate the observed values x'_i and y'_i, respectively. However, since sampling introduces noise, and the Hungarian assignment method does not attempt to impose any spatial regularity constraints, strict interpolation is not desirable. Belongie et al. [3] introduce a regularization factor into the minimization that penalizes excessively warped transformations. The quality of the final transform can be iteratively improved by repeating the correspondence estimation and transform estimation steps, using the results of the previous step as a starting point. In our experiments, we iterate the process 6 times. The result is a smooth mapping from every point in T*_ŷ to a point in I. The canonical templates T*_y are derived from drawings in medical atlases, for which the ground truth segmentation is known. When we sample and compute the SC histograms for each canonical template, we also associate (by hand) a set of labels with each sampled point. The labels indicate which regions (anatomical parts) each point belongs to. Since the sampled points correspond to edges in the original image, they often delineate boundaries between two regions; in these cases, the points are assigned the labels of both regions. We initiate the segmentation process by simply copying the labels of the template points (x_i, y_i) to the corresponding points (f_x(x_i, y_i), f_y(x_i, y_i)) in I. Now the task is to use these points to compute a closed boundary for each region of I. Under certain conditions described by Giesen [6], solutions to the traveling salesperson tour problem (TST) accomplish exactly this task. We use Giesen's insight for curve reconstruction in UNAS. For each anatomical part label z_i ∈ {1, . . . , L_ŷ} for view ŷ, UNAS collects the set of projected boundary points for region z_i and runs a traveling salesperson algorithm [8] to “connect the dots.” The result is a simple polygon approximating the boundary of region z_i in I. The final step, after the region boundaries have been determined, is to use those boundaries to assign a unique label z to each pixel of I. UNAS tests each pixel for membership in each polygonal region using the technique of segment intersections: if an arbitrary ray from pixel p intersects an odd number of the polygon's sides, it is inside the polygon; otherwise, it is outside the polygon. This concludes our description of the classification and segmentation algorithms employed by UNAS. In the next section, we describe an empirical evaluation of the approach.
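The pixel-labeling test described above is the classic even-odd (ray casting) rule. A minimal, self-contained sketch of it is shown below; the function name and the horizontal-ray choice are illustrative assumptions, not details taken from UNAS.

```python
def point_in_polygon(px, py, poly):
    """Even-odd rule: cast a horizontal ray from (px, py) towards +x and
    count how many polygon edges it crosses; an odd count means inside."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]          # next vertex, wrapping around
        if (y1 > py) != (y2 > py):          # edge straddles the ray's y level
            # x coordinate where the edge crosses the horizontal line y = py
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside
```

A pixel would then be assigned the label of the region whose boundary polygon contains it.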
4 Evaluation
We evaluated the sketch classification algorithm by building a Naive Bayes classifier for the brain, heart, and lungs and evaluating its accuracy in classifying sketches. We chose the following eight views:
Fig. 1. Medical student sketches (first column) with corresponding segmentations produced by a physician (second column) and by UNAS (third column). The templates used by the segmentation algorithm are shown in the upper right corners
– Brain: parietal (lateral), sagittal, basal (inferior)
– Heart: anterior, posterior, interior (coronal)
– Lung: anterior, interior (coronal)

These views were chosen because they are the standard views from which these organs are typically drawn. We collected 300 sketches from 48 medical students in their second to sixth years of study. The sketches were vetted for quality by a physician and we eliminated those that the physician could not identify. This was done because we do not expect our recognition algorithm to perform better than an experienced physician and because low quality sketches are unlikely to be useful as templates. This left us with 272 sketches. We then chose an equal number of sketches for each view, resulting in 30 sketches for each view or a total of 240 sketches. For each view we randomly separated the sketches into 70% (21
sketches) for training and 30% (9 sketches) for testing. All 168 sketches in the training set were potential templates in the Naive Bayes classifier. To this set we added six medical atlas illustrations for each view, resulting in a total of 216 candidate templates. The 168 sketches in the training set were used to compute the means µ_{y,i} and standard deviations σ_{y,i} of the eight conditional Gaussian distributions P(x_i | y) for each candidate template T_i. Using all the templates would result in a classifier with an unacceptably slow running time and may also not yield the highest classification accuracy. So we conducted feature selection by performing a best-first search through the space of all subsets of templates. Each subset resulted in a different Naive Bayes classifier that was evaluated on a validation set using 7-fold cross-validation (to evenly divide the training set of 21 sketches per view). The best performing classifier contained 24 templates. This classifier had a total classification accuracy of 73.6% on the test set, far above the baseline random classification accuracy of 12.5%. The accuracy for each class ranged from the lowest value of 55.6% for the brain parietal and heart posterior views to the highest of 88.9% for the heart anterior and lung external views. We evaluated the sketch segmentation algorithm by comparing its segmentation to that of an experienced physician on three sketches each of the external view of the lungs and the parietal view of the brain. We chose three qualitatively different sketches for each organ. Each sketch used was correctly recognized by UNAS. The segmentation results are shown in Figure 1. The first column shows the sketches, the second column is the physician's segmentation, and the last column is UNAS's segmentation. The segmentations produced by UNAS and by the physician agree quite closely on all sketches. For example, in the first sketch of the lung (i) the student drew a protrusion below the superior lobe of the left lung that is not normally drawn in the external view. Both the physician and UNAS correctly did not include this as part of the superior lobe. The segmentations produced by UNAS differ from those of the physician in two primary respects. When internal parts are drawn slightly incorrectly, the physician still segments following the lines in the sketch. In contrast, UNAS attempts to correct the sketch. This can be seen by comparing the superior lobe of the left lung in viii and ix, and the cerebellum in viii and ix. The other difference is that because the thin-plate spline transformation is applied globally, sometimes parts get warped too much, for example the pons in segmentation vi.
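The template (feature) selection loop can be pictured as follows. This is a hedged sketch using a greedy forward variant of the search together with scikit-learn's Gaussian Naive Bayes and cross-validation utilities; the stopping rule, the 24-template cap, and all names are illustrative assumptions rather than the exact procedure used here.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def select_templates(X, y, cv=7, max_templates=24):
    """Greedy forward selection over template-dissimilarity columns of X.

    X: (n_sketches, n_candidate_templates) dissimilarity matrix, y: view labels.
    Returns the indices of the selected templates.
    """
    selected, best_score = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_templates:
        scores = [(cross_val_score(GaussianNB(), X[:, selected + [j]], y,
                                   cv=cv).mean(), j) for j in remaining]
        score, j = max(scores)
        if score <= best_score:
            break                        # no candidate improves validation accuracy
        best_score = score
        selected.append(j)
        remaining.remove(j)
    return selected
```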
5 Conclusions and Future Research
The results from our initial prototype system are encouraging but much work remains to be done in order to realize the functionality described in our motivating example. We are currently gathering sample sketches of more anatomical structures to expand the scope of UNAS. We also plan to compare the recognition accuracy and segmentation results of UNAS to those of physicians with varying levels of experience. On the algorithm side, several improvements and extensions can be made. The accuracy of the recognition algorithm can be improved. A first step is to use a more sophisticated search for feature selection
since the feature space seems to contain many local maxima. A next step is to relax the Gaussian assumption in the Naive Bayes model, but this will require more examples. Other machine learning techniques that directly use similarity information, such as support vector machines with similarity kernels, are promising and should be tried. We have assumed that a sketch includes only one anatomical structure, but sketches often contain multiple structures as well as incompletely drawn structures. Generalizing our approach to handle this will possibly require adding spatial reasoning abilities. In addition to understanding the sketch, UNAS should be able to understand annotations commonly used in medicine, such as arrows, circles, crosses, darkened regions, and clusters of dots. For this we are exploring the use of hidden Markov models, which tend to work well for such relatively simple symbols. The final step will be to integrate UNAS into the COMET intelligent tutoring system.
Morphometry of the Hippocampus Based on a Deformable Model and Support Vector Machines
Jeong-Sik Kim¹, Yong-Guk Kim¹, Soo-Mi Choi¹, and Myoung-Hee Kim²
¹ School of Computer Engineering, Sejong University, Seoul, Korea, [email protected]
² Dept. of Computer Science and Engineering, Ewha Womans University, Seoul, Korea, [email protected]
Abstract. This paper presents an effective representation scheme for the statistical shape analysis of the hippocampal structure and its shape classification: morphometry of the hippocampus. The deformable model based on the FEM (Finite Element Method) and the ICP (Iterative Closest Point) algorithm allows us to represent parametric surfaces and to normalize multi-resolution shapes. Such deformable surfaces and 3D skeletons extracted from the voxel representations are stored in an Octree data structure, which is used for the hierarchical shape analysis. We have trained an SVM (Support Vector Machine) for classifying between the control and patient groups. Results suggest that the presented representation scheme provides various levels of shape representation and that the SVM can be a useful classifier in analyzing the statistical shape of the hippocampus.
1 Introduction
The anatomical structure of the hippocampus in the human brain provides important information for medical diagnosis. For instance, it is known that an abnormal shape of the hippocampus is associated with neurological diseases such as epilepsy, schizophrenia, and Alzheimer's disease [1]. In order to estimate shape deformation of the hippocampus by computer, it is essential to select an efficient shape representation scheme. Then, a powerful classifier is used to discriminate a patient group from the normal one. In general, there are two approaches to the 3D shape analysis of the hippocampus: the global and local methods. The former approach focuses on the volume changes of the whole 3D shape and then investigates the correlation between the shape changes and the neurological disease. The latter approach is often adopted in planning a surgery and diagnosing a certain disease that may be related to a 3D shape abnormality of the hippocampus. Often, the 3D shape of a human organ is analyzed according to a skeleton-based description such as the Medial Representation (M-rep) [2] or a surface-based representation such as the Spherical Harmonic basis functions (SPHARM) [3], which
represent the spherical topology of the shape and acquire a hierarchical description of the parametric surfaces. However, this representation scheme can be applied only to objects with a spherical structure. Gerig et al. have done numerous studies (see [5] for the list) identifying statistical shape abnormalities of different neuroanatomical structures using SPHARM and M-rep. It seems that the parameterized model is useful for the statistical analysis of the shape since it is able to represent the global as well as the local shape. Such a model is constructed by using deformable modeling. Several attempts to reconstruct the shape of objects and to track their motion [6,7] are based on physics-based deformable models. The merit of such approaches stems not only from their ability to exploit knowledge about physics to disambiguate input data, but also from their capacity to interpolate and smoothen raw data based on prior assumptions on material properties, noise and so on. Once the 3D shape of an organ is represented according to the above methods, we need to assess its shape normality using machine learning, by which one can train a classifier that is able to discriminate between the control and abnormal groups. Such classifiers can be categorized into 1) parametric vs. non-parametric and 2) linear vs. non-linear. The most commonly used algorithm is PCA (Principal Components Analysis), a linear technique. Similarly, FLD (Fisher Linear Discriminant) [8] is a linear and parametric classifier. An artificial neural network based on the back-propagation algorithm can be adopted as an effective non-linear classifier. Recently, the SVM (Support Vector Machine) [9] has turned out to be a powerful classifier since it is guaranteed to converge to an optimal solution even for a small training set. In this paper, we use a deformable model for accurate similarity estimation of the global shape changes of the hippocampus, and the skeleton extracted from the hippocampal volume is used for the local shape analysis. Then, an SVM is adopted as the classification method between normal controls and neurological patients. The rest of this paper is organized as follows. Section 2 describes the surface and skeletal representation for 3D shape analysis of hippocampal structures and Section 3 shows the hierarchical shape similarity. Section 4 illustrates the implementation of the SVM-based classifier. Experimental results and discussion are given in Section 5 and some conclusions and future work are given in Section 6.
2 The Surface and Skeleton Representation for Shape Analysis
In this section, we describe how to represent the shape of the hippocampal structure based on a deformable model and a skeletal representation. Initially, we segment the hippocampal structure from the MRI of the brain. Then, we reconstruct the deformable model by fitting it to the surface points extracted from the images. The 3D skeletal representation is generated by using depth-buffer based voxelization [10], which makes it easier to extract a skeleton as well as to relate to
the original medical images. Both representations are used for the measurement of shape similarity explained in Section 3.
2.1 The Deformable Surface Representation
The procedure for creating the deformable model for the hippocampal structure is illustrated in Fig. 1.
Fig. 1. Overall procedure for creating the deformable model
First, we find the center of mass of the hippocampal structure and the principal axes of its surface points. Then we create an initial superellipsoid and triangulate it. Here, we use a single 3D blob element as suggested in [11]. This element has the same size and orientation as the initial reference shape. The triangulation is carried out with the desired resolution and then we adopt the mesh vertices as the FEM nodes. Using Galerkin's surface interpolation approach [12], we can find the relationship between the blob element's displacements at any point and the nodal point displacements. This alleviates problems caused by irregular sampling of feature points. We use a 3D Gaussian function as the shape function, which allows us to generate the deformed shape efficiently in an iterative way. Eq. (1) gives the 3D Gaussian basis interpolation functions:

h_i(x, y, z) = \sum_{k=1}^{n} q_{ik}\, g_k(x, y, z), \qquad g_i(x, y, z) = e^{-\left[(x - x_i)^2/2\sigma_x^2 + (y - y_i)^2/2\sigma_y^2 + (z - z_i)^2/2\sigma_z^2\right]}   (1)
where n is the number of FEM nodes and q_{ik} are the coefficients satisfying the interpolation property [12]. In order to achieve physics-based shape deformation, we compute the mode shape vectors, which are derived from the equilibrium equation simulating the dynamic behavior of an object; they are the generalized eigenvectors of the dynamic equilibrium equation without damping:

M \ddot{U} + K U = 0, \qquad U = \phi \sin \omega(t - t_0)   (2)
where U is a 3n×1 vector of the (∆x, ∆y, ∆z) displacements of the n nodal points relative to the object's center of mass; φ, t, and t_0 are a vector of order n, the time variable, and the time constant, respectively; and ω is a constant representing the frequency of vibration of the vector φ. M and K are 3n × 3n matrices describing the mass and material stiffness, respectively. Eq. (2) may be interpreted as assigning a certain mass to each nodal point and a certain material stiffness between nodal points, without damping. The 3D mass matrix M can be computed directly from the interpolation matrix H, where ρ is the mass density. The 3D stiffness matrix K is calculated by Eq. (3), where B is the strain displacement matrix and C is the material matrix. The strain displacement matrix B is obtained by appropriately differentiating and combining rows of the element interpolation matrix H. The material matrix C expresses the material's particular stress-strain law.

M = \int_V \rho H^T H \, dV \qquad \text{and} \qquad K = \int_V B^T C B \, dV   (3)
Eq. (2) can be transformed into a form that is not only less costly but also allows a closed-form solution by mode superposition. To diagonalize the equation, the modal matrix Φ is used. U is transformed into the modal displacements Ũ by U = ΦŨ. Because M and K are normally symmetric positive definite, they can be diagonalized by Φ. The mode shape vectors form an orthogonal object-centered coordinate system for describing feature locations. We transform the mode shape vectors into nodal displacement vectors and iteratively compute the new positions of the points on the deformable model using the 3D Gaussian interpolation functions. We obtain the final deformed positions when the energy value falls below a threshold or the iteration count exceeds a threshold.
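In practice, the mode shape vectors and frequencies are the solutions of the generalized eigenproblem Kφ = ω²Mφ obtained from Eq. (2). Below is a minimal sketch of that step, assuming the mass and stiffness matrices have already been assembled; the truncation to a fixed number of low-frequency modes is an illustrative assumption.

```python
import numpy as np
from scipy.linalg import eigh

def vibration_modes(M, K, n_modes=12):
    """Lowest-frequency mode shapes of the undamped system M u'' + K u = 0.

    Solves K phi = omega^2 M phi; eigh returns eigenvalues in ascending
    order, so the leading columns are the low-frequency (global) modes.
    """
    eigvals, Phi = eigh(K, M)            # generalized symmetric eigenproblem
    omega = np.sqrt(np.clip(eigvals, 0.0, None))
    return omega[:n_modes], Phi[:, :n_modes]
```

Keeping only the low-frequency modes amounts to describing the global shape with a few smooth deformations, which is what makes the modal representation convenient for statistical shape modeling.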
2.2 The 3D Skeletal Representation
3D skeletonization is a process for replacing a complex 3D model with a discrete structure that provides a low-level representation. Since the multi-resolution skeletons are extracted from intermediate binary voxels using depth-buffers [10], it is relatively easy to extract a skeleton as well as to compare it to the original medical image. Our method consists of four steps. First, we divide the voxel space into 2D slices along the y-axis. We then compute the centroid of the object boundary for each slice, and interpolate the initial centroids in order to generate uniformly spaced skeleton points. Finally, we fit the skeleton and the mesh model using the skeleton normalization method. Fig. 2 shows the algorithm for the skeletonization process. In Fig. 3, (a) shows the depth buffer images (or depth-maps). A depth buffer image is generated for each face of the bounding box of a 3D model by parallel-projecting the object onto it. The left two columns are six depth-maps of the left hippocampus of a normal control and the right two columns are six depth-maps of an epileptic patient. Fig. 3 (b) shows the binary voxel images of the left hippocampus of a normal
Fig. 2. The algorithm for the skeletonization process
Fig. 3. The result of the skeletonization of the hippocampus: (a) depth-maps; (b) binary voxels; (c) superimposition of the surface models and the skeletons
Fig. 4. The result of multiresolutional voxels (a) and multiresolutional skeletons (b)
control (left) and an epileptic patient (right), respectively. Fig. 3 (c) shows the result of comparing two surface models of the hippocampus with their skeletons. In order to reduce the computation time for constructing the shape representation as well as for estimating the shape similarity, a multi-resolution representation of the voxels and skeletons is useful. By specifying the resolution of the skeleton or voxels, we can obtain multi-resolution representations. The result is shown in Fig. 4.
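A minimal sketch of the slice-based skeleton extraction is given below. It simplifies the method by taking the centroid of all occupied voxels in each slice (rather than of the boundary) and resamples the centroid curve at uniform arc length; the array layout and parameter values are illustrative assumptions.

```python
import numpy as np

def slice_skeleton(voxels, n_points=32):
    """voxels: 3D binary array indexed as (x, y, z); returns (n_points, 3) skeleton.

    For every y-slice containing object voxels we take the centroid of the
    occupied voxels, then resample the centroid polyline at uniform arc length
    so that the skeleton points are uniformly spaced.
    """
    centroids = []
    for y in range(voxels.shape[1]):
        xs, zs = np.nonzero(voxels[:, y, :])
        if xs.size:
            centroids.append((xs.mean(), float(y), zs.mean()))
    c = np.asarray(centroids, dtype=float)
    seg = np.linalg.norm(np.diff(c, axis=0), axis=1)   # segment lengths
    s = np.concatenate(([0.0], np.cumsum(seg)))        # cumulative arc length
    t = np.linspace(0.0, s[-1], n_points)              # uniform sampling positions
    return np.column_stack([np.interp(t, s, c[:, k]) for k in range(3)])
```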
3 The Hierarchical Shape Similarity
To reduce the computation time for the similarity estimation and to capture local shape differences in a hierarchical fashion, the deformable surface and skeletal representations are integrated in the Octree structure. The Octree is a data structure for representing objects in 3D space, automatically grouping them hierarchically and avoiding the representation of empty portions of the space. Given the reconstructed parametric representation, it has to be placed into a canonical coordinate system, where the position and orientation are normalized. To normalize the position and rotation, we adapt the Iterative Closest Point (ICP) algorithm proposed by Zhang et al. [13], in which a free-form curve representation is applied to the registration process.
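For orientation, a minimal ICP sketch is shown below: nearest-neighbour correspondences alternate with a closed-form (SVD-based) rotation estimate. It is a simplification of the free-form-curve variant of [13], with all names and the fixed iteration count chosen for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_align(src, dst, n_iter=30):
    """Rigidly align point set src (N, 3) to dst (M, 3); returns the aligned src."""
    src = src - src.mean(axis=0)             # position normalization
    dst_c = dst - dst.mean(axis=0)
    tree = cKDTree(dst_c)
    for _ in range(n_iter):
        _, idx = tree.query(src)              # nearest-neighbour correspondences
        H = src.T @ dst_c[idx]                # cross-covariance of matched points
        U, _, Vt = np.linalg.svd(H)
        d = np.sign(np.linalg.det(Vt.T @ U.T))
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T   # closed-form rotation (Kabsch)
        src = src @ R.T                       # rotate the source towards the target
    return src
```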
Fig. 5. Shape similarity estimation: (left) the deformable shape; (middle) local shape similarity using skeleton point picking; (right) Octree-based hierarchical similarity
After normalizing the shape, we estimate the shape difference by computing the distance between the sample meshes extracted from the deformable meshes using the L2 norm metric. The L2 norm is a metric that computes the distance between two 3D points by Eq. (4), where x and y represent the centers of corresponding sample meshes.

L_2(x, y) = \left( \sum_{i=0}^{k} (x_i - y_i)^2 \right)^{1/2}   (4)
Fig. 5 shows the results of the shape similarity computation between the hippocampal structures of a normal control and an epilepsy patient. Fig. 5 (left) shows the parametric representation of the hippocampus. Fig. 5 (middle) and (right) show how to compare two hippocampal shapes based on the proposed Octree and skeletal scheme. It is possible to reduce the computation time for comparing two 3D shapes by picking a certain skeletal point (Fig. 5 (middle)) or by localizing an Octree node (Fig. 5 (right)) and separating it from the remaining parts. It is also possible to analyze a region in more detail by expanding the resolution of the Octree, since it has a hierarchical structure. The result of the shape comparison is displayed on the surface of the target object using color coding.
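The first Octree level can be pictured as grouping corresponding sample-mesh centers by octant of the normalized bounding box and averaging their L2 differences, which yields per-region values of the kind reported in Table 1. The sketch below is only an illustration under that assumption; the mapping of octants to the labels A–H is not specified here and is chosen arbitrarily.

```python
import numpy as np

def octant_differences(ref_pts, target_pts):
    """Mean L2 difference per octant of the reference bounding box.

    ref_pts, target_pts: (N, 3) arrays of corresponding sample-mesh centers
    in the normalized coordinate frame; returns a dict keyed 'A'..'H'.
    """
    center = (ref_pts.min(axis=0) + ref_pts.max(axis=0)) / 2.0
    dist = np.linalg.norm(ref_pts - target_pts, axis=1)   # Eq. (4) per point pair
    bits = (ref_pts > center).astype(int)                 # side of the center per axis
    octant = bits[:, 0] * 4 + bits[:, 1] * 2 + bits[:, 2]
    return {chr(ord('A') + k):
            float(dist[octant == k].mean()) if np.any(octant == k) else 0.0
            for k in range(8)}
```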
4 An SVM-Based Classifier
Once the feature vectors are extracted from the deformable model, they can be used to analyze the shape differences between populations, for example, normal controls and epilepsy patients. In this section, we briefly describe our approach, which is based on a discriminative modeling method using the SVM [9]. First, we train a classifier for labeling new examples into one of the two groups. Each training data set is composed of coordinates for the deformable meshes. We then extract an explicit description of the differences between the two groups captured by the classifier. This method detects statistical differences between the two populations. In order to acquire the optimal solution, it is important to select a good classifier function. The SVM is known to be robust and free from the over-fitting problem. Given a training data set {(x_k, y_k), 1 ≤ k ≤ n}, where x_k are observations and y_k are the corresponding group labels, and a kernel function K : R^n × R^n → R, the SVM classification function is

y(x) = \sum_{k=1}^{n} a_k y_k K(x, x_k) + b   (5)
where the coefficients a_k and b are determined by solving a quadratic optimization problem that is constructed by maximizing the margin between the two classes. For non-linear classification, we employ the commonly used polynomial kernel K(x, x_k) = (x · x_k + 1)^d (the kernel K of two objects x and x_k is the inner product of their vectors in the feature space, and the parameter d is the degree of the polynomial). In order to estimate the accuracy of the resulting classifier and decide the optimal parameters in the non-linear case, we use cross-validation. To obtain error, recall, and precision, we evaluate the performance for three types of SVM kernels (i.e. polynomial, RBF, sigmoid) and the linear case. Results are described in Section 5.
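A minimal sketch of this kernel comparison, written with scikit-learn, is shown below. The feature layout, the degree d = 3, and the chosen scorers are illustrative assumptions; the library's polynomial kernel with coef0 = 1 matches (x · x_k + 1)^d up to its internal gamma scaling.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_validate

def evaluate_kernels(X, y, cv=10):
    """Compare linear and non-linear SVM kernels with cross-validation.

    X: (n_subjects, n_features) skeleton-to-surface distance features,
    y: group labels (0 = normal control, 1 = epilepsy patient).
    """
    kernels = {
        "linear":  SVC(kernel="linear"),
        "poly":    SVC(kernel="poly", degree=3, coef0=1.0),  # ~ (x.x_k + 1)^d
        "rbf":     SVC(kernel="rbf"),
        "sigmoid": SVC(kernel="sigmoid"),
    }
    results = {}
    for name, clf in kernels.items():
        scores = cross_validate(clf, X, y, cv=cv,
                                scoring=("accuracy", "recall", "precision"))
        results[name] = {m: float(np.mean(scores["test_" + m]))
                         for m in ("accuracy", "recall", "precision")}
    return results
```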
5 Experimental Results
We wish to determine whether a 3D hippocampus shape extracted from MR brain images belongs to the normal control group or to the epilepsy group. First, we experimented with the reconstruction method on point clouds obtained from the meshes. A surface representation is acquired through the marching cubes algorithm, and the deformable modeling algorithm is applied to the 3D mesh models. We implemented the SVM-based classifier and estimated its performance. For these experiments, we collected two template 3D models (a normal control and an epileptic patient) from real MRI data and, to estimate the capacity of our deformable modeling method, generated 80 deformed models from them using a modeling tool. Fig. 6 shows hippocampus shapes reconstructed with different numbers of vibration modes. Our method is robust to non-uniformly distributed data. Fig. 7 shows the deformable models generated at different iteration stages.
Fig. 6. The deformable model reconstruction of the hippocampus depending on the vibration modes
Fig. 7. The result of the deformable models generated by different iteration stage: (left) iteration=1; (middle) iteration=2; (right) iteration=3
We measured 80 Euclidean distances between skeleton points and sampled surface points as the feature vector set for discriminating between the normal and epilepsy hippocampus groups. In our experiment, the 80 training examples are composed of 40 normal cases and 40 abnormal cases. In order to implement the SVM classifiers, four kernels were tested: linear, RBF, polynomial, and sigmoid functions. The cross-validation technique is used in order to overcome the problem of a small training set, so we were able to execute 10 learning and 10 test processes from the 80 training examples (learning: 72 cases, testing: 8 cases). Fig. 8 illustrates the result of the training test for the four different kernels. Results suggest that the polynomial kernel outperformed the others. Table 1 summarizes the local shape differences obtained by comparing the 3D shapes between the reference models (PL and NR) and the deformed targets (T1–T4), respectively. PL is an abnormal left hippocampus in epilepsy and NR is a normal right hippocampus. As shown in Fig. 5, we are able to evaluate the qualitative result of the shape difference at specific regions and to control the hierarchical analysis using the Octree structure (i.e. upper-front-right, bottom-front-left, upper-back-left, and the bottom region, respectively). In Fig. 5 (right), we can confirm that node H of the Octree can be divided into more detailed regions and be analyzed in a hierarchical fashion. In Table 1, the values in bold type represent the significantly deformed areas of the hippocampus model, and so
Fig. 8. The result of the training test using SVM for four types of kernels

Table 1. The result of local shape analysis based on the Octree structure: each column (A–H) represents a sub-space of the Octree and each row shows the shape difference between each pair of models
          A     B     C     D     E     F     G     H
PL:T1   0.15  0.77  0.84  3.15  0.00  0.00  0.00  0.15
PL:T2   1.20  0.00  0.00  0.00  3.12  2.00  1.00  1.44
NR:T3   0.06  1.02  0.06  0.00  0.00  0.12  0.00  0.00
NR:T4   0.00  0.00  0.00  0.00  1.54  1.31  1.31  1.54
we observe that the similarity error at the deformed regions is higher than at the other regions. As shown in Table 1, our method is able to discriminate the global shape difference and is also able to distinguish a shape difference at a specific local region in a hierarchical fashion.
6 Conclusions
This paper presents a new method for 3D shape analysis based on a deformable model and an SVM classifier. The presented method is invariant under translation and rotation of the models and is robust to non-uniformly distributed and incomplete data sets. A modal model based on FEM physics is constructed from point cloud data using vibration modes and can be utilized for statistical shape modeling. As we integrate the parametric representation and the multi-resolution skeletons into the Octree structure, we can estimate the global and local shape
similarity of the hippocampus. In addition, the ICP-based normalization provides fast registration by a coarse-to-fine strategy and helps to overcome the alignment problem derived from the parametric shape representation. We have adopted an SVM-based classifier for the discrimination task. Results suggest that the polynomial kernel outperforms the others. To ensure a more reliable result, we need to collect more experimental data for the control subjects and the epilepsy patients.
Acknowledgements We would like to thank Prof. Seung-Bong Hong and Woo-Suk Tae in Samsung Medical Center for giving useful comments. This work was supported by grant No. (R04-2003-000-10017-0) from the Korea Research Foundation.
References 1. Dean, D., Buckley, P., Bookstein, F., Kamath, J., Kwon, D., Friedman, L., Lys, C.: Three dimensional MR-based morphometric comparison of schizophrenic and normal cerebral ventricles. Vis. In Biom. Computing, Lecture Notes in Comp. Sc., pp. 363-372, 1996. 2. Styner, M., Gerig, G., Pizer, S., and Joshi, S.: Automatic and robust computation of 3D medial models incorporating object variability. International Journal of Computer Vision 55(2/3), pp. 107-122, 2003. 3. Brechbuhler, C., Gerig, G. and Kubler, O.: Parameterization of closed surfaces for 3D shape description. Computer Vision, Graphics, Image Processing: Image Understanding, Vol. 61, pp. 154-170, 1995. 4. Kelemen, A., Szekely, G., Gerig, G.: Elastic Model based Segmentation of 3D Neuroradiological Data Sets. IEEE Transaction on Med. Im., Vol. 18, No. 10, pp. 823-839, 1999. 5. Geric, G.: http://www.cs.unc.edu/∼gerig/pub.html. 6. Gibson, SFF., Mirtich, B.: A survey of deformable modeling in computer graphics. MERL-A Mitsubishi Electric Research Laboratory, TR-97-19, 1997. 7. McInerney, T., Terzopoulos, D.: Deformable models in medical images analysis: a survey. Medical Image Analysis, Vol. 1, No. 2, pp. 91-108, 1996. 8. Fisher, R. A.: The use of multiple measurements in taxonomic problems. Annals of Eugenics, Vol. 7, pp. 179-188, 1936. 9. Vapnik, V. N.: The Nature of Statistical Learning Theory, Springer, 1995. 10. Karabassi, E. A., Papaioannou, G. and Theoharis, T.: A Fast Depth-Buffer-Based Voxelization Algorithm. Journal of Graphics Tools, ACM, Vol. 4, No. 4, pp. 5-10, 1999. 11. Choi, S.M., Kim, M.H.: Shape Reconstruction from Partially Missing Data in Modal Space. Computers & Graphics, Vol. 26, No. 5, pp. 701-708, 2002. 12. Bathe, K.: Finite Element Procedures in Engineering Analysis. Prentice-Hall: Englewood Cliffs, NJ. 13. Zhang, Z.: Iterative point matching for registration of freeform curves and surfaces. International Journal of Computer Vision, Vol. 13, No.2, pp. 119-152, 1994.
Automatic Segmentation of Whole-Body Bone Scintigrams as a Preprocessing Step for Computer Assisted Diagnostics
Luka Šajn¹, Matjaž Kukar¹, Igor Kononenko¹, and Metka Milčinski²
¹ University of Ljubljana, Faculty of Computer and Information Science, Tržaška 25, SI-1001 Ljubljana, Slovenia, {luka.sajn, matjaz.kukar, igor.kononenko}@fri.uni-lj.si
² University Medical Centre in Ljubljana, Department for Nuclear Medicine, Zaloška 7, SI-1525 Ljubljana, Slovenia, [email protected]
Abstract. Bone scintigraphy or whole-body bone scan is one of the most common diagnostic procedures in nuclear medicine used in the last 25 years. Pathological conditions, technically poor quality images and artifacts necessitate that algorithms use sufficient background knowledge of anatomy and spatial relations of bones in order to work satisfactorily. We present a robust knowledge based methodology for detecting reference points of the main skeletal regions that simultaneously processes anterior and posterior whole-body bone scintigrams. Expert knowledge is represented as a set of parameterized rules which are used to support standard image processing algorithms. Our study includes 467 consecutive, non-selected scintigrams, which is to our knowledge the largest number of images ever used in such studies. Automatic analysis of whole-body bone scans using our knowledge based segmentation algorithm gives more accurate and reliable results than previous studies. Obtained reference points are used for automatic segmentation of the skeleton, which is used for automatic (machine learning) or manual (expert physicians) diagnostics. Preliminary experiments show that an expert system based on machine learning closely mimics the results of expert physicians.
1 Introduction
Whole-body scan or bone scintigraphy is a well known clinical routine investigation and one of the most frequent diagnostic procedures in nuclear medicine. Indications for bone scintigraphy include benign and malignant diseases, infections, degenerative changes ... [2]. Bone scintigraphy has high sensitivity and the changes of bone metabolism are seen earlier than changes in bone structure detected on radiograms [15]. The investigator's role is to evaluate the image, which is of poor resolution due to the physical limitations of the gamma camera. There are approximately 158
bones visible on both anterior and posterior whole-body scans [10]. The poor quality and the number of bones to inspect make the evaluation difficult and often tedious work. Some research on automating the process of counting bone lesions has been done, but only a few studies attempted to automatically segment individual bones prior to the computerized evaluation of bone scans [6; 7; 1].
1.1 Related Work
First attempts to automate scintigraphy in diagnostics for thyroid structure and function were made in 1973 [11]. Most of the research on automatic localization of bones has been done at the former Institute of medical information science at the University of Hildesheim in Germany from 1994 to 1996. The main contribution was made by the authors Berning [7] and Bernauer [6] who developed semantic representation of the skeleton and evaluation of the images. Benneke [1] realized their ideas in 1996. Yin and Chiu [13] tried to find lesions using a fuzzy system. Their preprocessing of scintigrams includes rough segmentation of six parts with fixed ratios of the whole skeleton. Those parts are rigid and not specific enough to localize a specific bone. Their approach for locating abnormalities in bone scintigraphy is limited to point-like lesions with high uptake. When dealing with lesion detection other authors like Noguchi [10] have been using merely intensity thresholding and manual lesion counting or manual bone ROI (region of interest) labeling. Those procedures are only sufficient for more obvious pathologies whereas new emerging pathological regions are overlooked.
2 Aim and Our Approach
The aim of our study was to develop a robust method for segmenting whole-body bone scans to allow further development of automatic algorithms for bone scan diagnostics of individual bones. We have developed an algorithm for detecting extreme edges of images (peaks). The respective skeletal regions are processed in the following order: shoulders, head, pelvis, thorax and extremities. The experience with automatic processing is presented. Several image processing algorithms are used, such as binarization, skeletonization, the Hough transform, Gaussian filtering [4], the least squares method and ellipse fitting, in combination with background knowledge of anatomy and scintigraphy specialities. In everyday practice, when a bone is identified, it is diagnosed by the expert physician according to several possible pathologies (lesions, malignancy, metastasis, degenerative changes, inflammation, other pathologies, no pathologies). This process can be supported by using a machine learning classifier [9] which produces independent diagnoses. As input it is given a suitably parameterized bone image, obtained from the detected reference points. As output it assigns the bone to one of the above pathologies. It can therefore be used as a tool to give the physician additional insight into the problem.
3 Materials and Methods
3.1 Patients and Images
A retrospective review of 467 consecutive, non-selected scintigraphic images from 461 different patients who visited the University Medical Centre in Ljubljana from October 2003 to June 2004 was performed. Images were not preselected, so the study included the standard distribution of patients coming for examination over 9 months. 19% of the images were diagnosed as normal, which means no pathology was detected on the image. 57% of the images were diagnosed with slight pathology, 20% with strong pathology and 2% were classified as super-scans. Images also contained some artifacts and non-osseous uptake such as urine contamination and medical accessories (e.g. urinary catheters) [5]. In addition, segmentation was complicated by the injection site of the radiopharmaceutical. Partial scans (missing a part of the head or upper/lower extremities in the picture) were the case in 18% of the images. There were also adolescents with growth zones (5% of the images), manifested as increased osteoblastic activity in well delineated areas with very high tracer uptake.
3.2 Bone Scintigraphy
All patients were scanned with a Siemens MultiSPECT gamma camera with two heads and LEHR (Low Energy High Resolution) collimators. The scan speed was 8 cm per minute with no pixel zooming. 99mTc-DPD (Techneos®) was used. Bone scintigraphy was performed about 3 h after intravenous injection of 750 MBq of the radiopharmaceutical agent. The whole-body field was used to record anterior and posterior views digitally with a resolution of 1024 × 256 pixels. Images represent the counts of detected gamma rays in each spatial unit with 16-bit grayscale depth.
3.3 Detection of Reference Points
Bone scans differ considerably from one another (Figure 3) even though the structure and position of the bones are more or less the same. In practice many scans are only partial, because only a particular part of the body is observed or because of scanning time limitations. In our study we have observed that only in two images out of 467 were the shoulders not visible. Other characteristic parts (i.e. head, arms, one or both legs) were missing from the images more often. We have chosen the shoulders as the main reference points to start with, which means they are supposed to be visible in the images. The second and last assumption is the upward orientation of the image. This assumption is not limiting since all scintigraphies are made with the same orientation. In order to make the detection of reference points faster and more reliable we have tried to automatically detect intuitive peaks which would represent edges and would also roughly cover the reference points. With a standard Canny edge filter too many peaks were obtained. Our approach is based on orthogonal two-way Gaussian filtering [16].
Low image intensities (count level) acquired in typical studies are due to the limited level of radioactive dosage required to ensure the patient's safety. They make bone scans look distorted. Bone edges are more expressive after we filter the images with an averaging algorithm (e.g. wavelet-based, median or Gaussian filtering) [4]. We have used a Gaussian filter so that the detection of peaks is more reliable. Both images, anterior and posterior, are simultaneously processed in the same detection order, and in each step the detected reference points visible on both images are compared and corrected adequately. Detected points from the anterior image are mirrored to the posterior and vice versa. Some bones are better visible on anterior and some on posterior images due to the varying distances from both collimators. This improves the calculation of circles, lines and ellipses with the least squares method (LSM). The order in which the reference points are detected was determined by using knowledge of human anatomy as well as physicians' recommendations. This knowledge is represented as a list of parameterized rules. Rule parameters (e.g. thresholds, spatial and intensity ratios, ...) were initially set by physicians and further refined on a separate tuning set. For example:

iliumStart(spineLocation, iliumROI) :- shift iliumROI (width: shoulder width * 0.8, height: 10 pixels) starting at 60% of the estimated spine length downwards and stop when there are more than 3 peaks inside the iliumROI or the iliumROI's average intensity changes more than 80% regarding the previous iliumROI.

iliumBone(iliumCircleParam) :- spine(spineLocation), iliumStart(spineLocation, iliumROI), extend iliumROI's height to its width, narrow iliumROI with dynamic binarization, apply LSM on peaks inside iliumROI, run LSM iterations until all peaks are covered with the iliumCircleParam ring (in each run remove peaks lying outside the ring).
More details can be found in [16].
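As a rough illustration of the pre-processing, the sketch below smooths a low-count scan with a Gaussian filter and marks local extrema of the horizontal and vertical gradient magnitude as candidate peaks. This is a simplification of the orthogonal two-way filtering of [16]; the threshold and σ values are purely illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def detect_peaks(scan, sigma=2.0, grad_thresh=0.15):
    """Return a boolean mask of candidate edge peaks in a bone scan.

    The scan is smoothed, then local maxima of the horizontal and vertical
    gradient magnitude are kept if they exceed a relative threshold.
    """
    img = gaussian_filter(scan.astype(float), sigma)
    gx = np.abs(np.diff(img, axis=1, prepend=img[:, :1]))   # horizontal gradient
    gy = np.abs(np.diff(img, axis=0, prepend=img[:1, :]))   # vertical gradient
    thr_x, thr_y = grad_thresh * gx.max(), grad_thresh * gy.max()
    # keep pixels whose gradient is a local maximum along its own direction
    peak_x = (gx >= np.roll(gx, 1, axis=1)) & (gx > np.roll(gx, -1, axis=1)) & (gx > thr_x)
    peak_y = (gy >= np.roll(gy, 1, axis=0)) & (gy > np.roll(gy, -1, axis=0)) & (gy > thr_y)
    return peak_x | peak_y
```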
Shoulders. The shoulders are the only part of the body that is assumed to be present in every image, in the upper part on both sides. The algorithm simply searches for the highest detected peak on both sides of the image. The next step is to locally shift the candidate points with local maximum intensity tracing to the outermost location. In only 5 images out of 467 were the shoulders not found correctly, due to a tilted head position.

Pelvic Region (Ilium Bone, Pubis Bone, Great Trochanter of Femur). The most identifiable bone in the pelvic region is the ilium bone, which has higher uptake values than its neighboring soft tissue. The ilium bone has a circular shape in the upper part and is therefore convenient for circle detection with the LSM method. This bone is well described by the already detected peaks, as shown in Figure 1(b). The ilium position is roughly estimated with regions of interest (ROIs) which are found on the basis of the skeleton's anticipated ratios and the reference points found up to this step of the detection. The pelvis is located at the end of the spine and has approximately the same width as the shoulders. In order to find the pelvis, the calculation of the spine position is required. This is done with a beam search (Figure 1(a)). The anticipated spine length is determined from the distance between the shoulders. The beam starting point is the midpoint of the shoulders and its orientation is perpendicular to the shoulder line. The angle at which the beam covers the most peaks is a rough
Fig. 1. Beam search and detection in the pelvic region: (a) beam search sketch, (b) detection of bones in the pelvic region
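A minimal sketch of the beam search described above is given below; the beam width, angular range and angular step are illustrative assumptions rather than the parameters used in the system.

```python
import numpy as np

def spine_direction(peaks, shoulder_l, shoulder_r, beam_length,
                    beam_width=8, angle_range_deg=30, angle_step_deg=1):
    """Sketch of the beam search: rotate a beam around the point midway
    between the shoulders and keep the angle covering the most peaks."""
    peaks = np.asarray(peaks, dtype=float)
    start = (np.asarray(shoulder_l, float) + np.asarray(shoulder_r, float)) / 2
    shoulder_dir = np.asarray(shoulder_r, float) - np.asarray(shoulder_l, float)
    base_angle = np.arctan2(shoulder_dir[1], shoulder_dir[0]) + np.pi / 2  # perpendicular

    best_angle, best_count = base_angle, -1
    for d in np.arange(-angle_range_deg, angle_range_deg + 1e-9, angle_step_deg):
        angle = base_angle + np.deg2rad(d)
        axis = np.array([np.cos(angle), np.sin(angle)])
        rel = peaks - start
        along = rel @ axis                                      # distance along the beam
        across = np.abs(rel @ np.array([-axis[1], axis[0]]))    # distance from beam axis
        count = np.sum((along >= 0) & (along <= beam_length) & (across <= beam_width / 2))
        if count > best_count:
            best_angle, best_count = angle, count
    return best_angle
```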
The pubis bone is detected by estimating the pubis ROI from the detected ilium location, the distance between the detected ilium circles and their inclination. The experimentally determined ROI is narrowed, additional vertical peaks are added, and circles are detected as shown in Figure 1(b). Head and Neck. Once the image orientation and the location of the shoulders are known, some part of the neck or even the head is visible, since they lie between the shoulders. Finding the head is not difficult, but finding its orientation is, especially in cases where a part of the head is not visible in the scan. The most reliable method for determining the head's orientation and position is ellipse fitting of the head contour determined by thresholding. The neck is found by locally shifting, in the vertical direction, a stripe determined by the ellipse's semiminor axis (position and orientation). Thoracic Part (Vertebrae, Ribs). The vertebrae have more or less constant spatial relations; the only problem is that on a bone scintigraphy only a planar projection of the spine is visible. Since the spine is longitudinally curved, the spatial relations vary with the longitudinal orientation of the patient. Average vertebral relations have been determined experimentally from normal skeletons. The ribs are the most difficult skeleton region to detect since they are quite unexpressive on bone scans, their formation can vary considerably, and their contours [14] can be disconnected in the case of stronger pathology (Figure 2).
Fig. 2. Rib detection steps example on a skeleton with strong pathology. Rib ROI is binarized (A), binarized image is skeletonized (B), Hough transform of linear equation is calculated on skeleton points (C), reference points are estimated using results of the Hough transform (D), rib contours are individually followed by the contour following algorithm (E)
Fig. 3. Examples of body scan variety
Lower and Upper Extremities (Femur, Knee, Tibia, Fibula, Humerus, Elbow, Radius, Ulna). The extremities are often partly absent from the whole-body scan because of the limited gamma camera detector width; for our patients, the maximum width of 61 cm is usually not enough to cover the entire skeleton. The regions of the humerus, ulna and radius as well as the femur, tibia and fibula are located with a controlled beam search. The beam lengths can be estimated from skeletal relationships (e.g. the femur length is estimated as 78% of the distance between the neck and the center of the ilium bone). The detection is designed so that a part or all of the extremities and/or the head may be missing.
3.4 Diagnosing Pathologies with Machine Learning
When all reference points are obtained, every bone is assigned a portion of the original scintigraphic image according to the relevant reference points. The obtained image is parameterized using the ArTeX algorithm [8], which uses association rules to describe
images in a rotation-invariant manner. Rotation invariance is very important in our case since it accounts for the different positions of patients inside the camera. Bones were described with several hundred automatically generated attributes, which were used to train the SVM [12] learning algorithm. In our preliminary experiments individual pathologies were not discriminated, i.e. bones were labelled with only two possible diagnoses (no pathology, pathology). In 19% of the patients no pathology or other artifacts were detected by the expert physicians; in the remaining 81% at least one pathology or artifact was observed.
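The following sketch shows how such a two-class classifier could be trained and evaluated with ten-fold cross-validation using scikit-learn; the ArTeX attribute vectors are replaced by random placeholders, and the SVM kernel and parameters are illustrative, not those used in the reported experiments.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: one row per bone region, several hundred rotation-invariant texture
#    attributes (random placeholders here); y: 0 = no pathology, 1 = pathology
rng = np.random.default_rng(0)
X = rng.random((268, 400))
y = rng.integers(0, 2, size=268)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
scores = cross_val_score(model, X, y, cv=10)   # ten-fold cross-validation
print(f"mean accuracy: {scores.mean():.3f}")
```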
4 Results
4.1 Segmentation
Approximately half of the available images were used for tuning the rule parameters to optimize the recognition of the reference points, and the other half was used to test it. All 246 patients examined from October 2003 to March 2004 were used as the tuning set, and the 221 patients examined from April 2004 to June 2004 were used as the test set. In the tuning set there were various non-osseous uptakes in 38.9% of the images, 47.5% of the images had a visible injection point, and 6.8% were images of adolescents with visible growth zones. A similar distribution was found in the test set (34.5% non-osseous uptakes, 41.0% visible injection points and 2.85% adolescents). Most of the artifacts were minor radioactivity points from urine contamination in the genital region or other parts (81.4% of all artifacts), whereas only a few other types were observed (urinary catheters 13%, artificial hips 4% and lead accessories 1.6%). There were no ill-detected reference points in the adolescents with visible growth zones, since all their bones are homogeneous, have good visibility and are clearly divided by the growth zones. Results of detecting reference points on the test set are shown in Table 1.
4.2 Machine Learning Results
From our complete set of 467 patients, pathologies were thoroughly evaluated by physicians for only 268 patients. These 268 patients were used for the evaluation of the machine learning approach by using ten-fold cross-validation. Results are shown in Table 2. The bones were grouped into ten relevant bone regions defined by the reference points. Classification accuracy was obtained for a two-class problem.

Table 1. False reference point detection on the test set. Both frequencies and percentages are given.

Bone         no pathology (46)  slight pathology (133)  strong pathology (39)  super-scan (3)  all (221)
ilium        0                  2 (0.9%)                6 (2.7%)               1 (0.5%)        9 (4.1%)
pubis        2 (0.9%)           3 (1.4%)                2 (0.9%)               0               7 (3.2%)
trochanter   0                  1 (0.5%)                0                      0               1 (0.5%)
shoulder     0                  0                       1 (0.5%)               0               1 (0.5%)
extremities  5 (2.3%)           11 (5.0%)               0                      0               16 (7.2%)
spine        0                  2 (0.9%)                1 (0.5%)               0               3 (1.4%)
ribs         11 (5.0%)          17 (7.7%)               3 (1.4%)               0               31 (14.0%)
neck         2 (0.9%)           4 (1.8%)                0                      0               6 (2.7%)

Table 2. Experimental results with machine learning on the two-class problem

Bone group       spec. %  sensit. %  accuracy %
Cervical spine   75.9     80.0       77.8
Feet             83.8     84.1       68.0
Skull posterior  94.7     88.2       100.0
Ilium bone       87.3     87.6       82.8
Lumbal spine     71.4     75.7       65.4
Femur and tibia  88.9     84.6       73.3
Pelvic region    92.2     90.7       85.0
Ribs             98.1     92.5       91.7
Scapula          91.4     90.9       90.9
Thoracic spine   82.0     79.2       61.5
AVG              86.6     85.4       79.6
5 Discussion
The testing showed encouraging results: the detection of the proposed reference points gave excellent results for all bone regions except the extremities, which was expected. We paid special attention to images with partial skeletons, since these are frequent in clinical routine (in our study 18% of the images were partial and no particular problems appeared in detecting them) and a robust segmentation algorithm should not fail on such images. The detection of the ribs proved to be the most difficult, which was also expected. The results show that in 14% to 20% of the images there were difficulties in detecting the ribs; this usually means that one rib is missed or not followed to the very end, which we intend to improve in the future. In the present system such reference points can be manually repositioned by the expert physicians. The automatically detected reference points can be used for mapping a standard skeletal reference mask, which is, in our belief, the best way to find individual bones on scintigrams, since bone regions are often not expressive enough to follow their contours. An example of such mask mapping is shown in Figure 4. While our experimental results with machine learning are quite good, one must bear in mind that they were obtained for a simplified (two-class) problem. Simply extending the problem to a multi-class paradigm is not acceptable in our case, as a bone may be assigned several different pathologies at the same time. A proper approach, the one we are currently working on, is to rephrase the problem as a multi-label learning problem, where each bone will be labelled with a nonempty subset of all possible labels [17; 3].
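The exact multi-label method is not specified here; a simple baseline is binary relevance, i.e. training one binary classifier per label, sketched below with scikit-learn. The attribute matrix, the label set and the classifier settings are placeholders, not the approach ultimately adopted by the authors.

```python
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import SVC

# X: attribute vectors per bone; Y: binary indicator matrix, one column per
# pathology label (a bone may carry several labels at once) -- placeholders
rng = np.random.default_rng(1)
X = rng.random((268, 400))
Y = rng.integers(0, 2, size=(268, 5))   # e.g. 5 hypothetical pathology labels

# Binary relevance: one independent SVM per label column
clf = MultiOutputClassifier(SVC(kernel="rbf", probability=True)).fit(X, Y)
predicted_labels = clf.predict(X[:3])   # label subsets encoded as 0/1 rows
print(predicted_labels)
```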
Fig. 4. Example of mapped standard skeletal mask with the detected reference points
6 Conclusion
The presented computer-aided system for bone scintigraphy is a step forward in automating routine medical procedures. Several standard image processing algorithms were tailored and used in combination to achieve the best reference point detection accuracy on scintigraphic images, which have very low resolution. Poor image quality, artifacts and pathologies require the algorithms to use as much background knowledge on the anatomy and spatial relations of bones as possible in order to work satisfactorily. This combination gives quite good results, and we expect that further studies on automatic scintigraphy diagnosis using reference points for image segmentation will give more accurate and reliable results than previous studies, which neglected segmentation. This approach opens a new view on automatic scintigraphy evaluation, since in addition to the detection of point-like high-uptake lesions it enables:
– more accurate and reliable evaluation of bone symmetry when looking for skeletal abnormalities; many abnormalities can be spotted only when symmetry is observed (differences in length, girth, curvature, etc.),
– detection of lesions with low uptake or lower activity due to metallic implants,
– the possibility of comparing uptake ratios among different bones,
– more complex pathology detection by combining pathologies of several bones (e.g. arthritis in joints),
– the possibility of automatic reporting of bone pathologies in written language.
The machine learning approach to this problem is at a very early stage, so its usefulness in practice cannot yet be objectively evaluated. However, the preliminary results are encouraging, and switching to the multi-label learning framework may make them even better.
Acknowledgement
This work was supported by the Slovenian Ministry of Higher Education, Science and Technology through the research programme P2-0209. Special thanks to nuclear medicine specialist Jure Fettich at the University Medical Centre in Ljubljana for his help and support.
References
[1] Benneke A. Konzeption und Realisierung eines semi-automatischen Befundungssystems in Java und Anbindung an ein formalisiertes Begriffssystem am Beispiel der Skelett-Szintigraphie. Diplomarbeit, Institut für Medizinische Informatik, Universität Hildesheim, mentor Prof. Dr. D.P. Pretschner, 1997.
[2] Hendler A. and Hershkop M. When to Use Bone Scintigraphy. It Can Reveal Things Other Studies Cannot. Postgraduate Medicine, 104(5):54–66, 11 1998.
[3] McCallum A. Multi-Label Text Classification with a Mixture Model Trained by EM. In Proc. AAAI'99 Workshop on Text Learning, 1999.
[4] Jammal G. and Bijaoui A. DeQuant: a Flexible Multiresolution Restoration Framework. Signal Processing, 84(7):1049–1069, 7 2004.
[5] Weiner M. G., Jenicke L., Müller V., and Bohuslavizki H. K. Artifacts and Non-Osseous Uptake in Bone Scintigraphy. Imaging Reports of 20 Cases. Radiol Oncol, 35(3):185–91, 2001.
[6] Bernauer J. Zur semantischen Rekonstruktion medizinischer Begriffssysteme. Habilitationsschrift, Institut für Medizinische Informatik, Univ. Hildesheim, 1995.
[7] Berning K.-C. Zur automatischen Befundung und Interpretation von Ganzkörper-Skelettszintigrammen. PhD thesis, Institut für Medizinische Informatik, Universität Hildesheim, 1996.
[8] Bevk M. and Kononenko I. Towards Symbolic Mining of Images with Association Rules: Preliminary Results on Textures. In Brito P. and Noirhomme-Fraiture M., editors, ECML/PKDD 2004: Proc. of the Workshop W2 on Symbolic and Spatial Data Analysis: Mining Complex Data Structures, pages 43–53, 2004.
[9] Kukar M., Kononenko I., Grošelj C., Kralj K., and Fettich J. Analysing and Improving the Diagnosis of Ischaemic Heart Disease with Machine Learning. Artificial Intelligence in Medicine, 16:25–50, 1999.
[10] Noguchi M., Kikuchi H., Ishibashi M., and Noda S. Percentage of the Positive Area of Bone Metastasis is an Independent Predictor of Disease Death in Advanced Prostate Cancer. British Journal of Cancer, (88):195–201, 2003.
[11] Maisey M.N., Natarajan T.K., Hurley P.J., and Wagner H.N. Jr. Validation of a Rapid Computerized Method of Measuring 99mTc Pertechnetate Uptake for Routine Assessment of Thyroid Structure and Function. J Clin Endocrinol Metab, 36:317–322, 1973.
[12] Cristianini N. and Shawe-Taylor J. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.
[13] Yin T.K. and Chiu N.T. A Computer-Aided Diagnosis for Locating Abnormalities in Bone Scintigraphy by a Fuzzy System With a Three-Step Minimization Approach. IEEE Transactions on Medical Imaging, 23(5):639–654, 5 2004.
[14] Kindratenko V. Development and Application of Image Analysis Techniques for Identification and Classification of Microscopic Particles. PhD thesis, Universitaire Instelling Antwerpen, Departement Scheikunde, 1997.
[15] Müller V., Steinhagen J., de Wit M., and Bohuslavizki H. K. Bone Scintigraphy in Clinical Routine. Radiol Oncol, 35(1):21–30, 2001.
[16] Šajn L., Kononenko I., Fettich J., and Milčinski M. Automatic Segmentation of Whole-Body Bone Scintigrams. Technical report, Faculty of Computer and Information Science, University of Ljubljana, Nov. 2004. URL http://lkm.fri.uni-lj.si/papers/Skelet.pdf.
[17] Shen X., Boutell M., Luo J., and Brown C. Multi-Label Machine Learning and its Application to Semantic Scene Classification. In Proceedings of the 2004 International Symposium on Electronic Imaging (EI 2004), San Jose, California, 2004.
Multi-agent Patient Representation in Primary Care
Chris Reed 1,2, Brian Boswell 2, and Ron Neville 2,3
1 Division of Applied Computing, University of Dundee, Dundee DD1 4HN, UK
2 Calico Jack Ltd, Argyll House, Dundee DD1 1QP, UK
3 Westgate Health Centre, Charleston Dr., Dundee DD2 4AD
{chris, brian, ron}@calicojack.co.uk
Abstract. Though multi-agent systems have been explored in a wide variety of medical settings, their role at the primary care level has been relatively little investigated. In this paper, we present a system that is currently being piloted for future rollout in Scotland that employs an industrial-strength multi-agent platform to tackle both technical and sociological challenges within primary care. In particular, the work is motivated by several specific issues: (i) the need to widen mechanisms for access to primary care; (ii) the need to harness technical solutions to reduce load not only for general practitioners, but also for practice nurses and administrators; (iii) the need to design and deploy technical solutions in such a way that they fit into existing professional activity, rather than demanding changes in current practice. With direct representation of individuals in health care relationships implemented in a multi-agent system (with one multi-functional agent representing each patient, doctor, nurse, pharmacist, etc.), it becomes straightforward first to model and then to integrate with existing practice. It is for this reason that the system described here successfully widens access for patients (by opening up novel communication channels of email and SMS texting) and reduces load on the practice (by streamlining communications and semi-automating appointment arrangement). It does this by ensuring that the solution is not imposed on, but rather, integrated with what currently goes on in primary care. Furthermore, with agents responsible for maintaining audit trails for the patients they represent, it becomes possible to see elements of the electronic patient record (EPR) emerging under agent control. This EPR can be extended through structured interaction with the practice system (here, we examine the GPASS system, the market leader in Scotland), to allow rich agent-agent and agent-human interactions. By using multi-agent design and implementation techniques, we have been able to build a solution that integrates both with individuals and extant software to successfully tackle real problems in primary care.
1 Introduction
Increasingly, research in multi-agent systems is exploring the advantages offered by the emerging multi-agent software engineering paradigm (see, for example [13]). By focusing on the relationship between an entity in the real world (an individual, a relationship, an organisation, a datum) and its corresponding entity in the "electronic world" (an agent, a relationship between agents, a set of agents, a datum), design can
become easier and quicker [4]. The process of teasing out complex and intricate stakeholder relationships in real world domains is not made harder by the restrictions and assumptions of the implementation paradigm. As a result, interesting multi-agent based models of complex social structures have started to emerge [22]. The health care system is a perfect example. It offers extremely rich interdependent sets of relationships, stakeholders, rights, requirements and agendas that, though interesting for the modeller, have proved an enormous challenge for the deployment and uptake of practical IT systems [15]. It is an environment subject to constant change of workforce, ‘clients’, and infrastructure, often with a zero tolerance of error. The challenge for multi-agent systems presented by the complexities of health care has been well documented (see e.g. [10, 12, 16]) and has provided a rich field for research, but primary care - and its own unique set of issues - is often omitted from these investigations. The work described here harnesses agentive representation and techniques from agent oriented programming in tackling some of these features of primary care.
2 Background
Patients, politicians and health care professionals all agree that Patient Centred Care is a good thing, but translation of the concept into reality has yet to be achieved [23]. The fields of health and computing share the common jargon of 'user friendly', 'accessible' and 'flexible', and so it is unsurprising that agents have been touted for patient-centered health care. One of the earliest examples of work examining the role of multi-agent systems in health care is offered by the AADCare architecture [10]. The focus of the work presented there, and of the broader context in which it was conducted (viz., the DILEMMA project), is upon appropriate theorem proving in decision support systems that have to deal with complex, incomplete, inconsistent and potentially conflicting data. The agent component is designed to support the distribution of clinical 'tasks' amongst players in the system, in a manner similar to the much earlier Contract Net protocol for automated task distribution [24]. A prototype of the AADCare system as a whole was implemented for the management of cancer patients in the UK NHS system, though the extent of deployment and its subsequent success is unreported. Crucially, access for the patient to their medical record, and the unification of record components across different health services for an individual patient, was not a focus for AADCare in either implementation or theory, and the system therefore mostly side-steps issues of patient-centered health care. The Guardian Angel project [8] represents a "manifesto" developed since 1994 that tackles the patient-centered approach head on. Some elements of the manifesto have led to implementation, of which the earliest was the Personal Internetworked Notary and Guardian project, PING [21]. Although that work mentions "agents" in passing, its focus is upon implementing basic security mechanisms for (conceptually) centralised data stored using XML. It makes no use of the agent-oriented, peer-to-peer approach in either design or implementation. The motivating concepts described in [14], however, are precisely those addressed in the current work: the need to balance patient access and security; the need to reduce fragmentation in medical records; the need for IT infrastructure to be interoperable; etc.
In the context of UK health care, Pouloudi and Reed [20] offer a relatively early example of using multi-agent systems to represent and model interactions in the NHS in an attempt to build a realistic foundation for integrative systems. The work combines intra-agent representational concepts with inter-agent communication and relationship structures in modelling the interactions between stakeholder relationships in patient data. The model there was theoretical and unimplemented. More recently, the Advanced Computational Lab at Cancer Research UK has built multi-agent models of the same sorts of complex relationships specifically in the context of cancer, from initial patient contact with their GP through various stages of care and maintenance [2; 7]. They employ the mature COGENT and PROforma tools and the tried-and-tested Domino model of agent architecture, but still the focus is squarely upon the interacting agencies of the health system, for which the patient is simply a customer. Moreno et al. [16] describe a system that moves closer to the ideal of patient-centered involvement and access. In their HeCaSe system, patients have an interface that supports appointment booking and various static configuration parameters. It is, however, focused only on the interaction between patients and initial, primary care consultations, and is run from PC clients. Crucially from a deployment point of view, it requires doctors to switch to a new system. Finally, it is worth noting that perhaps the largest impact of multi-agent systems in health care to date has been in specifically targeted applications that focus on particular functions of the health system. So, for example, there are prototypes and demonstrators of multi-agent system applications in areas such as organ transplant [6; 17], antibiotic prescription [9], pharmacy in general [3], protocol monitoring [1], proactive information provision in anaesthesia [11], data flow in Leukemia management [12] and others (such as those in the special issue, volume 27 issue 3, of AI in Medicine). These examples are clinician-centred attempts to streamline existing processes. Thus multi-agent systems as a tool are having an impact in many areas. But this is peripheral to the argument that we hope to make here, namely, that multi-agent systems as a paradigm fits the goal of patient-centered health care perfectly not only at a conceptual level, but also in implementation and deployment.
3 Multi-agent Systems for Patient-Centered Health Care
The concept of agentive representation is implicit in very many agent-based models of real world structures, and even in entire agent-based methodologies such as Gaia [25]. The idea is simply that one component in the real world is represented by a single corresponding agent in the system. This idea is now also starting to gain traction in the commercial world [4]. In the medical domain, agentive representation means agents representing general practitioners, consultants, pharmacists, and, of course, patients. To build systems that are to be deployed in real health care situations, it is important that the infrastructure meets and exceeds a range of basic expectations of the users with respect to aspects such as security, scaling and reliability. In the work described here, we have selected the JUDE platform [5] for reasons of flexibility and robustness. The architecture of agents in JUDE is simple in that each agent is equipped with a set of generic functionality (such as basic reasoning and communication) that can then be augmented with additional modular functionality as needed.
The system implements the patient-centered approach by equipping agents representing patients with all the functionality they require to represent their corresponding patient in the electronic health care world. So, for example, the patient agent can access data on that patient held in different locations and by different parties. The patient agent can communicate with the local surgery to organise appointments. The patient agent can access information on pharmacy location and availability. And, of course, the patient agent can communicate with, and be contacted by, the patient themselves. This communication can make use of whatever channels may happen to be available at a given moment – from web to SMS. But every time, and in every case, the patient is simply communicating with their own, persistent, agent. Detailing the implementation in full is beyond the scope of this paper, but it is useful to offer depth in a subset of the functionality and go on to show how this functionality is deployed and fitted into existing primary care processes.
3.1 An Example: Reducing DNA Rates
In this currently live trial, patient agents are equipped with the ability to interact with agents representing individuals in a GP surgery, including the receptionist and GP. One agent-mediated interaction is the process of booking and confirming appointments. The appointments process is a current area of interest in practice management, since a substantial proportion of valuable GP time is wasted as a result of people who book appointments but then subsequently do not attend (DNA). Reducing the DNA rate offers practical and substantial advantages to GPs and practices in the UK. The current model is to allow patients to ring the practice receptionist and negotiate verbally to arrange an appointment time convenient for both GP and patient. In some cases, some or all of this process may be conducted by email instead of over the telephone. An agent-based solution offers a technical improvement whilst integrating with existing practice to minimise barriers to use. A patient's agent is responsible for intervening in some or all of the communication between the patient and the practice, and is responsible for reminding the patient of upcoming appointments. A patient is allocated an agent in the system following consent agreement. At that point, the patient can send a text message or an email to their agent, which, in either case, then communicates with the agent representing the practice receptionist. The receptionist's agent then communicates with the receptionist through the most appropriate means – at the moment, that is email. The negotiation is conducted in this way between patient and receptionist until agreement on an appointment time is reached. (It is unreasonable to expect patients to be using an electronic diary, and thus it is not possible to automate the patient end of the negotiation process. Similarly, it is important to keep the receptionist in the loop, and so automating that end is also counterproductive.) When agreement is reached, the receptionist confirms the appointment through a web interface provided by the receptionist's agent. That agent then informs the patient's agent of the confirmed appointment time. At 24 hours before the appointment, the patient's agent sends a reminder (currently by SMS). At two hours before the appointment is due, the patient's agent sends a second reminder.
If the patient does not reply to that second reminder, thereby failing to confirm that they still intend to keep the appointment, the patient's
agent will inform the receptionist's agent of a problem. In this case, the receptionist's agent will take some default action, which is currently to email the receptionist suggesting that the appointment be cancelled and the time freed up. Figure 1 summarises an example interaction, demonstrating the various communication mechanisms (text, email and web, indicated by the icons on the interaction arrows), the simple proactive work of the patient's agent, and the involvement of the receptionist and patient in the system.
Fig. 1. Sample Interaction
There are many practical advantages offered by introducing agents into the loop, both in the current system and for its future development. First, robustness is improved by having state information maintained explicitly in the agents – so, for example, if an SMS message fails to be delivered, an agent can retry, use a different communication medium (such as email) or at least alert appropriately (such as warning the receptionist). Second, message routing in general becomes transparent – the receptionist need not be concerned with a patient's decision to have communications delivered through one medium (SMS) rather than another (email). Third, as explored in the next section, agents represent a natural and simple means of offering audit and tracing facilities. Fourth, agents in the framework are extensible on the fly; as discussed below, developing and deploying new functionality is straightforward. Fifth, for some services, patient anonymity is vital; agents provide a natural decoupling that can preserve such anonymity. Finally, looking further to the future, MAS provides a natural mechanism for representing real world relationships that can be exploited in routing messages.
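To make the reminder and escalation behaviour concrete, the following is a plain Python sketch of a patient agent's logic; the JUDE API is not shown in the paper, so all class and method names here are hypothetical stand-ins rather than the actual implementation.

```python
class PatientAgent:
    """Sketch of the reminder/escalation behaviour described above. The
    channel objects and the receptionist agent are hypothetical stand-ins;
    the real system is built on the JUDE multi-agent platform."""

    def __init__(self, channels, receptionist_agent):
        self.channels = channels              # preferred order, e.g. [sms, email]
        self.receptionist = receptionist_agent
        self.confirmed = False

    def send(self, message):
        for channel in self.channels:         # fall back to the next medium on failure
            if channel.deliver(message):
                return True
        self.receptionist.notify("Could not reach patient on any channel.")
        return False

    def remind_24h(self):
        self.send("Reminder: you have an appointment tomorrow. Reply YES to confirm.")

    def remind_2h(self):
        self.send("Reminder: your appointment is in two hours. Reply YES to confirm.")

    def on_reply(self, text):
        self.confirmed = text.strip().upper() == "YES"

    def after_second_reminder(self):
        if not self.confirmed:                # no confirmation: suggest freeing the slot
            self.receptionist.notify("Patient unconfirmed; consider cancelling the appointment.")
```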
4 Realities of Using a Multi-agent System in Primary Care
The published work on MAS applications in health care understandably concentrates on the computing methodology and on the need to design software around complex health care problems. It will become increasingly important to involve the people using the system – clinicians and patients – in future developments. Research into any new technology, including MAS, must look at subtle patient-centred issues. Innovations need to be based around what people actually want to use. Multi-agent systems have the potential to support an ongoing dialogue between doctor and patient. Research must therefore include an ongoing dialogue to plan and implement findings. The "Do Not Attend" component of the system described in the previous section is adopted as an exemplar by which to follow through this process. By drawing on previous experience of integrating new technology into the patient-facing side of primary care [18; 19], the design of pilot trials and ethical approval for those trials has been achievable. The current deployment is based around 'as needed' creation of agents for up to 100 patients who volunteer to take part in the trial. Text-messaging-based appointment booking is offered as an extra service. For the receptionist, all interaction with the system, and with patients using the system, is through a simple web interface and email, both of which are familiar and non-threatening. The interface between the system and patients is text messaging, with which all who sign up are very familiar. Finally, the interface between the system and additional components (such as GPs) is built around email, again fitting in with existing practices. As a result, introduction of the service has been straightforward and has hit no major problems. Clearly, the system, its deployment and its uptake are in very early days, but the current status marks an important milestone. The key step that has been taken is to provide each individual patient with an agent that interacts, on the patient's behalf, with agents representing health care professionals. With this scenario engineered and
deployed, and with the dynamic upgrade facilities in JUDE (whereby new functionality can be deployed to existing agents without needing to "reboot" either the systems or those agents that are upgraded), it is relatively easy to map out a programme of rolling development and deployment of patient-centered services. With appointment booking and reminders in place, the next step is to integrate another burdensome practice task: repeat prescription ordering. Ordering repeat prescriptions is, in the UK and other health systems, a task initiated by a patient who has a long-term need for prescription drugs. It requires a GP to sign off, which takes both GP time and a physical appointment, involving both travel and time for the patient, as well as practice overhead in the form of receptionist time for booking. With agents representing all the parties, it becomes relatively simple for the request to be routed from the patient to the GP, confirmed (or otherwise) by the GP, and then forwarded to the appropriate pharmacist, from where the patient can collect their prescription. Of course this is far from the first time that a proposal has been tabled for streamlining this process, nor is it the only IT-based solution. The novelty lies in the fact that it is exactly the same system as is currently used for appointment booking, made available for repeat prescription ordering by virtue of the direct representation of the people involved and the relationships between them. One of the additional benefits that arises "for free" with the approach is an audit trail at the individual patient level, so that health care professionals and patients can track where a request is in the system, thereby cutting out unnecessary phone calls to the practice or pharmacist. Also, with the advent of electronic prescribing, our agent system represents a solution anchored to existing practice systems.
5 An EPR Under Agent Control?
In Scotland, well over 80% of primary care practices have adopted the GPASS system (see http://www.gpass.co.uk) to manage patient data and other practice functions. With a wide variety of proposals for electronic patient record (EPR) management currently under discussion and development, integration with primary care systems is a key concern. GPASS, like many of its competitors, does provide programmatic access to the data it stores, via mediated SQL queries. Such a clean interface makes it a good starting point for investigating the alignment of the notion of an EPR with that of one agent for every patient. The EPR can be analysed to yield a key subset of patient medical information which has general utility for both patients and health-care professionals. This subset contains information regarding a patient's significant clinical events, current prescribed medication (both acute and repeat), key indicators such as tobacco and alcohol history, and their age, height, weight and body mass index. Together with primary personal information – name, address, date of birth, and CHI (unique health service number) – this constitutes what we call the Core Clinical Summary (CCS). The information which makes up the CCS is stored in the MS SQL database which acts as the data store for the GPASS clinical system. In order to access the information required to construct the CCS, a JUDE module was developed for use by the appropriate agents to log into and make the relevant queries to GPASS, from
which those agents might then construct the CCS accordingly. Figure 2 summarises the mechanisms by which this information is extracted in response to a patient-initiated request.
Fig. 2. Sample Patient-Agent-GP-GPASS interaction
In this example, at (1), the patient requests some part of the CCS data via the web, email or their mobile. Then, at (2), the patient's agent requests the CCS from the GP's agent, and by the implemented semantics of the inter-agent communication language, the patient's agent (2a) and GP's agent (2b) update their respective belief sets to reflect that the request was made. This update is defined by the semantics of the J-ACL communication language implemented in JUDE [5]. At (3) the GP's agent logs in to the GPASS clinical system and then makes a set of requests to GPASS for patient information. In turn, GPASS calls into the SQL database (4), which returns the appropriate patient data (5). At (6) the GP's agent then uses that data to construct the CCS, which it then embeds in the appropriate nested wrappers of a well-formed HL7 message for subsequent transmission (7). At (8), the GP's agent informs the patient's agent of the CCS, again leading to updates in the belief sets of the GP's agent (8a) and the patient's agent (8b). Finally, at (9) the patient's agent extracts the CCS from the HL7 wrappers and at (10) transmits the CCS data to the appropriate device (where what is appropriate is determined through rich contextual reasoning). This same architecture also fits snugly into a more radical, long-term and ambitious picture in which patient agents have ultimate responsibility for maintaining an up-to-date picture of EPRs. The ability of a patient to have to hand their own medical information via entirely ordinary devices promotes the position of the patient as a genuine stakeholder in their own health care. The information available to patients via the CCS allows them to accurately inform their lifestyle using straightforward mechanisms, and on their terms. So, for example, a patient can easily match their recollection of past clinical events with the corresponding sections of their CCS. Further, in situations where clinical systems are not available (e.g., an accident in a remote location) the patient themselves has the means to provide key information (such as allergies, or current medications) to assist the health care professional. Putting the patient in a position where they can easily interact proactively with their representation in the health care system also supports more interactive relationships between health care providers and patients. For GPs (and other health care professionals), the MAS-based primary care solution provides all parties from all
aspects of health care with a single point of contact to the patient, viz., the patient's agent. Whilst providing GPs with mobile access to CCS records may be convenient (immediately prior to house calls, for example), in situations where urgent access to medical information is vital, such as an emergency involving an incapacitated patient outwith the health care infrastructure, the ability for the patient's agent to be accessed for information in the CCS (such as current medication regimes or illnesses) offers the potential for significant advantages.
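The sketch below mirrors the numbered steps of Figure 2 in plain Python; the GPASS schema, the J-ACL semantics and the exact HL7 message structure are not given in the paper, so the table names, the MSH-style wrapper and all identifiers are hypothetical, and SQLite stands in for the MS SQL database.

```python
import sqlite3

# Hypothetical HL7-style wrapper; the real message structure is not specified here.
HL7_TEMPLATE = "MSH|^~\\&|GP_AGENT|PRACTICE|PATIENT_AGENT|HOME|...\n{payload}"

class GPAgent:
    """Sketch of steps (3)-(7): query the practice database for the Core
    Clinical Summary and wrap it for transmission."""
    def __init__(self, db_path):
        self.db = sqlite3.connect(db_path)

    def build_ccs(self, chi_number):
        row = self.db.execute(
            "SELECT name, dob, medication, allergies FROM ccs WHERE chi = ?",
            (chi_number,),
        ).fetchone()                                   # hypothetical table/columns
        payload = "|".join(str(field) for field in row)
        return HL7_TEMPLATE.format(payload=payload)    # step (7): wrap the CCS

class PatientAgent:
    """Sketch of steps (1)-(2) and (9)-(10): request the CCS and forward it
    to whichever device the patient is currently using."""
    def __init__(self, gp_agent, device):
        self.gp_agent, self.device = gp_agent, device
        self.beliefs = {}

    def request_ccs(self, chi_number):
        self.beliefs["ccs_requested"] = True           # step (2a): belief update
        message = self.gp_agent.build_ccs(chi_number)  # steps (3)-(8)
        ccs = message.split("\n", 1)[1]                # step (9): unwrap the payload
        self.device.display(ccs)                       # step (10): deliver to the device
```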
6 Conclusions and Directions
The focus of this work is on the design, implementation, deployment and evaluation of initially very simple solutions to very real problems that, despite their simplicity, nevertheless use a full, mature, industrial-strength multi-agent system that can not only tackle the simple problems as they scale numerically, but also incrementally take on an ever more significant role, tackling ever more complex problems that demand ever more of the underlying technology. Multi-agent systems have great potential to transform the process, and possibly even the outcome, of medical care. It is important that innovation goes hand in hand with evaluation. Our own next steps are to evaluate the outcomes of the pilot trial in designing larger scale trials which are based upon enriched functionality within individual agents. It is a logical extension to offer a MAS-based text message/email/web service including appointment booking, repeat prescription ordering and provision of clinical advice. Qualitative work will explore the views, aspirations and experiences of people who chose to use this service, with the first component of this work – health care staff interviewing – already almost complete. Quantitative work will measure when and how often people use the service, and the impact that the service has on the functioning of the practice. It will be important to look at the language and exchange of information in any patient-doctor text dialogue. Preliminary technical work and evaluation will then allow a larger cluster randomised trial to proceed in several practices involving hundreds of patients. Multi-agent systems are likely to transform health care within the next decade. At a basic level the very nature of a consultation between a patient and a health care professional will need to be re-defined. At a more sophisticated level, patients will have the opportunity to integrate health-related activity and decision making into everyday behaviour. Professionals will be able to support and advise patients in a completely new and different way. The effect of these changes on health outcomes needs to be addressed. The reactions of people using and working within the health service need to be explored. The major barrier to the implementation of multi-agent systems in health care may not be technical, but attitudinal, and it is this aspect that needs to be brought in to every stage of the development and deployment lifecycle.
References
1. Alsinet, T., Ansotegui, C., Bejar, R., Fernandez, C., Manya, F.: Automated monitoring of medical protocols, Art. Int. in Med. 27(3) (2003) 367-392
2. Black, E.: Using Agent Technology to Model Complex Organisations, AgentLink 12 (2003)
3. Calabretto, J.P., Couper, D., Mulley, B., Nissen, M., Siow, S., Tuck, J., Warren, J.: Agent Support for Patients and Community Pharmacists, Proc. of the 35th HICSS, IEEE (2002)
4. Calico Jack: Multi-Agent Systems in Mobile Telecommunications Networks, Calico Jack White Paper #2004/01 (2004)
5. Calico Jack: "An Introduction to the Jackdaw University Development Environment", Calico Jack Developer Documentation (2004)
6. Calisti, M., Funk, P., Biellman, P., Bugnon, T.: Multi-Agent System for Organ Transplant Management, in (Moreno & Nealon, 2003) 199-212
7. Fox, J., Beveridge, M., Glasspool, D.: Understanding intelligent agents: analysis and synthesis, AI Communications 16 (3) (2003) 139-152
8. GA, http://www.ga.org Guardian Angel website, last updated 15 March 2002.
9. Godo, L., Puyol-Gruart, J., Sabater, J., Torra, V., Barrufet, P., Fabregas, X.: A multi-agent systems approach for monitoring the prescription of restricted use antibiotics, Art. Int. in Med. 27(3) (2003) 259-282.
10. Huang, J., Jennings, N.R., Fox, J.: An Agent-based Approach to Health Care Management. Applied Artificial Intelligence 9 (4) (1995) 401-420
11. Knublauch, H., Rose, T., Sedlmayr, M.: Towards a Multi-Agent System for Pro-Active Information Management in Anaesthesia. In Proc. of the Workshop on Autonomous Agents in Health Care at Agents 2000, Barcelona, Spain, 2000.
12. Lanzola, G., Gatti, L., Falasconi, S., Stefanelli, M.: A Framework for Building Co-operative Software Agents in Medical Applications. Art. Int. in Med., 16(3) (1999) 223-249
13. Luck, M., Ashri, R. & D'Inverno, M.: Agent-Based Software Development, Artech, 2004.
14. Mandl, K.D., Szolovits, P., Kohane, I.S.: Public Standards and Patients' Control: How to Keep Electronic Medical Records Accessible But Private. BMJ 322 (2001) 283-287.
15. Mark, A., Pencheon, D., Elliott, R.: Demanding Healthcare. Intl. J. of Health Planning & Mgmt. 15 (2000) 237-253
16. Moreno, A., Isern, D., Sanchez, D.: Provision of Agent-Based Healthcare Services, AI Comms. 16 (3) (2003) 167-178
17. Moreno, A., Valls, A., Bocio, J.: Management of Hospital Teams for Organ Transplant Coordination, Proc. of AIME 2001 (2001)
18. Neville R.G., Greene A.C., McLeod J., Tracy A., Surie J.: Mobile phone text messaging can help young people manage asthma. BMJ 2002 325 600
19. Neville R.G., McCowan C., Pagliari H.C., Mullen H., Fannin A.: E-mail Consultations in General Practice (in press)
20. Pouloudi A., Reed, C.A.: Towards a Multi-agent representation of stakeholder interests in NHSnet, Proc. PAAM98 (1998) 393-407
21. Riva, A., Mandl, K.D., Oh, D.H., Nigrin, D.J., Butte, A., Szolovits, P., Kohane, I.S.: The Personal Internetworked Notary and Guardian. Intl. J. of Med. Informatics 62 (2001) 27-40
22. Sabater, J., Sierra, C.: Reputation and Social Network Analysis in Multi-Agent Systems, Proc. AAMAS 2002, ACM Press.
23. White Paper – Patients as Partners in Care: A Plan for Action, a Plan for Change, HMSO (2003) The Scottish Executive
24. Smith, R.G.: The Contract Net Protocol: High Level Communication and Control in a Distributed Problem Solver. IEEE Trans. Computers 29 (12) (1980) 1104-1113
25. Wooldridge, M., Jennings, N.R., Kinny, D.: The Gaia Methodology for Agent-Oriented Analysis and Design. J. of Auton. Agents and Multi-Agent Systems 3 (3) (2000) 285-312
Clinical Reasoning Learning with Simulated Patients
Froduald Kabanza1 and Guy Bisson2
1 Department of Computer Science, University of Sherbrooke, Sherbrooke, Québec J1K 2R1, Canada, [email protected]
2 Faculty of Medicine, University of Sherbrooke, Sherbrooke, Québec J1K 2R1, Canada, [email protected]
Abstract. In this paper we introduce clinical reasoning automata to model states and transitions between the different cognitive processes that occur during a clinical reasoning activity. A state of the automaton represents a particular process in a complex patient diagnosis, using influence diagrams to encode clinical knowledge about the case. Transitions model switches between diagnostic cognitive processes, such as collecting evidences, formulating hypotheses, or explicitly asking for assistance at a given point during the reasoning process. In this way, we can efficiently model tutoring feedback hints for clinical reasoning learning that are based not only on the clinical knowledge, but also on the sequencing of the tutoring processes.
1 Introduction
Medical diagnostic problems, which are the focus of clinical reasoning, are examples of complex decision-making problems for which solutions depend not only upon clinical knowledge and medical experience, but also on well-drilled strategies for collecting evidences about pathologies, analyzing them, formulating hypotheses and evaluating them based on the collected evidences. Clinical reasoning learning (CRL) in medicine deals with teaching medical students to acquire the basics of such cognitive strategies and to apply them [6]. In many medical schools, CRL is taught using a problem-based learning (PBL) approach. Students, in small groups (typically six to eight students), are presented with a patient case and asked to diagnose it and propose a treatment or a management plan. Under the supervision of an instructor, students will usually formulate initial hypotheses just after getting a brief presentation of the case, in the form of a textual problem statement, a live encounter with a patient simulated by a student playing a patient scenario prepared by the instructor, or a live encounter with a real patient recruited from a clinic or hospital. Then they will begin to collect evidences by asking questions to the patient (e.g., the motive of the consultation, in which parts of the body he is experiencing problems, how long and frequently the symptoms occur, his lifestyle, and so on), then they will proceed with a directed physical exam (e.g., sensing reflexes) and after that order laboratory/imaging tests (e.g., urine tests, blood tests,
X-rays or MRI). The evidences are analyzed and then the hypotheses are re-evaluated to discard those not fitting with the analysis, to keep those still uncertain, and to collect new evidences again in order to discard or confirm the remaining hypotheses or generate new ones. This process of evidence collection, evidence analysis, hypothesis evaluation and hypothesis reformulation is iterated until the students are able to narrow their list of probable hypotheses to the one or two most probable diagnoses. Students conduct their investigation based upon the declarative knowledge they have learned in PBL sessions and also on similar cases they have already seen in clinical settings or in previous CRL sessions. The clinical knowledge linking symptoms to diseases is at least in part a probabilistic exercise, whereas the actions used to collect evidences (i.e., questions, physical exams and lab tests) involve tradeoffs between costs, harm to the patient, and the value of the information obtained from the tests with regard to the current hypotheses. The instructor uses his own experience and knowledge to provide hints to the students during their investigation and sometimes to comment on variants of similar cases. Because of the cognitive complexity of CRL, it is crucial that students have a varied, very well organized and structured knowledge base, and be exposed to as many cases as possible in their curriculum, since this is mostly an experience-based learning process. To foster the acquisition of their clinical reasoning skills, it is mandatory that CRL involves small group sizes. However, this requires that a lot of instructors be available, yet for the most part these instructors are also physicians and/or researchers, hence with very limited availability, particularly given today's severe shortage of physicians in most countries. Many medical schools are therefore turning towards the use of software simulations to support at least in part the CRL activities. One possible strategy is to keep an affordable number of CRL courses directly supervised by instructors and to have students practice on computer-based cases that are variants of those learned in classes, in groups or individually, with the assumption that computer-based cases will require little supervision by instructors or no supervision at all if the computer has enough intelligence capabilities to automatically provide the required feedback to students. Providing feedback to students in computer-based CRL requires a system that not only is able to assess the strengths and weaknesses of the student's clinical reasoning process in solving a patient case, but is also able to apply metacognitive strategies to identify and correct flaws in his reasoning strategies, for instance, with regard to how he interleaves the evidence collection and hypothesis formulation processes. However, computer-based CRL systems proposed to date focus mostly on providing feedback with regard to the application of declarative clinical knowledge, with very little emphasis on feedback related to cognitive strategies about the sequencing of clinical reasoning sub-processes [3,4,9,10]. DxR is one of the most widely used commercial CRL software packages, with a database of about a hundred clinical cases covering both patient diagnosis and management [3]. However, DxR has no built-in student model other than a simple decision tree checking whether the student has reached the correct diagnosis within a given deadline and has considered all relevant hypotheses.
The decision tree is explicitly specified by the instructor who authors the case. Consequently, students using DxR autonomously learn mostly by reinforcement as the decision tree indicates whether they have the
right hypotheses or not, but with little ongoing feedback about their reasoning processes. Other commercial CRL systems such as VIPS have the same limitations [10]. The Adele [4] and COMET [9] systems provide more advanced automated tutoring feedback by using an influence diagram (i.e., a Bayesian network with actions and action utilities) to model the clinical knowledge about the patient. In fact the influence diagram represents expert knowledge about how symptoms probabilistically relate to diseases and how evidence-collection actions or clinical-knowledge-application actions relate to evidences. It is essentially the same kind of Bayesian network used in medical decision support systems to recommend a diagnosis, new evidences to collect, or a treatment to clinicians in real-life cases [7,8]. The generation of tutoring feedback is done by comparing the recommendations obtained from the network to the actions, evidences or hypotheses found by the students, and by providing hints accordingly. More specifically, at any step of the interaction between the student and the CRL interface that simulates a patient, we can recognize the evidence collected by the student and we can also determine the hypothesis he is currently working on, either by asking the student directly or by inferring them. Thus, given the current evidences as collected by the student, we can use Bayesian inference algorithms to determine the most likely hypotheses from the expert point of view and compare them with the hypotheses currently inferred by the student. Based on the comparison, we can for example give hints to the student about whether he is on the right track or not, or give hints that will put him on the right track. Conversely, given the current hypotheses collected by the student, we can again use Bayesian inference algorithms to determine the most valuable evidences to collect next, and exploit this information to give hints about which evidences to collect next and how. However, this approach does not model CRL strategies about how to interleave and refine the different phases of evidence collection, evidence analysis, hypothesis evaluation and hypothesis reformulation. An instructor supervising CRL will provide hints not just aimed at instructing the student how to apply clinical knowledge, but also hints aimed at sharing with him his "rules of thumb" about how to interleave the different phases of evidence collection, evidence analysis, hypothesis evaluation and hypothesis reformulation. For instance, he may remind the student of the appropriate time to stop the current phase of evidence collection in order to re-evaluate his hypotheses. He may also explain why at a given step it is better to collect evidences that rule out one current hypothesis rather than evidences that confirm a currently active hypothesis, for instance as a measure to minimize the cost of evidence collection and not jeopardize the patient's health. He may also intervene at certain points to provide structured knowledge about variant cases related to the one under study. For instance, after a student has correctly interpreted evidences gathered from a lab test based on the current patient's data, the instructor may discuss patient cases for which the same test results would have yielded a different interpretation. Such strategic interventions are not easily conveyed by Bayesian networks; in fact none of the above systems addresses them. We introduce clinical reasoning automata (CRA) in our model on top of influence diagrams to fill this gap.
These are hybrid finite state automata representing allowed
sequences and refinements of the different phases of evidence collection, evidence analysis, hypothesis evaluation and hypothesis reformulation. The automaton has a finite number of states, but each state represents a clinical reasoning sub-process, that is, the system continues to evolve within a state, hence the qualification "hybrid". A process modeled by a state has access to a global influence diagram encoding clinical knowledge, as in Adele [4] and COMET [9], and to local state-based production rules to provide feedback. This provides a means for modeling sequenced feedback rules, and also leads to a modular specification of tutoring feedback. In the next section we present the system that we are currently developing to simulate patients and foster clinical reasoning cognitive strategies. Then we discuss the modeling of tutoring feedback using CRA. After a conclusion, we discuss future developments.
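To make the idea concrete, the following is a minimal sketch of such an automaton: states are reasoning sub-processes with their own local feedback rules over the student model, and events switch between them. The state names, events and rules are purely illustrative and are not taken from TeachMed.

```python
class CRAState:
    """One state of the clinical reasoning automaton: a reasoning sub-process
    with its own local feedback rules (condition, hint) over the student model."""
    def __init__(self, name, feedback_rules, transitions):
        self.name = name
        self.feedback_rules = feedback_rules      # list of (condition, hint)
        self.transitions = transitions            # {event_name: next_state_name}

class ClinicalReasoningAutomaton:
    def __init__(self, states, initial):
        self.states = {s.name: s for s in states}
        self.current = self.states[initial]

    def feedback(self, student_model):
        """Fire the local rules of the current state against the student model."""
        return [hint for condition, hint in self.current.feedback_rules
                if condition(student_model)]

    def on_event(self, event):
        """Switch sub-process (and thus feedback rules) when an interesting
        event is recognised, e.g. 'evidence goal reached'."""
        target = self.current.transitions.get(event)
        if target is not None:
            self.current = self.states[target]

# Illustrative two-state automaton: collecting evidence vs. evaluating hypotheses.
collect = CRAState(
    "collect_evidence",
    [(lambda m: len(m["evidence"]) == 0,
      "Start by asking the patient about the main complaint.")],
    {"evidence_goal_reached": "evaluate_hypotheses"},
)
evaluate = CRAState(
    "evaluate_hypotheses",
    [(lambda m: m["expert_top"] not in m["student_hypotheses"],
      "Reconsider your hypothesis list against the evidence collected so far.")],
    {"hypotheses_revised": "collect_evidence"},
)
cra = ClinicalReasoningAutomaton([collect, evaluate], initial="collect_evidence")
```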
2 Patient Simulation System
Our patient simulator, TeachMed, has a graphical user interface (GUI) that provides functionalities similar to DxR [3] and VIPS [10]. The student GUI allows the student to select a patient case, access the patient's medical record, interview the patient, proceed with a physical exam on a 3D model of the patient (Fig. 1), interact with a laboratory module to order tests (Fig. 2), and finally treat and manage the patient. Instructors use another GUI to author patient cases.
Fig. 1. The student decides to take the patient’s blood pressure
Fig. 2. Abdominal plain x-ray ordered by the student
3 Automated Tutor
A key component of TeachMed is an automated tutor that monitors the student's actions to provide rapid feedback. Tutoring feedback is implemented based on a clinical influence diagram, the current system state, feedback rules triggered by the current system state, and a clinical reasoning automaton that sequences clinical reasoning sub-processes and the corresponding feedback rules.
3.1 Clinical Influence Diagram and Bayesian Inference Processes
A clinical influence diagram is a Bayesian network encoding the clinical diagnostic knowledge from an expert point of view. It specifies causal links between evidences, symptoms and pathologies (with corresponding probability distributions), as well as actions that investigate these symptoms (with corresponding utilities). Fig. 3 shows a fragment of the influence diagram we use for a case of acute diarrhea. Probabilities and utility specifications are omitted for clarity. The fragment illustrates only some nodes with a few evidence-collection actions (questions to the patient and lab tests). The entire model for the diarrhea case contains 70 nodes, including physical exam actions and curriculum-knowledge application actions to collect evidences. The influence diagram is implemented using the Smile Bayesian network library [2]. Evidences gathered by the student are recorded into a student evidence table. Given this table, we use the Smile library to compute the posterior probability of a set of hypotheses from the expert point of view and rank them according to their likelihood; this yields the expert hypothesis table. Intuitively, assuming the influence diagram correctly
correctly models the expert clinical knowledge, the expert would infer these hypotheses from the given evidence. On the other hand, based on his evidence table, the student will infer his own hypothesis table (i.e., the student hypothesis table), which may or may not match the expert hypothesis table.
Fig. 3. Fragment of the Bayesian network for the acute diarrhea case
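To make the comparison between the expert and student hypothesis tables concrete, the following minimal sketch ranks hypotheses by posterior probability over a toy model; it is our illustration only, not TeachMed's SMILE-based implementation, and all disease names and probabilities in it are invented.

```python
# Sketch (not TeachMed's code): rank hypotheses given collected evidence,
# then check whether the student is pursuing one of the expert's top picks.
PRIORS = {"acute_infectious_diarrhea": 0.5, "inflammatory_bowel_disease": 0.3,
          "food_intolerance": 0.2}
LIKELIHOODS = {  # P(finding present | disease), assuming independent findings
    "fever":       {"acute_infectious_diarrhea": 0.70,
                    "inflammatory_bowel_disease": 0.40, "food_intolerance": 0.05},
    "blood_stool": {"acute_infectious_diarrhea": 0.30,
                    "inflammatory_bowel_disease": 0.60, "food_intolerance": 0.01},
}

def expert_hypothesis_table(evidence):
    """Posterior P(disease | evidence) via a naive Bayesian update, ranked."""
    post = dict(PRIORS)
    for finding, present in evidence.items():
        for disease in post:
            p = LIKELIHOODS[finding][disease]
            post[disease] *= p if present else 1.0 - p
    z = sum(post.values())
    return sorted(((d, p / z) for d, p in post.items()), key=lambda t: -t[1])

def hint_trigger(expert_table, student_hypotheses, top=2):
    """Return the expert's top hypotheses the student is (not) considering."""
    expert_top = {d for d, _ in expert_table[:top]}
    return expert_top & set(student_hypotheses), expert_top - set(student_hypotheses)

table = expert_hypothesis_table({"fever": True, "blood_stool": False})
on_track, missed = hint_trigger(table, ["food_intolerance"])
print(table)             # ranked expert hypothesis table
print(on_track, missed)  # basis for "are you on the right track?" hints
```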
In principle, the next best evidence-gathering process should be one yielding a value of information that exceeds the cost of gathering the information. However, in general it is very difficult to express the utilities of evidence-gathering actions with a level of accuracy that takes into account all possible evidence observations [7,8]. Therefore, we do not rely entirely on the utilities specified for evidence-gathering actions to provide feedback about the next evidence-gathering process. Rather, we use these utilities as heuristic input to feedback rules that provide the actual hints about the next evidence-gathering steps. More specifically, based on the current student's evidence and hypothesis tables, we use the SMILE library to compute evidence-collection actions and store them in a next student evidence-action table, on top of which we can express facts that are preconditions of feedback rules. For instance, we can specify a feedback rule stating that "if acute infectious diarrhea is not established and the urine test's value of information is lower than the blood test's by no more than threshold x, then perform the urine test". In order to provide feedback, TeachMed needs to determine the current hypothesis the student is trying to confirm or reject (i.e., the student's hypothesis goal) or the current evidence he is trying to collect (i.e., the student's evidence goal). We could obtain this information by simply asking the student, but this would result in too much intervention, disturbing the tutoring strategy, particularly for advanced students who just need occasional help. Instead, based on the observed student's actions (i.e., the questions he asks, the physical exams he makes, or the lab tests he orders), we use the SMILE library to infer the evidence he is trying to collect, and similarly for the student's hypothesis goal. If the output of this goal recognition is below a validity threshold, we prompt the student with a question to get the goal he is working on directly from him. If the recognition output is a list of at most three elements with valid probability above the
threshold, the question looks like "are you trying to establish x, y or z?" Otherwise, the question is a direct one such as "what hypothesis are you trying to establish?"
3.2 Student Model and Model Tracer
TeachMed maintains a student model reflecting the student's progress information inferred from the influence diagram as explained above (i.e., the student evidence table, expert evidence table, student hypothesis table, the student's hypothesis goal, and the student's evidence goal), as well as the student's input for feedback requests (e.g., the student can ask TeachMed to help him decide between two interpretations of a lab test). A process, called the student's model tracer, monitors the student's actions and periodically invokes the update processes described above to maintain the student's model.
3.3 Feedback Rules and Feedback Generator
The feedback generator is a process that periodically checks the current student's model to trigger tutoring feedback to the student, using production rules that are preconditioned on data in the student's model and have tutoring hints as consequents. These are "teaching" expert rules and can be as effective as the available teaching expertise allows. To illustrate, a feedback rule of the form "if student-asks-help(interpret, test, results) then hint-test-interpretation(test, results)" will trigger a hint helping the student with the interpretation of a lab test. Following Andes' approach [5], hints are implemented using template-based heuristic dialogues, with slots replaced by data in the student's model. With the previous example, TeachMed will engage in a tree-structured dialog with the student by asking him his impressions about the interpretation and issuing hints until the dialog concludes in the correct interpretation. Feedback rules can also be triggered on TeachMed's initiative, by having rules with antecedents based upon the evaluation of the student's progress as reflected by the student's model.
3.4 Clinical Reasoning Automaton
A clinical reasoning automaton (CRA) is a state transition system that abstracts the evolution of the student's model under the student's actions. The feedback generator starts in a given state of the automaton. In the current state, it is synchronized with the model tracer monitoring the student's actions, focusing on one sub-goal, and using a set of feedback rules to help the student. A transition to a new state switches the sub-goal and the feedback rules to those of the new state. From another perspective, a state of the CRA represents a sub-process of the clinical reasoning (e.g., the student is collecting evidence for "Acute Diarrhea"). Therefore, a CRA state is a dynamic state in that the student's model continues to evolve; for example, the evidence and hypothesis tables continue to change under the student's actions. A transition signals the occurrence of an event that is sufficiently interesting (from the point of view of a given tutoring strategy) to switch the context (e.g., the student has finished collecting evidence for the current student's evidence goal). More formally, a CRA is a hybrid automaton [1], consisting of:
• A finite set of variables (composing the student's model), a time variable (counting the time elapsed in a state), and possibly additional variables defined by the patient-case author (e.g., a student performance variable).
• A finite set of states, each consisting of the variables (the time and student's model variables remain implicit), feedback rules, and functions describing the evolution of each variable in the state.
• A start state.
• A finite set of transitions between states. A transition is a triplet of the form (u, test, v), where test is a Boolean expression over variables in state u (implicit and/or explicit), possibly involving author-defined Boolean functions over these variables. The transition means that if the test holds (i.e., the transition is enabled), then the feedback generator moves into state v. If several transitions are simultaneously enabled, it moves nondeterministically into one of the successor states.
The author must define functions determining the evolution of the variables he specifies. For the built-in variables, the functions are defined as follows. The time variable is initialized to zero upon entering the state and is automatically incremented as time passes. The variables in the student's model are updated by the model tracer. There is no explicit notion of an accepting run for a CRA; the goodness or badness of runs is conveyed to the learner via the feedback given in states. However, it is possible to implement rules with a side effect of updating a performance score for the student. One can also specify transitions to states in which the feedback indicates complete success or complete failure for the task. Assuming that we are in a state monitoring the student collecting evidence for some symptom, we can have an outgoing transition labeled "hypothesis-goal-changed" (i.e., the model tracer has established that the student has switched to another hypothesis) to a state in which we watch whether the student remembers to reformulate his hypotheses. In this new state, if the student attempts any action not related to the evaluation and reformulation of his hypotheses, he will get feedback hints aimed at making him remember why at this stage he should reformulate his hypotheses. It would be possible to express this feedback without using a CRA transition to a new state, but this would result in complex, unstructured feedback rules that not only become difficult to maintain, but also execute less efficiently, since at every stage the system would have a large number of rules to consider. In contrast, CRA feedback rules only depend on states, and in the current state there are only a few of them to consider, namely those active in that state. It is possible to encode feedback rules that depend on a trace (i.e., a sequence of states) by defining a variable that records the relevant part of the trace in a state and specifying feedback rules whose antecedents depend on the current value of that variable.
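As a concrete rendering of this definition, the sketch below implements the skeleton of a CRA in which each state carries its own small set of feedback rules and the time variable is reset on state entry. It is our simplification for illustration, not the authors' implementation; the state names, variables and rules are invented.

```python
# Sketch of a clinical reasoning automaton (CRA): states carry their own
# feedback rules; transitions are guarded by tests over the student model.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Model = Dict[str, object]                   # student's model: variable -> value
Rule = Tuple[Callable[[Model], bool], str]  # (precondition, hint)

@dataclass
class State:
    name: str
    rules: List[Rule] = field(default_factory=list)

@dataclass
class CRA:
    states: Dict[str, State]
    transitions: List[Tuple[str, Callable[[Model], bool], str]]  # (u, test, v)
    current: str

    def step(self, model: Model) -> List[str]:
        model["time_in_state"] = model.get("time_in_state", 0) + 1
        for u, test, v in self.transitions:  # fire the first enabled transition
            if u == self.current and test(model):
                self.current = v
                model["time_in_state"] = 0   # time variable reset on entry
                break
        # only the few rules local to the current state are considered
        return [hint for pre, hint in self.states[self.current].rules if pre(model)]

collect = State("collect-evidence",
                [(lambda m: m["off_topic_actions"] > 2,
                  "Is this question relevant to your working hypothesis?")])
reformulate = State("reformulate-hypotheses",
                    [(lambda m: True, "Formulate your new working hypothesis.")])
cra = CRA(states={s.name: s for s in (collect, reformulate)},
          transitions=[("collect-evidence",
                        lambda m: m["hypothesis_goal_changed"],
                        "reformulate-hypotheses")],
          current="collect-evidence")

model: Model = {"off_topic_actions": 3, "hypothesis_goal_changed": False}
print(cra.step(model))   # hint from the evidence-collection state
model["hypothesis_goal_changed"] = True
print(cra.step(model))   # transition fires; the new state's rules take over
```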
4 Experiments
TeachMed is still under development and so far has been tested on simple scenarios. To illustrate the use of CRA, at a given point the student may be asking the patient a series of questions focusing on one particular hypothesis among his current list,
to confirm or reject it. We say that this is the working hypothesis. Usually it is not a good idea to switch the working hypothesis back and forth without exhausting all relevant questions on the current one, unless new evidence makes re-activating a hypothesis relevant. Therefore, whenever the student asks a question that is irrelevant to the working hypothesis, we determine whether he has changed his working hypothesis without explicitly stating it (in which case the question may actually be relevant to the new working hypothesis). We then check whether he has exhausted all questions related to the working hypothesis at this point. There are two possibilities: (a) if he has indeed exhausted all relevant questions, a feedback message is displayed asking him to formulate his new working hypothesis, with a transition to a new state of the CRA activating the feedback rules for hypothesis formulation; (b) otherwise, TeachMed initiates a dialogue with him aimed at making him realize that there is still evidence to collect, using questions, for the current working hypothesis. The following dialogue illustrates (a). Some answers of the simulated patient to the student's questions are omitted for brevity. Contrary to what the dialogue may suggest, there is no natural language processing: the student's questions are used for a keyword search in a database of questions, and the matched ones are displayed for the student to pick the one corresponding to the question he actually wants to ask.

Collected Evidence: Acute lower abdominal pain
Working Hypothesis: Urinary infection
Student (Question): "Any fever?"
Student (Question): "Could you describe your pain?"
Student (Question): "Is it worse on one side?"
Student (Question): "How many times did you urinate since the beginning of your pain?"
Student (Question): "Do you have a burning sensation on urination?"
Student (Question): "Do you have a sexual partner?"
Tutor: Is this question relevant to the hypothesis you are working with? (Yes / No - I'm examining another hypothesis.)
Student: No - I'm examining another hypothesis.
Tutor: Formulate your hypothesis.
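The keyword matching step behind this dialogue could look like the following sketch (ours; the question database and stop-word list are invented): the student's free text is reduced to keywords and scored against the authored questions.

```python
# Sketch of keyword-based question matching (no natural language processing).
STOPWORDS = {"do", "you", "any", "a", "the", "is", "it"}
QUESTION_DB = [
    "Do you have any fever?",
    "Do you have a burning sensation on urination?",
    "How many times did you urinate since the beginning of your pain?",
]

def keywords(text):
    return {w.strip("?,.").lower() for w in text.split()} - STOPWORDS

def match_questions(student_text, db=QUESTION_DB, k=3):
    query = keywords(student_text)
    scored = sorted(((len(query & keywords(q)), q) for q in db), reverse=True)
    return [q for score, q in scored if score > 0][:k]

# The student picks the intended question from the returned candidates.
print(match_questions("burning feeling when urinating"))
```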
5 Conclusion
In our medical school, CRL is currently conducted live in small groups of eight students under the supervision of a professor who is a domain expert. One student is picked to play the role of a patient and another plays the role of the physician. The remaining students are observers but also actively participate in the discussion. The patient student behaves according to instructions prepared on paper by the professor, based on pedagogical objectives fixed by the curriculum committee. The school is planning to start using TeachMed to complement these simulations in the winter of 2006. The students will use TeachMed in an autonomous mode. Besides this forthcoming deployment of TeachMed in our medical curriculum, future work includes the implementation of an interface between TeachMed and a real-world
health record containing information about patients and how they were treated. As noted by Gertner et al. in their physics tutoring system [5], a Bayesian network implicitly models solutions, but not the solution process itself. Yet being able to model the solution process can improve the quality of the feedback given to the student. A similar observation holds in CRL, with the difference that the solving process is harder to formalize other than by giving general strategies about the interleaving of the different clinical reasoning sub-processes. Nevertheless, we feel that the availability of one expert solution, albeit not encompassing all experts' solutions, can be exploited to provide enriched feedback. In the simplest case, we could show the solution to the student as one example of an expert's approach. Given that an electronic health record does not contain clinical reasoning traces, but just the solutions (i.e., the diagnosis and the patient treatment), the challenge will be to find techniques for inferring expert solving processes from expert solutions.
References
1. Alur, R., Courcoubetis, C., Henzinger, T., Ho, P.-H.: Hybrid Automata: An Algorithmic Approach to the Specification and Verification of Hybrid Systems. In: Proceedings of Hybrid Systems. Lecture Notes in Computer Science, Vol. 736, Springer-Verlag (1993) 209-229
2. Druzdzel, M.: SMILE: Structural Modeling, Inference, and Learning Engine and GeNIe: A Development Environment for Graphical Decision-Theoretic Models. In: Proceedings of the Eleventh Conference on Innovative Applications of Artificial Intelligence (1999) 902-903 (see http://www.sis.pitt.edu/~genie/ for system download)
3. DxR Clinician: http://www.dxrgroup.com/
4. Ganeshan, R., Johnson, L., Shaw, E., Wood, B.: Tutoring Diagnostic Problem Solving. In: Proceedings of the Conference on Intelligent Tutoring Systems. Lecture Notes in Computer Science, Vol. 1839 (2000) 33-42
5. Gertner, A., Conati, C., VanLehn, K.: Procedural Help in Andes: Generating Hints Using a Bayesian Network Student Model. In: Proceedings of the National Conference on Artificial Intelligence, AAAI Press (1998) 106-111
6. Gruppen, L., Frohna, A.: Clinical Reasoning. In: International Handbook of Research in Medical Education. Kluwer Academic (2000) 205-230
7. Heckerman, D., Horvitz, E., Nathwani, B.: Towards Normative Expert Systems: Part I. The Pathfinder Project. Methods of Information in Medicine (1992) 90-105
8. Onisko, A., Lucas, P., Druzdzel, M.: Comparison of Rule-Based and Bayesian Network Approaches in Medical Diagnostic Systems. In: Proceedings of the Eighth Conference on Artificial Intelligence in Medicine in Europe. Lecture Notes in Computer Science, Vol. 2101, Springer-Verlag (2001) 281-292
9. Suebnukarn, S., Haddawy, P.: A Collaborative Intelligent Tutoring System for Medical Problem-Based Learning. In: Proceedings of the Ninth Conference on Intelligent User Interfaces (2004) 14-21
10. VIPS: Virtual Internet Patient Simulator: https://www.swissvips.ch/fr/index.htm
Implicit Learning System for Teaching the Art of Acute Cardiac Infarction Diagnosis*
Dmitry Kochin¹, Leonas Ustinovichius², and Victoria Sliesoraitiene²,³
¹ Institute for System Analysis, 117312, prosp. 60-letija Octjabrja 9, Moscow, Russia, [email protected]
² Vilnius Gediminas Technical University, LT-10223, Sauletekio al. 11, Vilnius, Lithuania, [email protected]
³ Vilnius University Children's Hospital, Rehabilitation and Education Centre, LT-08406, Santariskių 7, Vilnius, Lithuania, [email protected]
Abstract. There are two types of knowledge: declarative (theoretical) and procedural (practical skills). While the former may be acquired by reading books, the latter requires long, intensive practice. The majority of computer-aided learning systems teach declarative knowledge only. This paper presents the basic ideas of building intelligent computer systems for teaching procedural expert knowledge, such as medical diagnostic skills. Two subproblems are considered: the elicitation of an experienced physician's decision rules and the construction of a computer system for teaching these rules. Such systems utilize the principle of implicit learning. The authors present a methodology for the practical realization of these ideas, applied to teaching the art of acute cardiac infarction diagnosis.
1 Introduction
There are two types of knowledge: declarative and procedural. Declarative knowledge includes theories, facts and data. It can be described in books; the most illustrative example of declarative knowledge is a textbook. Procedural knowledge is the ability to apply declarative knowledge in practice. A person who has acquired mastery in a professional field is called an expert. Becoming an expert requires at least 10 years of intensive practice as well as guidance by an experienced tutor [1, 8]. One of the approaches to shortening this time is computerized learning systems. They allow intensifying the learning process, increasing the motivation of novices and
* This work is supported by the scientific programs "Mathematical Modeling and Intelligent Systems" and "Fundamentals of Information Technology and Systems" of the Russian Academy of Sciences; by projects 04-01-00290 and 05-01-00666 of the Russian Foundation for Basic Research; and by grant 1964.2003.1 of the President of the Russian Federation for the support of prominent scientific schools.
facilitating the job of a tutor. However, the majority of modern learning systems teach only declarative knowledge; that is, they appear to be just interactive textbooks. Of course, in many cases such learning systems allow students to learn better due to a more "alive" presentation of the material, but still, on completion of the learning the student does not have enough practical skills and requires additional practice. In order to build a learning system we have to solve two problems [5]:
1. To build a knowledge base imitating the knowledge of the given expert.
2. To transfer this knowledge base to a novice, thus teaching him/her to solve practical problems like experts do.
In this paper we consider a particular approach to the construction of learning systems for teaching procedural knowledge, allowing a novice to gain the experience of solving practical tasks during learning. We present the methodology of a learning system applied to teaching the art of Acute Cardiac Infarction (ACI) diagnosis.
2 Building the Knowledge Base
To solve the first problem, we can use the expert classification approach [6, 2], allowing us to build a comprehensive and consistent knowledge base in an individual knowledge field within a relatively short time. This approach was first developed by Oleg Larichev [6] and is designed for problems where an expert assigns different objects to different classes. Medical diagnosis perfectly suits this model: a doctor assigns a patient (object), described by a set of distinct attributes (symptoms), to a particular class (diagnosis). Contrary to machine learning methods, this approach results in a precise knowledge base, because all decisions affecting the base are made manually by an interviewed expert. Let us formally state the problem of expert classification.
1. $G$ – a property corresponding to the target criterion of the problem (for example, presence of a specific disease and the degree of its development).
2. $K = \{K_1, K_2, \dots, K_N\}$ – a set of attributes (such as groups of symptoms) by which each object (patient) is evaluated.
3. $S_q = \{k_1^q, k_2^q, \dots, k_{\omega_q}^q\}$ for $q = 1, \dots, N$ – a set of verbal attribute values on the scale of attribute $K_q$; $\omega_q$ is the number of values for $K_q$; the values in $S_q$ are in descending order of typicality for the property $G$. That is, on each $S_q$ a linear reflexive antisymmetric transitive relation is defined: $Q_q : (k_i^q, k_j^q) \in Q_q \Leftrightarrow i \le j$. For example, in the ACI diagnosis application we have an attribute "Age" with values ("Older than 30 years", "Younger than 30 years"); an age over 30 is more typical of ACI.
4. $Y = S_1 \times S_2 \times \dots \times S_N$ – the Cartesian product of the attribute scales, defining the space of object states to be classified. Each object is described by a set of values of the attributes $K_1, \dots, K_N$ and can be represented by a vector $y \in Y$, $y = (y_1, y_2, \dots, y_N)$, where $y_q$ equals the index of the corresponding value in $S_q$.
5. $L = |Y| = \prod_{q=1}^{N} \omega_q$ – the cardinality of $Y$.
6. $C = \{C_1, C_2, \dots, C_M\}$ – a set of decision classes in descending order of intensity of the property $G$. On the set $C$ a linear reflexive antisymmetric transitive relation is defined: $Q_C : (C_i, C_j) \in Q_C \Leftrightarrow i \le j$. With respect to ACI, there are three classes: "ACI", "Possible ACI, but additional study is required" and "No ACI (may be another disease)". The classes are ordered by the degree of intensity of ACI in the class formulation.
Let us introduce the binary dominance relation:

$$P = \{ (x, y) \in Y \times Y \mid \forall q = 1 \dots N : x_q \ge y_q \ \text{and} \ \exists q_0 : x_{q_0} > y_{q_0} \} \qquad (1)$$

Required: to build, with the help of the expert, a mapping $F : Y \to \{Y_i\}$, $i = 1 \dots M$, such that $Y = \bigcup_{l=1}^{M} Y_l$ and $Y_l \cap Y_k = \emptyset$ for $k \ne l$ (where $Y_i$ is the set of vectors belonging to class $C_i$), satisfying the consistency condition:

$$\forall x, y \in Y : x \in Y_i,\ y \in Y_j,\ (x, y) \in P \Rightarrow i \ge j \qquad (2)$$
The knowledge base is comprehensive in the sense that every possible combination of values of the diagnostic attributes is contained in the base along with its class (diagnosis). And it is logically consistent as long as condition (2) is satisfied. The method solving this problem uses the expert as the only information source. To assign an object to a class, the method requires questioning the expert. In general, to assign each object to a class, the expert would need to consider all objects. However, by utilizing relations (1) and (2) it is possible to build a far more efficient procedure, in which the majority of objects are classified indirectly. For example, provided that the expert assigned an object with moderate dyspnea to class ACI, we can judge that an object with the same symptoms but an asthmatic fit should also be assigned to class ACI, because an asthmatic fit is more typical of ACI than moderate dyspnea. To build a comprehensive knowledge base of ACI diagnosis using the method CLARA (SAC) (about 2500 real-life combinations of symptoms), the expert was directly presented with only 250 clinical situations. By analyzing the knowledge base it is possible to reveal decision rules defining the whole classification [4, 6].
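The indirect classification exploiting (1) and (2) can be sketched as a simple bound-propagation procedure; this is our illustration of the underlying idea, not the actual CLARA questioning strategy, which also chooses which object to present to the expert next.

```python
# Sketch: one expert answer classifies further vectors via relations (1)-(2).
from itertools import product

def dominates(x, y):  # relation P of Eq. (1)
    return all(a >= b for a, b in zip(x, y)) and any(a > b for a, b in zip(x, y))

def expert_assigns(x, c, lo, hi):
    """Expert puts vector x into class index c; condition (2) then bounds
    every comparable vector: (x, y) in P implies class(x) >= class(y)."""
    lo[x] = hi[x] = c
    for y in lo:
        if dominates(x, y):            # y is more typical, so class(y) <= c
            hi[y] = min(hi[y], c)
        elif dominates(y, x):          # y is less typical, so class(y) >= c
            lo[y] = max(lo[y], c)

M = 3                                  # classes C1..C3, C1 = "ACI"
scales = [2, 2, 3]                     # omega_q for three toy attributes
Y = list(product(*(range(1, w + 1) for w in scales)))
lo = {y: 1 for y in Y}
hi = {y: M for y in Y}
expert_assigns((1, 1, 2), 1, lo, hi)   # e.g. moderate-dyspnea-like vector -> ACI
decided = [y for y in Y if lo[y] == hi[y]]
print(f"{len(decided)} of {len(Y)} vectors classified after one answer")
```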
3 Implicit Learning
The goal of learning is the creation of subconscious rules in the long-term memory of a young specialist, which would allow him/her to make decisions identically to an expert. One of the first learning systems used rule-based learning: a student was presented with diagnostic rules for memorizing, and then the system asked him/her to diagnose hypothetical patients. If the student made a mistake, the system showed the correct decision rule. This scheme, while perfect for solving standard mathematical problems, was ineffective in teaching the art of medical diagnosis. It was discovered that learning by explicitly specified decision rules did not develop any clinical thinking based on the
analysis of patient descriptions, but led to routine logical and arithmetic manipulations. Forgetting only a small fragment of a rule produced catastrophic results in the knowledge verification tests. The way out of this deadlock was the idea that one should not try to "transmit" the decision rules of the expert to the novice, but should rather help the novice to "grow" them on his/her own by using the human ability for implicit learning. Research in this interesting direction of cognitive psychology was aimed at solving abstract test problems such as artificial grammars [3]. In such problems the subject was presented with different sets of symbols generated by some rule, and the task was to recognize whether this unknown rule applied to the control sets of the verification test. The unique feature of knowledge acquired in this way was the fact that subjects kept the acquired skills for several years without being able to verbalize the rule itself. We redesigned the computer system for the new paradigm. The system does not show the rules anymore, but only presents a series of hypothetical patients (randomly selected from the knowledge base), asking the student for a diagnosis (Fig. 1). Experimenting with the new system brought very interesting results. During the whole course of learning (two days, four hours a day), every student (students of the Russian Academy of Post-diploma Education and young physicians of the Botkin hospital in Moscow) solved about 500 tasks of different complexity. During the verification test the subjects gave 90-100% of answers identical to the expert's
Fig. 1. The learning system with the comment on a diagnostic task solution
answers, but failed, just like the expert, to verbalize the decision rules describing their decisions [7]. It should be noted that this method of teaching, based on students' creative analysis of their own decisions and comparing them to expert decisions, helps them not just start solving diagnostic problems at a higher quality level, but significantly improve their clinical thinking. The teaching methodology developed includes a gradual increase of task complexity depending on the progress of the student, a theoretical course in ACI, and comments on the student's mistakes.
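The core session loop of such a system is simple; the sketch below (our reconstruction, with invented field names and an `ask_student` callback standing in for the GUI) shows random case sampling, feedback on mistakes, and the gradual increase of task complexity.

```python
# Sketch of the implicit-learning loop: no rules are ever shown, only cases.
import random

def session(knowledge_base, ask_student, n_tasks=500):
    level, correct = 1, 0
    for _ in range(n_tasks):
        pool = [c for c in knowledge_base if c["complexity"] <= level]
        case = random.choice(pool)
        answer = ask_student(case["symptoms"])      # student's diagnosis
        if answer == case["expert_class"]:
            correct += 1
            if correct % 10 == 0:                   # progress raises difficulty
                level += 1
        else:                                       # comment on the mistake
            print("The expert's diagnosis is:", case["expert_class"])
    return correct / n_tasks
```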
4 Conclusion
This paper has presented the basic ideas of building intelligent computer systems for teaching procedural expert knowledge. The system contains a knowledge base built with the help of a prominent expert. This base is used to pick sample tasks during the learning process. The system utilizes the principle of implicit learning to transfer knowledge to a student. Implicit learning results in better and longer memorizing than rule-based approaches. Such systems can be used for training young specialists as well as for retraining specialists experienced in other knowledge fields, who could acquire expert skills in less time and with fewer fatal mistakes. The new computer systems create the conditions for developing a new type of university education aimed at preparing young specialists possessing not only theoretical knowledge, but also practical skills.
References
1. Ericsson, K.A., Lehmann, A.C.: Expert and Exceptional Performance: Evidence of Maximal Adaptation to Task Constraints. Annual Review of Psychology 47 (1996) 273-305
2. Kochin, D.Yu., Larichev, O.I., Kortnev, A.V.: Decision Support System for Classification of a Finite Set of Multicriteria Alternatives. Decision Support Systems 33 (2002) 13-21
3. Reber, A.S.: Implicit Learning of Artificial Grammars. Journal of Verbal Learning and Verbal Behavior 6 (1967) 855-863
4. Asanov, A.A., Kochin, D.Yu.: The Elicitation of Subconscious Expert Decision Rules in Multicriteria Classification Problems. In: CAI-2002 Conference Proceedings, Vol. 1. Fizmatlit, Moscow (2002) 534-544 (in Russian)
5. Larichev, O.I., Brook, E.I.: Computer Learning as Part of University Education. Universitetskaya Kniga 5 (2000) 13-15 (in Russian)
6. Larichev, O.I., Mechitov, A.I., Moshkovich, E.M., Furems, E.M.: Expert Knowledge Elicitation. Nauka, Moscow (1989) (in Russian)
7. Larichev, O.I., Naryzhniy, E.V.: Computer Learning of Procedural Knowledge. Psikhologicheskiy Zhurnal 20(6) (1999) 53-61 (in Russian)
8. Peldschus, F., Messing, D., Zavadskas, E.K., Ustinovičius, L.: LEVI 3.0 – Multiple Criteria Evaluation Program. Journal of Civil Engineering and Management 8. Technika, Vilnius (2002) 184-191
Which Kind of Knowledge Is Suitable for Redesigning Hospital Logistic Processes?
Laura Măruşter and René J. Jorna
Faculty of Management and Organization, University of Groningen, PO Box 800, 9700 AV Groningen, The Netherlands
Abstract. A knowledge management perspective is rarely used to model a process. Using the cognitive perspective on knowledge management, in which we start our analysis with events and knowledge (bottom-up) instead of with processes and units (top-down), we propose a new approach for redesigning hospital logistic processes. To increase the efficiency of care for multi-disciplinary patients, knowledge tailored in content and type to support the reorganization of care should be provided. We discuss the advantages of several techniques for providing robust knowledge about the hospital logistic process by employing electronic patient records (EPRs) and diagnosis treatment combinations (DTCs).
1 Introduction
Due to the application of information technology and the promotion of structural change in certain organizations, business processes need to be redesigned, an approach coined Business Process Reengineering [1]. As changes occur in the business environment, the existing process is at risk of becoming misaligned and consequently less effective. To increase the capability of enterprises to learn faster through their processes, strategies have been proposed that construct knowledge management systems around processes to: (i) increase their knowledge-creating capacity, (ii) enhance their capacity to create value, and (iii) make them more able to learn [2]; in other words, "... knowledge management (KM) can be thought of as a strategy of business process redesign" [2]. Modelling and redesigning a business process involves more than restructuring the workflow; therefore business professionals need to learn how to describe, analyze, and redesign business processes using robust methods. Using the KM perspective, we make a distinction between the content and the form of knowledge. Concerning the content of knowledge, we provide support for people to analyze, model and reorganize hospital logistic processes with the new knowledge that can be distilled from electronic patient records (EPRs) and from diagnosis treatment combinations (DTCs). With respect to the forms (types) of knowledge, a classification consisting of sensory, coded and theoretical knowledge has been used, based on the ideas of Boisot [3] and on insights from cognitive science [4]. Sensory knowledge is based on sensory experience; it is very difficult to code. Coded knowledge is a representation based on a conventional relation between the representation and that which is being referred
to. Coded knowledge comprises all types of signs or symbols, expressed as texts, drawings, or mathematical formulas. A person who is able to explain why certain pieces of knowledge belong together possesses theoretical knowledge concerning this specific knowledge domain. Theoretical knowledge is often used to identify causal relations (i.e., if-then relations). Employing the knowledge categorization defined by van Heusden & Jorna [5] (i.e., sensory, coded and theoretical knowledge), we propose a KM perspective for redesigning a business process which consists of:
1. Knowledge creation. Raw data are first converted into coded knowledge; this knowledge is then used to provide theoretical knowledge about the hospital logistic process.
2. Knowledge use and transfer. The new theoretical knowledge can be used for analyzing, diagnosing and reorganizing the hospital logistic process. Because of its codification, it can easily be transferred to other people, or from one part of the organization to another.
The paper is organized as follows: in Section 2 we describe two forms of knowledge conversion and the creation of new knowledge about the treatment process in a hospital. In Section 3 we illustrate how the newly created knowledge can be used to reorganize the hospital logistic process. We conclude our paper with directions for further research in Section 4.
2 Knowledge Conversion: New Knowledge About Patients
The number of patients who require the involvement of different specialisms is increasing because of the increasing specialization of doctors and an aging population. Special arrangements have emerged for these patients; however, the specialisms comprising these centers are based on the specialists' perceptions. In other words, specialists have a certain sensory knowledge about which specialisms should form a center, which could possibly be supported by some quantitative information (e.g., frequencies of visits), but knowledge expressed in models or rules is missing. Given some criteria for selecting patients, we use the records of 3603 patients from the Elisabeth Hospital, Tilburg, The Netherlands. The collected data refer to personal characteristics (e.g., age, gender), policlinic visits, clinical admissions, and visits to radiology and functional investigations. Medical specialists need knowledge expressed in explicit models (i.e., theoretical knowledge) as a basis for creating multi-disciplinary units, where different specialisms coordinate the treatment of specific groups of patients. However, the information existing in electronic patient records (EPRs) needs to be distilled in a meaningful way. We obtain the needed knowledge by converting (i) raw data into coded knowledge (RD⇒CK) and (ii) coded knowledge into theoretical knowledge (CK⇒TK).
2.1 The Development of Logistic Patient Groups
For the first type of conversion, RD⇒CK, we use the idea of operationalizing logistic complexity by distinguishing six aggregated logistic variables, developed in [6]. Theoretical knowledge provides knowledge expressed as causal and structural relations. This knowledge is necessary when taking decisions on re-organizing hospitals. For the second type of conversion, CK⇒TK, we refer again to [6]. First, two valid clusters are identified: one cluster of "moderately complex" patients, and another cluster of "complex" ones. Second, a rule-based model is induced to characterize these clusters. As a general characteristic, patients from the "moderately complex" cluster have visited up to three different specialists, while patients from the "complex" cluster have visited more than three different specialists [6].
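A drastically simplified sketch of these two conversions follows (ours; the real study [6] uses six aggregated logistic variables and validated clustering rather than a single count and a threshold): raw visit records are coded into a complexity feature, from which the two groups are derived.

```python
# Sketch: RD=>CK (visit records -> complexity feature) and CK=>TK (feature
# -> "moderately complex" / "complex" patient groups).
from collections import defaultdict

def distinct_specialisms(visits):
    """visits: iterable of (patient_id, specialism) pairs from the EPR."""
    seen = defaultdict(set)
    for pid, spec in visits:
        seen[pid].add(spec)
    return {pid: len(s) for pid, s in seen.items()}

def group_patients(counts, threshold=3):
    return {pid: "complex" if n > threshold else "moderately complex"
            for pid, n in counts.items()}

visits = [(1, "CHR"), (1, "INT"), (2, "CRD"), (2, "NRL"),
          (2, "NEUR"), (2, "INT"), (2, "CHR")]
print(group_patients(distinct_specialisms(visits)))
# {1: 'moderately complex', 2: 'complex'}
```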
2.2 Discovering the Process Models for Patient Logistic Groups
Theoretical knowledge expressed as clear-cut patient logistic groups and their rule-based characterizations does not provide enough understanding and arguments for (re-)organizing multi-disciplinary units. To better understand the treatment process, we need insight into the logistic process model. Modelling an existing process often produces a prescriptive model that captures what "should" be done, rather than describing the actual process. A modelling method closer to reality is to use the data collected at runtime, recorded in a process log, to derive a model explaining the recorded events. This activity is called process discovery (or process mining) (see [7]). Employing the process discovery method using the Petri net formalism described in [8], we construct process models for patients from the "moderately complex" and "complex" clusters. Thus, we convert coded knowledge (the process executions recorded in the process logs) into theoretical knowledge. In Figure 1 (a) the discovered Petri net model using only medical cases from the "moderately complex" cluster is shown. On every possible path, at most three different specialisms are visited, e.g. "CHR, INT" or "CRD, NEUR, NRL" (the visits for functional investigations/radiology are not counted as specialisms)¹. Using the process discovery method, we obtain insights into "aggregated" process models for the patients belonging to the "moderately complex" and "complex" clusters. We want to be more patient-oriented, that is, to obtain insights at the individual patient level. We therefore obtain theoretical knowledge expressed as instance graphs; an instance graph is a graph in which each node represents one log entry of a specific instance [9]. The advantage of this technique is that it can look at individual or at grouped patient instances. In Figure 1 (b) we construct the instance graph for a selection of 5 "moderately complex" patients (the instance graph has been produced with the ProM software, version 1.7 [10]). Moreover, we can see the
¹ The node labels mean: CHR - surgery, CRD - cardiology, INT - internal medicine, NRL - neurology, NEUR - neurosurgery, FNKT (FNKC) - functional investigations, RONT (ROEH, RMRI, RKDP, ROZA) - radiology.
Fig. 1. The discovered process models: (a) the Petri net model for the "moderately complex" cluster; (b) the instance graph for 5 patients in the "moderately complex" cluster
hospital trajectory of patients with specific diagnoses. For example, in Figure 1 (b) we can inspect the path of patients with diagnosis "d440" (atherosclerosis).
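As a hint of how such models are obtained, the sketch below computes the directly-follows relation from a process log, the usual starting point of process-discovery algorithms; it is our illustration and not the actual method of [8], which goes on to construct a full Petri net.

```python
# Sketch: derive directly-follows counts from traces in a process log.
from collections import Counter

def directly_follows(log):
    """log: list of traces; each trace is a list of activities in order."""
    df = Counter()
    for trace in log:
        for a, b in zip(trace, trace[1:]):
            df[(a, b)] += 1
    return df

log = [["CHR", "INT", "RONT"],           # three hypothetical patient traces
       ["CRD", "NEUR", "NRL", "FNKT"],
       ["CHR", "INT", "RONT"]]
for (a, b), n in directly_follows(log).items():
    print(f"{a} -> {b}: {n}")
```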
3 Knowledge Use and Transfer
For a better coordination of patients within hospitals, new multi-disciplinary units have to be created, where different specialisms coordinate the treatment of specific groups of patients. According to our results, two units can be created: for "moderately complex" cases, a unit consisting of CHR, CRD, INT, NRL and NEUR would suffice; "complex" cases need additional specialisms such as OGH, LNG and ADI. The discovered process models show that both processes contain parallel tasks. When treating multi-disciplinary patients, many possible combinations of specialisms appear in the log, due to a complex diagnosis process. Without patient clustering it would be difficult to distinguish between patients, which leads to unnecessary resource spending (e.g., for the "moderately complex" cluster). Concerning the content of knowledge, it is possible to analyze, model and reorganize the hospital logistic processes in combination with the knowledge that can be distilled from electronic patient records (EPRs) (concerning illnesses, treatments, pharmaceutics, medical histories) and from diagnosis treatment combinations (DTCs). DTCs are integrated in such a way that hospitals can predict, within ranges, how long a patient with a certain illness will remain in the hospital.
The first issue concerning knowledge types is that much sensory knowledge of the medical domain - the behavioral activities of doctors and nurses - has to be converted into coded knowledge. EPRs and DTCs are designed from theoretical knowledge about the medical domain. The integration of sensory and theoretical knowledge via codes continues, although with increasing resistance from the medical profession. The second issue concerns the distillation of the sensory, coded and theoretical knowledge contained in EPRs and DTCs, which is about the medical domain, into sensory and theoretical knowledge of the logistic processes. The already mentioned clustering within the logistic processes and the methodology we explained can be of help, but the transfer of sensory and theoretical knowledge from one content domain to a completely different content domain is very difficult. On the one hand, the knowledge that triggers important decisions should be transparent, understandable and easy to check by all involved parties; reorganizing an institution implies difficult decisions and high costs. Therefore, our approach provides robust theoretical knowledge of a highly abstract kind about the hospital logistic process. On the other hand, the newly developed coded and theoretical knowledge, once transferred to people, may even lead to new developments in behavior.
4 Conclusions and Further Research
In this paper we proposed a KM approach for redesigning a business process, employing the knowledge categorization defined by van Heusden & Jorna [5]. Our strategy focuses on (i) knowledge creation and (ii) knowledge use and transfer. We illustrated our approach by considering the treatment process of multi-disciplinary patients, who require the involvement of different specialisms for their treatment. The problem is to provide knowledge for reorganizing the care for these patients by creating new multi-disciplinary units. First, we identified patient groups in need of multi-disciplinary care. Clustering the complexity measures obtained by converting raw data into coded knowledge resulted in new pieces of knowledge: two patient clusters, "complex" and "moderately complex". Second, we found the relevant specialisms that will constitute the ingredients of the multi-disciplinary units. We converted coded knowledge into theoretical knowledge by building instance graphs for a selection of "moderately complex" patients, and Petri net process models for the treatment of "complex" and "moderately complex" patients. This theoretical knowledge provides insights into the logistic process and supports the reorganization of the care process. Our approach is meant to be more patient-oriented, in the sense of reducing redundant and overlapping diagnostic activities, which will consequently decrease the time spent in the hospital and shorten the waiting lists. As future research, we will concentrate on EPRs and DTCs. These kinds of data structures will offer more possibilities for improving the logistic processes and can be used to inform the patient better about medical issues and treatment sequences.
References
1. Davenport, T., Short, J.: The New Industrial Engineering: Information Technology and Business Process Redesign. Sloan Management Review 31 (1990) 11-27
2. El Sawy, O., Josefek, R.A., Jr.: Business Process as Nexus of Knowledge. In: Handbook of Knowledge Management. International Handbooks on Information Systems. Springer (2003) 425-438
3. Boisot, M.: Information Space: A Framework for Learning in Organizations, Institutions and Culture. Routledge (1995)
4. Jorna, R.: Knowledge Representations and Symbols in the Mind. Stauffenburg Verlag, Tübingen (1990)
5. van Heusden, B., Jorna, R.: Toward a Semiotic Theory of Cognitive Dynamics in Organizations. In: Liu, K., Clarke, R.J., Andersen, P.B., Stamper, R.K. (eds.): Information, Organisation and Technology. Studies in Organizational Semiotics. Kluwer, Amsterdam (2001)
6. Măruşter, L., Weijters, A., de Vries, G., van den Bosch, A., Daelemans, W.: Logistic-based Patient Grouping for Multi-disciplinary Treatment. Artificial Intelligence in Medicine 26 (2002) 87-107
7. van der Aalst, W., van Dongen, B., Herbst, J., Măruşter, L., Schimm, G., Weijters, A.: Workflow Mining: A Survey of Issues and Approaches. Data and Knowledge Engineering 47 (2003) 237-267
8. Măruşter, L.: A Machine Learning Approach to Understand Business Processes. PhD thesis, Eindhoven University of Technology, The Netherlands (2003)
9. van Dongen, B., van der Aalst, W.: Multi-phase Process Mining: Building Instance Graphs. In: Atzeni, P., Chu, W., Lu, H., Zhou, S., Ling, T. (eds.): Conceptual Modeling - ER 2004. LNCS 3288, Springer-Verlag (2004) 362-376
10. Process Mining: http://tmitwww.tm.tue.nl/research/processmining/ (2004)
Web Mining Techniques for Automatic Discovery of Medical Knowledge
David Sánchez and Antonio Moreno
Department of Computer Science and Mathematics, University Rovira i Virgili (URV), Avda. Països Catalans, 26, 43007 Tarragona (Spain)
{david.sanchez, antonio.moreno}@urv.net
Abstract. In this paper we propose an automatic and autonomous methodology for discovering taxonomies of terms from the Web and organizing the retrieved web documents into a meaningful structure. Moreover, a new method for discovering lexicalizations and synonyms is also introduced. The obtained results can be very useful for easing access to web resources in any medical domain or for creating ontological representations of knowledge.
1 Introduction
The World Wide Web is an invaluable tool for researchers, information engineers, health care companies and practitioners for retrieving knowledge. However, the extraction of information from web resources is a difficult task due to their unstructured definition, their untrusted sources and their dynamically changing nature. To tackle these problems, we base our proposal on the redundancy of information that characterizes the Web, which allows us to detect important concepts and relationships for a domain through statistical analysis. Moreover, the exhaustive use of web search engines is a great help for selecting "representative" resources and obtaining global statistics of concepts; they can be considered our particular "experts". In this paper we present a methodology for extracting knowledge from the Web in order to automatically build structured representations, in the form of taxonomies of concepts and web resources, for a domain. Moreover, if named entities for a discovered concept are found, they are considered instances. During the building process, the most representative web sites for each subclass or instance are retrieved and categorized according to the topic covered. With the final hierarchy, an algorithm for discovering different lexicalizations and synonyms of the domain keywords is also performed, which can be used to widen the search. A prototype has been implemented and tested in several knowledge areas (results for the Disease domain are included). The rest of the paper is organised as follows: Section 2 describes the methodology developed to build the taxonomy and classify web sites. Section 3 describes a new approach for discovering lexicalizations and synonyms of terms. Section 4 explains the way of representing the results and discusses the evaluation with respect to other systems. The final section contains the conclusions and proposes lines of future work.
2 Taxonomy Building Methodology
The algorithm is based on analysing a significant number of web sites in order to find important concepts by studying the neighbourhood of an initial keyword. Concretely, in the English language, the word immediately before a keyword frequently classifies it (expressing a semantic specialization of its meaning), whereas the word immediately after it represents the domain where it is applied [5]. More concretely, the method queries a web search engine with an initial keyword (e.g. disease) and analyses the obtained web sites to retrieve the previous (e.g. heart disease) and posterior words (e.g. disease treatment). Previous words are analysed to distinguish whether they should be considered future subclasses or concrete instances in the taxonomy. In order to perform this differentiation, we take into consideration the way in which named entities are presented: the English language distinguishes named entities through capitalisation [5]. More in detail, the method analyses the first web sites returned by the search engine when querying the instance candidates and counts the number of times they are written with upper- and lower-case initial letters. Those which are presented in capital letters in most cases are selected as instances (see some examples in Table 1).

Table 1. Examples of named entities (instances) found for several classes of the obtained taxonomy for the Disease domain (100 web documents evaluated for each candidate)

| Class | Instance | Full name | Conf. |
|---|---|---|---|
| Chronic | Wisconsin | Wisconsin Chronic Disease Program | 92.59 |
| Lyme | American | American Lyme Disease Foundation | 81.69 |
| Lyme | California | California Lyme Disease Association | 83.33 |
| Lyme | Connecticut | Connecticut Lyme Disease Coalition | 100.0 |
| Lyme | Yale | Yale University Lyme Disease Clinic | 87.23 |
| Mitochondrial | European | European Mitochondrial Disease Network | 100.0 |
| Parkinson | American | American Parkinson Disease Association | 87.5 |
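The confidence values in the last column can be read as capitalisation ratios; a minimal sketch of this test (ours, with page retrieval replaced by a list of already fetched documents) is:

```python
# Sketch of the instance test: how often is a candidate term capitalised?
import re

def capitalisation_confidence(candidate, documents):
    """Percentage of occurrences of `candidate` with a capital initial."""
    upper = lower = 0
    for doc in documents:
        upper += len(re.findall(r"\b" + candidate.capitalize() + r"\b", doc))
        lower += len(re.findall(r"\b" + candidate.lower() + r"\b", doc))
    total = upper + lower
    return 100.0 * upper / total if total else 0.0

docs = ["The American Lyme Disease Foundation was founded...",
        "the American Parkinson Disease Association reports...",
        "an american study of the disease..."]
print(capitalisation_confidence("american", docs))
# ~66.7: mostly capitalised, so "american" is selected as an instance
```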
Regarding subclass candidates, a statistical analysis is performed, taking into consideration their relevance on the whole Web for the specific domain. Concretely, the web search engine is queried with the new concepts and relationships found (e.g. "heart disease") and the estimated number of hits returned is evaluated in relation to the initial keyword's measure (the hits of "disease"). This web-scale statistic, combined with other local values obtained from the analysis of individual web resources, is used to obtain a relevance measure for each candidate and to reject extreme cases (very common words or misspelled ones). Only the most relevant candidates are finally selected (see Fig. 1). The selection threshold is automatically computed as a function of the generality of the domain itself. For each new subclass, the process is repeated recursively to create deeper-level subclasses (e.g. coronary heart disease). So, at each iteration, a new and more specific set of relevant documents for the subdomain is retrieved. Each final component of the taxonomy (classes and instances) stores the set of URLs from which it has been selected. The posterior words of the keyword are used to categorize the set of URLs associated to each class (see Fig. 2). For example, if we find that, for a URL associated to the class heart disease, this keyword is followed by the word prevention, the URL will be categorized under this domain of application.
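The web-scale statistic can be sketched as a simple hit-count ratio; in the sketch below `search_hits` is a placeholder to be wrapped around a real search engine API, and the thresholds are invented:

```python
# Sketch of the relevance filter for subclass candidates.
def search_hits(query: str) -> int:
    raise NotImplementedError("wrap a web search engine here")  # placeholder

def relevance(candidate: str, keyword: str) -> float:
    """Hits of the composed term relative to the root keyword,
    e.g. hits('"heart disease"') / hits('"disease"')."""
    return search_hits(f'"{candidate} {keyword}"') / search_hits(f'"{keyword}"')

def select_subclasses(candidates, keyword, lo=1e-6, hi=0.5):
    # reject extreme cases: misspellings (too rare), common words (too frequent)
    return [c for c in candidates if lo < relevance(c, keyword) < hi]
```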
Fig. 1. Disease taxonomy visualized in Protégé: numbers are class identifiers
Fig. 2. Examples of categorized URLs for the Heart subclasses of Disease
When dealing with polysemic domains (e.g. virus), a semantic disambiguation algorithm is applied in order to group the obtained subclasses according to the specific word sense. The methodology clusters those classes according to the amount of overlap between their associated URL sets. Finally, a refinement is performed to obtain a more compact taxonomy: classes and subclasses that share a large proportion of their URLs are merged, because we consider them closely related. For example, the hierarchy "jansky->bielschowsky" results in "jansky_bielschowsky", automatically discovering a multiword term.
3 Lexicalizations and Synonyms Discovery
A very common problem of keyword-based web indexing is the use of different names to refer to the same entity (lexicalizations and synonyms). The goal of a web search engine is to retrieve relevant pages for a given topic determined by a keyword, but if a text does not contain this specific word with the same spelling as specified, it will be ignored. So, in some cases, a considerable number of relevant resources is omitted. In this sense, not only the different morphological forms of a given keyword are important, but also its synonyms and aliases.
We have developed a novel methodology for discovering lexicalizations and synonyms using the obtained taxonomy and, again, a web search engine. Our approach is based on considering the longest branches of our hierarchy (e.g. atherosclerotic coronary heart disease) as a contextualization constraint and using them as the search query, omitting the initial keyword (e.g. "atherosclerotic coronary heart"), in order to obtain new web documents. In some cases, those documents will contain, immediately after the queried phrase, words equivalent to the main keyword that can be candidate lexicalizations or synonyms (e.g. atherosclerotic coronary heart disorder). From the list of candidates, those obtained from a significant number of multiword terms are selected, as this implies that they are commonly used in the particular domain. For example, for the Disease domain, some alternative forms discovered are: diseases, disorder(s), syndrome and infection(s).
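A sketch of this discovery step (ours; `fetch_snippets` stands in for a real search engine call) queries the contextualizing phrase and counts the words found immediately after it:

```python
# Sketch: discover lexicalizations/synonyms from a long taxonomy branch.
import re
from collections import Counter

def fetch_snippets(query: str) -> list:
    raise NotImplementedError("wrap a web search engine here")  # placeholder

def synonym_candidates(branch: str, keyword: str):
    """branch: e.g. 'atherosclerotic coronary heart disease'."""
    context = branch.replace(keyword, "").strip()  # 'atherosclerotic coronary heart'
    counts = Counter()
    for snippet in fetch_snippets(f'"{context}"'):
        for m in re.finditer(re.escape(context) + r"\s+(\w+)", snippet, re.I):
            word = m.group(1).lower()
            if word != keyword:                    # e.g. 'disorder', 'syndrome'
                counts[word] += 1
    return counts.most_common()
```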
4 Ontology Representation and Evaluation
The final hierarchy is stored in a standard representation language: OWL. The Web Ontology Language is a semantic markup language for publishing and sharing ontologies on the World Wide Web [2]. Regarding the evaluation of taxonomies, we have evaluated the first level of the taxonomies for different sizes of the search, computing in each case the precision and the number of correct results obtained. These measures have been compared to the ones obtained by a human-made directory (Yahoo) and to the automatic classifications performed by web search engines (Clusty and AlltheWeb). For the Disease domain (see Fig. 3), comparing to Yahoo, we see that although its precision is the highest, as it has been made by humans, the number of results is quite limited. The automatic search tools have offered good results in terms of precision, but with a very limited number of correct concepts obtained.
[Figure: two panels - (a) Disease test: precision (%) against the number of pages analysed; (b) Disease test: number of correct results against the number of pages analysed; curves for the proposed method, Clusty, AlltheWeb and Yahoo.]
Fig. 3. Evaluation of results for the Disease domain for different sizes of web corpus
For the named entities discovered, we have compared our results to the ones returned by a named-entity detection package trained for the English language that is able to detect with high accuracy word patterns such as organisations, persons, and locations. For illustrative purposes, for the Disease domain we have obtained a precision of 91%.
5 Related Work, Conclusions and Future Work
Several authors have worked on the discovery of taxonomical relationships from texts and, more concretely, from the Web. In [1, 6] the authors propose approaches that rely on previously obtained knowledge and semantic repositories. More closely related, the authors of [4] try to enrich previously obtained semantic structures using web-scale statistics. Other authors using this approach are [3], who learn taxonomic relationships, and [7], for synonym discovery. In contrast, our proposal does not start from any kind of predefined knowledge and, in consequence, it can be applied to domains that are not typically considered in semantic repositories. Moreover, the automatic and autonomous operation eases the updating of the results in highly evolving domains. In addition to the advantage of the hierarchical representation of web sites for easing access to web resources, the taxonomy is a valuable element for building machine-processable representations such as ontologies. This last point is especially important due to the necessity of ontologies for achieving interoperability in many knowledge-intensive environments like the Semantic Web [1]. As future work, we plan to study the discovery of non-taxonomical relationships through the analysis of relevant sentences extracted from web resources (text nuggets), to design forms of automatic evaluation of the results, and to study the evolution of a domain through several executions of the system in different periods of time in order to extract high-level valuable information from the changes detected in the results.
Acknowledgements This work has been supported by the "Departament d'Universitats, Recerca i Societat de la Informació" of Catalonia.
References
1. Agirre, E., Ansa, O., Hovy, E., Martinez, D.: Enriching very large ontologies using the WWW. In: Workshop on Ontology Construction (ECAI-00) (2000)
2. Berners-Lee, T., Hendler, J., Lassila, O.: The Semantic Web. Available at: http://www.sciam.com/2001/0501issue/0501berners-lee.html
3. Cimiano, P., Staab, S.: Learning by Googling. SIGKDD 6(2), 24–33 (2004)
4. Etzioni, O., Cafarella, M., Downey, D., Kok, S., Popescu, A., Shaked, T., Soderland, S., Weld, D.: Web-Scale Information Extraction in KnowItAll. In: WWW2004, USA (2004)
5. Grefenstette, G.: SQLET: Short Query Linguistic Expansion Techniques. In: Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology, LNAI 1299, chapter 6, 97–114. Springer, SCIE-97, Italy (1997)
6. Navigli, R., Velardi, P.: Learning Domain Ontologies from Document Warehouses and Dedicated Web Sites. Computational Linguistics 30(2) (June 2004)
7. Turney, P.D.: Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. In: Proceedings of the Twelfth European Conference on Machine Learning (2001)
Resource Modeling and Analysis of Regional Public Health Care Data by Means of Knowledge Technologies

Nada Lavrač1,2, Marko Bohanec1,3, Aleksander Pur4, Bojan Cestnik5,1, Mitja Jermol1, Tanja Urbančič2,1, Marko Debeljak1, Branko Kavšek1, and Tadeja Kopač6

1 Jožef Stefan Institute, Jamova 39, SI-1000 Ljubljana, Slovenia
2 Nova Gorica Polytechnic, Nova Gorica, Slovenia
3 University of Ljubljana, Faculty of Administration, Ljubljana, Slovenia
4 Ministry of the Interior, Štefanova 2, Ljubljana, Slovenia
5 Temida, d.o.o., Ljubljana, Slovenia
6 Public Health Institute, Celje, Slovenia
Abstract. This paper proposes a selection of knowledge technologies for health care planning and decision support in regional-level management of Slovenian public health care. Data mining and statistical techniques were used to analyze databases collected by a regional Public Health Institute. Specifically, we addressed the problem of directing patients from primary health care centers to specialists. Decision support tools were used for resource modeling in terms of the availability and accessibility of public health services for the population. Specifically, we analyzed organisational aspects of public health resources in one Slovenian region (Celje) with the goal of identifying the areas that are atypical in terms of availability and accessibility of public health services.
1 Introduction

Regional Public Health Institutes (PHIs) are an important part of the system of public health care in Slovenia. They are coordinated by the national Institute of Public Health (IPH) and implement actions aimed at maintaining and improving public health in accordance with the national policy, whose goal is an efficient health care system with a balanced distribution of resources across the regions of Slovenia. The network of regional PHIs, coordinated by the national IPH, collects large amounts of data and knowledge that need to be efficiently stored, updated, shared, and transferred to support better decision making. These problems can be solved efficiently by appropriate knowledge management [6], which is recognized as the main paradigm for successful management of networked organizations [2, 3]. Knowledge management can be supported by the use of knowledge technologies, in particular by data mining and decision support [4]. This paper describes applications of data mining and decision support in public health care, carried out for the Celje region PHI within a project called MediMap. The main objective was to set up appropriate models and tools to support decisions concerning regional health care, which could later serve as a reference model for other
regional PHIs. We approached this goal in two phases: first, we analyzed the available data with data mining techniques, and second, we used the results of data mining for a more elaborate study using decision support techniques. In the first phase we focused on the problem of directing patients from primary health care centers to specialists. In the second phase we studied organisational aspects of public health resources in the Celje region with the goal of identifying the areas that are atypical in terms of availability and accessibility of public health services. Some of the results achieved are outlined in Sections 2 and 3, respectively. Section 4 provides the summary and conclusions.
2 Selected Results of Statistical and Data Mining Techniques

In order to model the Celje regional health care system, we first wanted to better understand the health resources and their connections in the region. For that purpose, data mining techniques were applied to the data of eleven community health centers. The dataset was formed from three databases:
− the health care providers database,
− the out-patient health care statistics database (patients' visits to general practitioners and specialists, diseases, human resources and availability), and
− the medical status database.
To model the processes of a particular community health center (the patient flow), data describing the directing of patients to other community health centers or specialists were used. Similarities between community health centers were analyzed according to four different categories: (a) patients' age, (b) patients' social status, (c) the organization of community health centers, and (d) their employment structure. For each category, similarity groups were automatically constructed using four different clustering methods. Averages over the four clustering methods per category were used to detect similarities between the community health centers of the Celje region. The similarities of community health centers were presented to and evaluated by PHI Celje domain experts, as described in more detail in [5]. An illustration of clusters, generated by agglomerative hierarchical clustering using Ward's similarity measure, based on the four categories (a)-(d), is given in Figure 1. In order to obtain descriptions of particular groups, a decision tree learning technique was used. The decision tree in Figure 1 shows that the most informative attribute distinguishing between the groups of health centers is category (a) - the age of patients. Community health centers in which pre-school children (PRED) constitute more than 1.41 % of all visits to the center form Cluster 1 (consisting of seven health centers). The experts' explanation is that these centers lack specialized pediatrician services, hence pre-school children are frequently treated by general practitioners.

Fig. 1. Left: Results of hierarchical clustering of community health centers. Right: A decision tree representation of the top two clusters, offering an explanation for the grouping
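A minimal sketch of this two-step analysis, using scipy/scikit-learn as stand-ins for the (unnamed) tools used in the study; the feature matrix here is random stand-in data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.tree import DecisionTreeClassifier

# one row per community health center; columns = similarity scores
# for categories (a)-(d) (random stand-in values for illustration)
rng = np.random.default_rng(0)
X = rng.random((11, 4))
categories = ["(a) age", "(b) social status", "(c) organization", "(d) employment"]

# agglomerative hierarchical clustering with Ward's similarity measure
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")  # cut into the top two clusters

# a shallow decision tree explains what separates the two clusters
tree = DecisionTreeClassifier(max_depth=1).fit(X, labels)
print("most informative category:", categories[tree.tree_.feature[0]])
```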
3 Selected Results of Decision Support Techniques

The key issues of the second phase of the project were the detection of outliers and the analysis of causes for different availability and accessibility of services. Of the several analyses carried out, two are described in this section: (1) analysis of pharmacy capacities, and (2) analysis of the availability of public health care resources.

Analysis of Pharmacy Capacities explored the regional distribution of pharmacies and the detection of atypical cases. Capacity deviations were observed by comparing the demand (population) with the capacity of pharmacies, where the capacity computation is the result of a weighted average of several parameters: infrastructure (evaluated by the number of pharmacies) and human resources (evaluated by the number of pharmacists of low and high education level and the number of other staff). The hierarchical multi-attribute decision support system DEXi [1] was used for the evaluation of composed criteria from the values of low-level criteria. The hierarchical structuring of decision criteria by DEXi is shown in Figure 2. Qualitative aggregation of the values of criteria is based on rules induced from human-provided cases of individual pharmacy evaluations. Figure 3 shows the capacity deviations of pharmacies in the Celje region, where the X axis denotes the actual number of people (city inhabitants), and the Y axis is the capacity of pharmacies. Atypical pharmacies are those deviating from the diagonal (Žalec and Velenje), while the other five regional pharmacies (Celje, Šmarje and three others in the lower-left corner of Figure 3) are balanced in terms of the requirement-capacity tradeoff.

Attribute                        Scale
Appropriateness of pharmacies    over-developed; appropriate; under-developed
  Expected requirements          low; decreased; medium; increased; high
    Number of inhabitants        15000-25000; 25000-35000; 35000-45000; 45000-55000; 55000-65000
  Capacity of pharmacies         inadequate; adequate; good; v_good; excellent
    Infrastructure               basic; good; excellent
      Number of pharmacies       to_4; 4-8; 9_more
    Human resources              inadequate; adequate; good; excellent
      Pharmacists, Low education 1; 2
      Pharmacists, High education to_20; 20-60; 61-100; 101_more
      Other staff                to_2; 2-10; 11_more

Fig. 2. The structure and value scales of criteria for the evaluation of the appropriateness of pharmacies in terms of requirements and capacity
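DEXi aggregates qualitative values bottom-up through rule tables rather than numeric weights. The fragment below illustrates the idea for the Human resources subtree; the rule entries themselves are invented for illustration (DEXi induces the full table from expert-graded example cases, which are not published in the paper).

```python
# Illustrative DEXi-style qualitative aggregation: a composed criterion
# is looked up from the qualitative values of its sub-criteria.
HUMAN_RESOURCES_RULES = {
    # (pharmacists low educ., pharmacists high educ., other staff) -> grade
    ("1", "to_20", "to_2"):     "inadequate",
    ("1", "20-60", "2-10"):     "adequate",
    ("2", "20-60", "2-10"):     "good",
    ("2", "61-100", "11_more"): "excellent",
}

def human_resources(low: str, high: str, other: str) -> str:
    # unknown combinations would be resolved by the induced rule base;
    # here we fall back to a neutral grade
    return HUMAN_RESOURCES_RULES.get((low, high, other), "adequate")

print(human_resources("2", "20-60", "2-10"))  # -> good
```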
[Figure: pharmacy capacity (Y axis: inadequate … excellent) plotted against expected requirements (X axis: low … high) for Celje, Šmarje, Velenje, Žalec, Laško, Mozirje, Sevnica and Šentjur]
Fig. 3. Analysis of pharmacy capacity deviations in the Celje region
Analysis of the Availability of Public Health Care Resources was carried out to detect local communities that are underserved with respect to general-practice health care. We evaluated 34 local communities in the Celje region according to the criterion AHSPm – the Availability of Health Services (measured in hours) for Patients from a given community (c), considering the Migration of patients into neighbouring communities – computed as follows:

AHSPm = (1 / pc) Σi ai pci     (1)

In the AHSPm computation, ai is the available time of health care service (i) per person, defined as the ratio of the total working time of health care services to the total number of visits, pc is the total number of accesses of patients from community (c), and pci is the number of accesses of patients from community (c) to health service (i).
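A direct transcription of formula (1), with illustrative (invented) input figures:

```python
def ahsp_m(avail_time, accesses):
    """Availability of health services for patients from community c (Eq. 1).

    avail_time[i] -- a_i, available time of health service i per person
                     (total working time / total number of visits)
    accesses[i]   -- pc_i, number of accesses of patients from community c
                     to health service i
    """
    pc = sum(accesses.values())  # total number of accesses from community c
    return sum(avail_time[i] * accesses[i] for i in accesses) / pc

# illustrative figures only, not taken from the Celje data
print(ahsp_m({"gp": 0.25, "pediatrics": 0.40},
             {"gp": 1200, "pediatrics": 300}))
```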
Fig. 4. Availability of health services for patients from the communities in the Celje region, considering the migration of patients to neighboring communities (AHSPm) for year 2003
The evaluation of Celje communities in terms of AHSPm is shown in Figure 4. The color of communities depends on the availability of health services for patients. Darker communities have higher health care availability for the population. This representation was well accepted by the experts, as it clearly shows the distribution of the availability of health care resources for patients from the communities.
4 Conclusions In the MediMap project we have developed methods and tools that can help regional Public Health Institutes and the national Institute of Public Health to perform their tasks more effectively. The knowledge technologies tools developed for the reference case of the PHI Celje were applied to selected problems related to health care organization, the accessibility of health care services to the citizens, and the health care providers network. The main achievement of applying decision support methods to the problem of planning the development of public health services was the creation of the model of availability and accessibility of health services to the population of a given area. With this model it was possible to identify the regions that differ from the average and to consequently explain the causes for such situations. The developed methodology and tools have proved to be appropriate to support the PHI Celje, serving as a reference model for other regional PHIs.
Acknowledgements We acknowledge the financial support of the Public Health Institute Celje, the Slovenian Ministry of Education, Science and Sport, and the 6FP integrated project ECOLEAD (European Collaborative Networked Organizations Leadership Initiative).
References
1. Bohanec, M., Rajkovič, V.: DEX: An Expert-System Shell for Decision Support. Sistemica, Vol. 1, No. 1 (1990) 145–157; see also http://www-ai.ijs.si/MarkoBohanec/dexi.html
2. Camarinha-Matos, L.M., Afsarmanesh, H.: Elements of a Base VE Infrastructure. J. Computers in Industry, Vol. 51, No. 2 (2003) 139–163
3. McKenzie, J., van Winkelen, C.: Exploring E-collaboration Space. Henley Knowledge Management Forum (2001)
4. Mladenić, D., Lavrač, N., Bohanec, M., Moyle, S. (eds.): Data Mining and Decision Support: Integration and Collaboration. Kluwer (2003)
5. Pur, A., Bohanec, M., Cestnik, B., Lavrač, N., Debeljak, M., Kopač, T.: Data Mining for Decision Support: An Application in Public Health Care. Proc. of the 18th International Conference on Industrial & Engineering Applications of Artificial Intelligence & Expert Systems, Bari (2005), in press
6. Smith, R.G., Farquhar, A.: The Road Ahead for Knowledge Management: An AI Perspective. AI Magazine, Vol. 21, No. 4 (2000) 17–40
An Evolutionary Divide and Conquer Method for Long-Term Dietary Menu Planning

Balázs Gaál, István Vassányi, and György Kozmann

University of Veszprém, Department of Information Systems, Egyetem u. 10, H-8201 Veszprém, Hungary
Abstract. We present a novel Hierarchical Evolutionary Divide and Conquer method for automated, long-term planning of dietary menus. Dietary plans have to satisfy multiple numerical constraints (Reference Daily Intakes and balance on a daily and weekly basis) as well as criteria on the harmony (variety, contrast, color, appeal) of the components. Our multi-level approach solves problems via the decomposition of the search space and uses good solutions for sub-problems on higher levels of the hierarchy. Multi-Objective Genetic Algorithms are used on each level to create nutritionally adequate menus with a linear fitness combination extended with rule-based assessment. We also apply case-based initialization for starting the Genetic Algorithms from a better position of the search space. Results show that this combined strategy can cope with strict numerical constraints in a properly chosen algorithmic setup.
1 Introduction
The research in the field of computer-aided nutrition counseling began with a linear programming approach for optimizing menus for nutritional adequacy and cost, developed by Balintfy in 1964 [1]. In 1967 Eckstein used random search to satisfy nutritional constraints for meals matching a simple meal pattern [2]. Artificial intelligence methods, mostly using Case-Based Reasoning (CBR) combined with other techniques, were developed later [3, 4]. A hybrid system, CAMPER, has been developed recently, using CBR and Rule-Based Reasoning for menu planning [5]; it integrates the advantages of two other independent implementations, the case-based menu planner (CAMP) and PRISM [6]. A more recent CBR approach is MIKAS, menu construction using an incremental knowledge acquisition system, developed to handle the menu design problem for an unanticipated client. The system allows the incremental development of a knowledge base for menu design [7]. Human professionals still surpass computer algorithms in the quality of the generated dietary plans, but computer-based methods can be used on demand and in unlimited quantities. Furthermore, with the computational potential of today's computers, the gap between human experts and algorithmic methods can be narrowed in this particular field.
The work presented in this paper deals with the problem of planning weekly menus, made up of dishes stored in a database and used to compose meals, daily plans and the weekly plan.
2 Objectives
The evaluation of a dietary plan has at least two aspects. Firstly, we must consider the quantity of nutrients. There are well-defined constraints for the intake of nutrient components such as carbohydrate, fat or protein, which can be computed given one's age, gender, body mass, physical activity and diseases. Optimal and extreme values can be specified for each nutrient component. The nutritional allowances are given as Dietary Reference Intakes [8]. As for quantity, the task of planning a meal can be formulated as a constraint satisfaction and optimization problem. Secondly, the harmony of the meal's components should be considered: plans satisfying nutrition constraints also have to be appetizing. By common sense, some combinations of dishes or nutrients do not appeal in the way others do. In our approach, this common sense of taste and cuisine may be described by simple rules recording the components that do or do not go together. We must cope with conflicting numerical constraints and harmony rules. The fact base of the proposed method and our software tool MenuGene is a commercial nutritional database widely used in Hungarian hospitals, which at present contains the recipes of 569 dishes with 1054 ingredients. The nutrients contained in the database for each ingredient are energy, protein, fat, carbohydrates, fiber, salt, water, and ash.
3 Methods
In order to satisfy the numerical constraints and the requirements on harmony at each nutritional level, we divide the weekly menu planning problem into subproblems (daily menu planning and meal planning). Genetic Algorithms (GAs) are run on these levels, driven by our C++ framework GSLib, which is also responsible for scheduling the multi-level evolution process. A similar multi-level multi-objective GA was proposed and tested in [9]. On each level, GSLib optimizes a population of solutions (individuals) composed of attributes (genes). We also maintain a Case-Base for storing dietary plans made by experts or created through adaptation. Whenever a new dietary plan is requested, the Case-Base is searched for plans that were made with similar objectives regarding quantity and harmony. In our abstract framework, a solution, composed of its attributes, represents one possible solution of the problem. An attribute represents either a solution found by a GA, a solution selected from a POOL, or a predefined SINGLE value. Example solutions and an outline of the hierarchical structure are shown in Tables 1 and 2, respectively.
Table 1. Example solutions and their attributes for meal and dish (pcs: pieces)

Solution for a typical lunch:
attribute    minimal   default   maximal
a_soup       1 pc      1 pc      1 pc
a_garnish    1 pc      1 pc      1 pc
a_redmeat    1 pc      1 pc      2 pcs
a_drink      1 pc      2 pcs     3 pcs
a_dessert    1 pc      1 pc      2 pcs

Solution for a garnish (mashed potato):
attribute    minimal   default   maximal
a_milk       120 g     120 g     120 g
a_potato     357.1 g   357.1 g   357.1 g
a_butter     5 g       5 g       5 g
a_salt       5 g       5 g       5 g
Table 2. Solutions and their Attributes in the Multi-Level Hierarchy

solution       attribute    attribute type
weekly plan    daily plan   GA
daily plan     meal plan    GA
meal plan      dish         POOL
dish           foodstuff    SINGLE
foodstuff      nutrient     SINGLE
nutrient       –            –
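One possible in-memory representation of this hierarchy (illustrative only; the actual GSLib data structures are in C++ and not published):

```python
from dataclasses import dataclass, field
from enum import Enum

class AttrType(Enum):
    GA = "evolved by a genetic algorithm on the level below"
    POOL = "chosen from a predefined pool"
    SINGLE = "fixed, never mutated"

@dataclass
class Solution:
    name: str
    attributes: list = field(default_factory=list)  # (child Solution, AttrType)

# the multi-level hierarchy of Table 2, built bottom-up
nutrient  = Solution("nutrient")
foodstuff = Solution("foodstuff",   [(nutrient, AttrType.SINGLE)])
dish      = Solution("dish",        [(foodstuff, AttrType.SINGLE)])
meal      = Solution("meal plan",   [(dish, AttrType.POOL)])
day       = Solution("daily plan",  [(meal, AttrType.GA)])
week      = Solution("weekly plan", [(day, AttrType.GA)])
```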
In each iteration of a population, solutions are recombined and mutated. Recombination involves two solutions and exchanges their attributes from a randomly selected crossover point. If the attribute is of GA type, mutation either fires the evolution process on the population connected to the attribute (mutation-based scheduling), or reinitializes the population. If the attribute is of POOL type, mutation randomly chooses a solution from its pre-defined pool. No mutation occurs on SINGLE type attributes. Other possible scheduling alternatives include the top-down and the bottom-up strategies, which run the uppermost/lowest level populations for a given number of iteration steps and then continue on the lower/upper level. Each solution maintains a list storing its underlying hierarchy. For example, a solution for a daily plan knows how much protein, carbohydrate or juice it may contain. Thus, a solution can be assessed by processing its whole underlying structure. Numerical constraints (minimal, optimal and maximal values) are given by the nutrition expert for solutions (nutrients, foodstuffs) and solution sets on a weekly basis. These constraints are propagated proportionally to the lower levels. We calculate a non-linear, second-order penalty function (see Fig. 1) for each constraint and use linear fitness combination to sum up the fitness of the solutions according to the numerical constraints. The fitness function provides a non-positive number as a grade, where zero is the best. It does not punish small deviations from the optimum (default) too much, but it is strict on values that are not in the interval defined by the minimal and maximal limits. The function is non-symmetric about the optimum, because the effects of deviations from the optimal value may be different in the negative and positive directions.

Fig. 1. An example penalty function for a constraint with parameters min = 100, opt = 300, max = 700.

Fig. 2. Fitness of the best solution as a function of runtime

Rule-Based Assessment is used to modify the calculated fitness value if there are harmony rules applicable to the solution being assessed. Rules apply a penalty on the simultaneous occurrence of solutions or solution sets. Only the most appropriate rules are applied. For example, if a meal contains tomato soup and orange juice, and we have the rules r1 = ([Ssoups, Sdrinks], 70%) and r2 = ([Sorangedrinks, Sjuices], 45%), then only r2 is applied, because the set of orange drinks is more specific to orange juice than the set of drinks.
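The paper does not give the penalty function in closed form; one plausible realization with the stated properties (second order, zero at the optimum, asymmetric curvature, strict outside the limits) is:

```python
def penalty(x, lo, opt, hi):
    """Second-order, non-symmetric penalty: 0 at the optimum, mildly
    negative within [lo, hi], steeply negative outside (a sketch with
    the properties described in the text, not MenuGene's exact formula)."""
    span = (opt - lo) if x < opt else (hi - opt)  # asymmetry around opt
    return -((x - opt) / span) ** 2

# qualitative shape of Fig. 1 (min = 100, opt = 300, max = 700)
for x in (100, 250, 300, 500, 700, 900):
    print(x, round(penalty(x, 100, 300, 700), 2))
```

At the limits the grade is exactly -1 and it falls off quadratically beyond them, so out-of-range values are punished much harder than small deviations around the optimum.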
4 Results and Discussion
We found that the plans generated by our tool MenuGene satisfy numerical constraints at the level of meals, daily plans and weekly plans. The multi-level generation was tested with random and real-world data. Our tests showed that the rule-based classification method successfully omits components that do not harmonize with each other. We performed tests to explore the best algorithmic setup. First, we analyzed the effect of the crossover and mutation probabilities on the fitness. The results showed that the crossover probability does not influence the fitness much, but a mutation rate well above 10% is desirable, particularly for smaller populations. We also examined the connection between the runtime and the quality of the solution over a wide range of algorithmic setups. As Fig. 2 shows, although the quality of the solutions improves with runtime, the pace of improvement becomes very slow after a certain time, while a solution of quite good quality is already produced within this time. This means that it is enough to run the algorithm until the envelope curve starts to saturate. Using the optimal algorithmic setup, we tested how the algorithm copes with very close upper and lower limits (minimal and maximal values). Minimal and maximal values were gradually increased/decreased until they were nearly equal.
The results of ca. 150,000 test runs showed that our method is capable of generating nutritional plans even where the minimal and maximal allowed values of one or two constraints are virtually equal, and that the algorithm found a nearly optimal solution when there are three or four constraints of this kind. According to our nutritionist, there is no need for constraints with virtually equal minimal and maximal values, and even in most pathological cases the strict regulation of four parameters is sufficient.
5 Conclusions and Future Work
The paper gave a summary of a novel dietary menu planning method that uses a Hierarchical Evolutionary Divide and Conquer strategy for weekly or longer-term dietary menu plan design, with a fitness function that assesses menus according to the amount of nutrients and their harmony. Algorithmic tests revealed that, for non-pathological nutrition, the method is capable of generating long-term dietary menu plans that satisfy nutritional constraints even with very narrow constraint intervals. Further work includes the extension of the present rule set to a comprehensive rule base to ensure truly harmonious menu plans, and the integration of the service into a lifestyle counseling framework.
References
1. Balintfy, J.L.: Menu planning by computer. Commun. ACM 7 (1964) 255–259
2. Eckstein, E.F.: Menu planning by computer: the random approach. Journal of the American Dietetic Association 51 (1967) 529–533
3. Yang, N.: An expert system on menu planning. Master's thesis, Department of Computer Engineering and Science, Case Western Reserve University, Cleveland, OH (1989)
4. Hinrichs, T.: Problem solving in open worlds: a case study in design. Erlbaum, Northvale, NJ (1992)
5. Marling, C.R., Petot, G.J., Sterling, L.: Integrating case-based and rule-based reasoning to meet multiple design constraints. Computational Intelligence 15 (1999) 308–332
6. Kovacic, K.J.: Using common-sense knowledge for computer menu planning. PhD thesis, Case Western Reserve University, Cleveland, OH (1995)
7. Khan, A.S., Hoffmann, A.: Building a case-based diet recommendation system without a knowledge engineer. Artif. Intell. Med. 27 (2003) 155–179
8. Food and Nutrition Board (FNB), Institute of Medicine (IOM): Dietary reference intakes: applications in dietary planning. National Academy Press, Washington, DC (2003)
9. Gunawan, S., Farhang-Mehr, A., Azarm, S.: Multi-level multi-objective genetic algorithm using entropy to preserve diversity. In: EMO (2003) 148–161
Human/Computer Interaction to Learn Scenarios from ICU Multivariate Time Series

Thomas Guyet1, Catherine Garbay1, and Michel Dojat2

1 Laboratoire TIMC, 38706 La Tronche, France
[email protected]
http://www-timc.imag.fr/Thomas.Guyet/
2 Mixt Unit INSERM/UJF U594
Abstract. In the context of patients hospitalized in intensive care units, we would like to predict the evolution of the patient's condition. We hypothesize that human-computer collaboration could help with the definition of signatures representative of specific situations. We have defined a multi-agent system (MAS) to test this hypothesis, and preliminary results supporting our assumption are presented.
1 Introduction
Physiological monitors currently used in high-dependency environments such as Intensive Care Units (ICUs) or anaesthesia wards generate a false alarm rate that reaches approximately 86% [1]. This poor score is essentially due to the fact that alarm detection is mono-parametric and based on predefined thresholds. To really assist clinicians in diagnosis and therapy tasks, we should design intelligent monitoring systems with the capacity to predict the evolution of the patient's state and eventually to trigger alarms in case of probable critical situations. Several methods have been proposed for false alarm reduction, based on data processing algorithms (using trend calculation) [2, 3] or data mining techniques [1]. In general, the proposed methods consist of two steps: first, data processing algorithms detect relevant events in time series data, and second, information analysis techniques are used to identify well-known critical patterns. When applied to multivariate time series, these methods can be powerful if the patterns to discover are known in advance. In order to predict the evolution of the patient's state, typical high-level scenes, combinations of several patterns that are representative of specific situations, have to be formalized. Unfortunately, clinicians have some difficulties in defining such situations. We propose an interactive environment for an in-depth exploration by the clinician of the set of physiological data. The interactive environment TSW, used in [4], allows overall statistics to be collected. Our approach is based on collaborative work between a computer, which efficiently analyzes a large set of data and classifies information, and a clinician, who has the skills and pragmatic knowledge for data interpretation and drives the computerized data exploration. We assume that such a collaboration could facilitate the discovery of signatures representative of specific situations. To test this hypothesis we
have designed a collaborative multi-agent system (MAS) with the capacities of segmenting, classifying and learning.
2 Scenario Construction Based on Clinician/Computer Interaction
The philosophy of the approach is the creation of a self-organized system guided by a clinician. The system dynamically builds its own data representation for scenario discovery. The clinician interacts with the system by introducing annotations, solving ambiguous cases, or introducing specific queries. Consequently, the system should be highly adaptive. Data, information and knowledge are dynamically managed at several steps during clinician/computer interactions. Data, numeric or symbolic, are represented as multivariate time series. The whole system is an experimental workbench. The clinician can add or delete pieces of information, explore time series by varying the time granularity, use trends computed on various sliding windows, and select or combine appropriate levels of abstraction. Annotations are generally introduced by the clinician as binary information (for instance, cough (0 or 1), suction (0 or 1)). The goal of our method is to learn from time series data the most likely scenario, i.e. a set of events linked by temporal constraints, that explains a specific event. In the following, we distinguish between events, which contribute to the description of a scenario, and specific events, which are explained by scenarios. Specific events are characterized by a decision tree that is built from "instantaneous" patient states described by numeric and symbolic time series data and annotated as positive or negative examples. The decision tree is then used to find new occurrences of the same type of specific events. To explain the so-called specific events, scenarios are learned by the MAS (cf. Section 3). A clinician interacts with the MAS to guide the learning and be informed of discoveries. A short example briefly describes our method:
1) Starting from time series data, the clinician annotates some SpO2 desaturation episodes in specific regions of interest. Annotations are encoded as binary time series data.
2) From the data plus annotations, the system learns desaturation episodes, their characteristics (a tree) and a possible scenario (learned by the MAS) that explains these episodes (specific events).
3) The system, browsing a part of the time series data, searches for similar characteristics and scenarios.
4) Based on similar characteristics, new desaturation episodes can be discovered by the system. They are used as new examples for the decision tree or, if they present an atypical signature compared to the others, they are shown to the clinician for examination.
5) When similar scenarios not followed by a desaturation episode are discovered, they are provided to the clinician, who decides either to keep or reject them.
They are then used as positive or negative examples, respectively, for the learning phase.
6) The system loops to step 2 until no new scenarios or characteristics can be discovered.
Depending on its discoveries and the information provided by the clinician, the system continuously adapts its behavior. Its complex self-organisation is hidden from the clinician, who interacts with the system only via time series data.
3 A Multiagent System
MASs have good properties for implementing self-organization and reactivity to user interactions. Our general architecture was inspired by [5]. Three data abstraction levels, namely segments, classes of segments and scenarios, are constructed in parallel, with mutual and dynamic adaptations to improve the final result. These levels are associated with a numerical, a symbolic and a semantic level, respectively. The MAS ensures an equivalence between what we call specific events and classes of segments. For symbolic time series, like annotations, symbols are associated to some classes of segments. For instance, the clinician can annotate each SpO2 desaturation episode. Learning a scenario that explains this specific event is equivalent to learning a scenario that explains occurrences of a class of segments (with the corresponding symbol) in the time series "SpO2 annotations". Other classes of segments discovered by the system should be interpreted by the clinician. Three types of agents exist in the MAS. At the numerical level, reactive agents segment the time series. Each reactive agent represents a segment, i.e. a time series region bounded by two borders shared with neighboring agents. Borders move, disappear or are created dynamically through interactions between agents, based on the approximation error of an SVR (Support Vector Regression [6]) model of the segment. At the symbolic level, classification agents build classes of segments. To compare segments, we have defined a distance between SVR models that takes into account dilatation and scaling. A vocabulary is then introduced to translate time series into symbolic series. At the semantic level, learning agents build scenarios that explain classes of segments (or specific events, for clinicians). In fact, we consider a scenario that occurs before a specific event e2 as a possible explanation for e2. Consequently, learning a scenario that explains e2 consists in finding common points (a signature) in the temporal windows preceding examples of the class of e2. To find such signatures, we have adapted the algorithm presented in [7] to multivariate series inputs. This algorithm takes into account temporal constraints to aggregate frequent relevant information and to progressively suppress noisy information. The learning agent then proposes modifications (via feedback) to segmentation and classification agents by focussing on discrepancies found between the learned scenario and the examples. In this way, it improves its global confidence.
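A sketch of the per-segment SVR fit that such an error criterion could rest on (scikit-learn stands in for the authors' implementation; the kernel and hyperparameters are assumptions):

```python
import numpy as np
from sklearn.svm import SVR

def segment_error(values):
    """Approximation error of an SVR model fitted to one segment; the
    segmentation agents use errors of this kind to move, create or
    remove segment borders (a sketch, not the agents' actual code)."""
    t = np.arange(len(values)).reshape(-1, 1)
    model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(t, values)
    return float(np.mean((model.predict(t) - values) ** 2))

rng = np.random.default_rng(1)
signal = np.sin(np.linspace(0, 3, 60)) + 0.05 * rng.standard_normal(60)
print(segment_error(signal))
```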
Interactions are the key to the self-organization capability of the system and are used at three main levels:
1) Feedback that dynamically modifies, adds or deletes segmentation agents, thus leading to segmentation modifications.
2) Feedback that adjusts classification.
3) Improvement of the overall learning accuracy (discrimination power) by pointing out, between learning agents, inconsistencies in the resulting associations of scenarios and specific events.
We highlight that at the first two levels, the symbolization of each series is performed independently. However, learning agents use multivariate symbolic series to build scenarios and to deliver feedback so as to obtain global coherence.
4 Preliminary Results
We used data from ICU patients being weaned from mechanical ventilation. We selected four physiological signals (SpO2, respiratory rate, total expired volume and heart rate) sampled at 1 Hz during at least 4 hours. Each signal was preprocessed by computing sliding-window means (3 min width) and was abstracted, following a specific methodology [8], into symbolic values and trends. Moreover, we used SpO2 desaturation annotations inserted at the patient's bedside by a clinician. Consequently, each patient's recording was a multivariate time series containing 17 (4 × 4 + 1) long time series. Presently, the implementation of our MAS is partial; feedback is not fully implemented. Data acquisition is still ongoing, thus the quantity of data currently available is not sufficient to ensure robust learning. Figure 1 displays the experimental computer interface used to manage the MAS. Panel A) shows the way to access patients' data and agent classes. Panel B) shows recordings for 2 patients (P1 and P2). For each one, we display two time series, "Heart Rate data" (numeric) and "SpO2 Annotations" (symbolic).
B) "Heart Rate" and "SpO2 Annotations" time series segmentation of P1 and P2
A) P1
{
{
P2
HR SpO2 An.
t=2500s HR SpO2 An.
t=11500s
C) Cl
f
f"
ies D)
Class Id=1 (low level)
Class Id=2, (high level =Desaturation events)
Fig. 1. MAS interface. (see text for details)
Vertical bars represent the borders of segments obtained by the segmentation agents. The SpO2 annotations are segmented into 3 (for P1) and 5 (for P2) segments. The segmentation is not perfect, especially for the heart rate, but will be refined by feedback. Panel C) displays the 2 classes of segments that classify the 8 segments (only 7 are shown) of the "SpO2 Annotations" time series. In panel D), the most probable scenario that explains class 2 is indicated as a set of events linked by temporal relations. A clinician interface should be developed to mask the technical aspects of this interface and translate scenarios into comprehensible representations.
5 Conclusion and Perspectives
The system we present offers an experimental workbench to assist clinicians in the exploration of physiological data in poorly formalized domains such as ICUs. The central concept of our system is to support interactions between a self-organized MAS, with data processing, data abstraction and learning capabilities, and a clinician, who browses the provided mass of data to guide the learning of scenarios representative of the evolution of the patient's state. Preliminary results invite us to pursue the implementation and show the necessity of reinforcing the MAS capabilities to segment, classify and learn from time series data.
References
1. Tsien, C.L., Kohane, I.S., McIntosh, N.: Multiple Signal Integration by Decision Tree Induction to Detect Artifacts in the Neonatal Intensive Care Unit. Art. Intel. in Med. 19 (2000) 189–202
2. Salatian, A., Hunter, J.: Deriving Trends in Historical and Real-Time Continuously Sampled Medical Data. J. Intel. Inf. Syst. 13 (1999) 47–71
3. Lowe, A., Harrison, M.J., Jones, R.W.: Diagnostic Monitoring in Anaesthesia Using Fuzzy Trend Templates for Matching Temporal Patterns. Art. Intel. in Med. 16 (1999) 183–199
4. Hunter, J., Ewing, G., Freer, Y., Logie, F., McCue, P., McIntosh, N.: NEONATE: Decision Support in the Neonatal Intensive Care Unit – A Preliminary Report. In Dojat, M., Keravnou, E.T., Barahona, P., eds.: AIME. Volume 2780 of Lecture Notes in Computer Science, Springer-Verlag, Berlin Heidelberg New York (2003) 41–45
5. Heutte, L., Nosary, A., Paquet, T.: A Multiple Agent Architecture for Handwritten Text Recognition. Pattern Recognition 37 (2004) 665–674
6. Vapnik, V., Golowich, S., Smola, A.: Support Vector Method for Function Approximation, Regression Estimation, and Signal Processing. Adv. in Neural Inf. Proc. Sys. 9 (1997) 281–287
7. Dousson, C., Duong, T.V.: Discovering Chronicles with Numerical Time Constraints from Alarm Logs for Monitoring Dynamic Systems. In Dean, T., ed.: IJCAI, Morgan Kaufmann (1999) 620–626
8. Silvent, A.S., Dojat, M., Garbay, C.: Multi-Level Temporal Abstraction for Medical Scenarios Construction. Int. J. of Adaptive Control and Signal Processing (2005), to appear
Mining Clinical Data: Selecting Decision Support Algorithm for the MET-AP System

Jerzy Blaszczynski1, Ken Farion2, Wojtek Michalowski3, Szymon Wilk1,*, Steven Rubin2, and Dawid Weiss1

1 Poznan University of Technology, Institute of Computing Science, Piotrowo 3a, 60-965 Poznan, Poland
{jurek.blaszczynski, szymon.wilk, dawid.weiss}@cs.put.poznan.pl
2 Children's Hospital of Eastern Ontario, 401 Smyth Road, Ottawa, ON, K1H 8L1, Canada
{farion, rubin}@cheo.on.ca
3 University of Ottawa, School of Management, 136 Jean-Jacques Lussier Street, Ottawa, ON, K1N 6N5, Canada
[email protected]
* Corresponding author.
Abstract. We have developed an algorithm for triaging acute pediatric abdominal pain in the Emergency Department using a discovery-driven approach. This algorithm is embedded into the MET-AP (Mobile Emergency Triage – Abdominal Pain) system, a clinical decision support system that assists physicians in making emergency triage decisions. In this paper we describe an experimental evaluation of several data mining methods (inductive learning, case-based reasoning and Bayesian reasoning) and the results leading to the selection of the rule-based algorithm.
1 Introduction
The quest to provide better quality care to patients results in supplementing regular procedures with intelligent clinical decision support systems (CDSS) that help to avoid potential mistakes and overcome existing limitations. A CDSS is "a program designed to help healthcare professionals make clinical decisions" [1]. In our research we concentrate on systems providing patient-specific advice that use clinical decision support algorithms (CDSAs) discovered from data describing past experience. We evaluate various data mining methods for extracting knowledge and representing it in the form of a CDSA. The evaluation is based on the decision accuracy of the CDSA as the outcome measure. The CDSAs discussed in this paper aim at supporting the triage of abdominal pain in the pediatric Emergency Department (ED). Triage is one of the ED physician's functions and requires one of the following disposition decisions: discharge (discharge
and possible follow-up by a family doctor), observation (further investigation in the ED or hospital), or consult (consultation with a specialist). MET-AP [2] is a mobile CDSS that provides support at the point of care for triaging abdominal pain in the ED. It suggests the most appropriate dispositions and their relative strengths. The paper is organized as follows. We start with a short overview of the data mining methodologies used for developing CDSAs. Then, we describe the experiment and conclude with a discussion.
2 CDSA and Data Mining
One of the oldest data mining approaches applied in developing discovery-driven CDSAs is Bayesian reasoning [3]. It is a well-understood methodology that is accepted by physicians [3], as it mimics clinical intuition [4]. Bayesian reasoning was adopted in the Leeds system [5] for diagnosing abdominal pain. Other data mining techniques, including case-based reasoning [3] and inductive learning [4], have not been as widely accepted in clinical practice [4]. However, some practical applications [3, 4] of CDSAs constructed with these techniques have attracted the attention of healthcare professionals. Case-based reasoning [6] allows arriving at a recommendation by analogy. This approach was employed in the development of the CDSA for the CAREPARTNER system [7], designed to support the long-term follow-up of patients with stem-cell transplants. Inductive learning [4] represents discovered knowledge in the form of decision rules and decision trees that are later used in CDSAs. It is less often used in practice despite some successful clinical applications (e.g., for evaluating coronary disease [8]).
3 Selecting the Most Appropriate CDSA
In order to select the CDSA for the triage of abdominal pain, we designed and conducted a three-phase computational experiment. The first phase of the experiment dealt with the development of CDSAs on the basis of learning data transcribed retrospectively from the ED charts of patients with abdominal pain who visited the ED of the Children's Hospital of Eastern Ontario between 1993 and 2002 (see Table 1). Each record was described by the values of 13 clinical attributes. Charts were assigned to one of three decision classes by reviewing the full hospital documentation. The following data mining methods were evaluated: induction of decision rules (LEM2 algorithm [9]); Bayesian reasoning (naive Bayesian reasoning [10]); case-based reasoning (IBL algorithm [11]); and induction of decision trees (C4.5 algorithm [12]). During the second phase of the experiment we tested the performance of the algorithms on a new, retrospectively collected, independent testing data set (see Table 1).
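Neither LEM2 nor IBL ships with common Python libraries, so a faithful re-run needs the original implementations; as an illustration of the experimental protocol, roughly comparable scikit-learn classifiers can be evaluated the same way (naive Bayes, a nearest-neighbour classifier as an IBL-like method, and a decision tree as a C4.5-like method):

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

candidates = {
    "naive Bayesian": GaussianNB(),
    "case-based (IBL-like)": KNeighborsClassifier(n_neighbors=5),
    "tree-based (C4.5-like)": DecisionTreeClassifier(),
}

def evaluate(candidates, X_train, y_train, X_test, y_test):
    """Train on the 606 learning charts, score on the 100 testing charts."""
    for name, clf in candidates.items():
        pred = clf.fit(X_train, y_train).predict(X_test)
        print(name, accuracy_score(y_test, pred))
```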
Table 1. Learning and testing data sets
Decision class
Number of charts Learning set Testing set
Discharge Observation Consult
352 89 165
52 15 33
Total
606
100
There is a general consensus among physicians that it is important to differentiate between the classification accuracies obtained for different decision classes (high accuracy for the consult class is more important than for the discharge class). This means that the assessment and selection of the most appropriate CDSA should consider not only its overall classification accuracy, but also its performance for critical classes of patients. The results of the second phase are given in Table 2. Although all CDSAs had very similar overall accuracies, the most promising results considering the observation and consult classes were obtained for the case-based CDSA and for the rule-based CDSA. The tree-based CDSA had the same accuracy for the consult class as the rule-based one; however, its performance for the observation class was much lower. Finally, the naive Bayesian CDSA offered the highest accuracy for the discharge class, and the lowest accuracy for the consult class, which is unacceptable from the clinical point of view.

Table 2. Classification accuracy of the CDSAs

CDSA             Overall   Discharge   Observation   Consult
Rule-based       59.0%     55.8%       46.7%         69.7%
Naive Bayesian   56.0%     65.4%       20.0%         57.6%
Case-based       58.0%     57.7%       20.0%         75.8%
Tree-based       57.0%     59.6%       20.0%         69.7%
In the third phase of the experiment we tested and compared the CDSAs considering misclassification costs (in terms of penalizing undesired misclassifications). The costs, given in Table 3, were defined by the physicians participating in the study and introduced into the classification scheme. We excluded the rule-based CDSA from this phase, as its classification strategy did not allow a cost-based evaluation. The results of the third phase are given in Table 4. The most remarkable change compared to the second phase is the increased classification accuracy for the observation class, resembling the way a cautious physician would proceed. Such an approach also has drawbacks – a decreased classification accuracy for the consult class.

Table 3. Misclassification costs

Outcome       Recommendation
              Discharge   Observation   Consult
Discharge     0           5             10
Observation   5           0             5
Consult       15          5             0
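One standard way to "introduce the costs into the classification scheme" is to let each classifier output posterior probabilities over the three outcomes and then recommend the disposition with the lowest expected cost; whether the authors used exactly this scheme is not stated, so the sketch below is only illustrative:

```python
import numpy as np

# rows: true outcome, columns: recommendation (discharge, observation, consult)
COSTS = np.array([[ 0,  5, 10],
                  [ 5,  0,  5],
                  [15,  5,  0]])

def cheapest_disposition(class_probs):
    """Recommendation minimizing the expected misclassification cost,
    given posterior probabilities over (discharge, observation, consult)."""
    expected = class_probs @ COSTS  # expected cost of each recommendation
    return int(np.argmin(expected))

# even with discharge as the most probable outcome, the cautious
# observation recommendation wins under these costs
print(cheapest_disposition(np.array([0.5, 0.3, 0.2])))  # -> 1 (observation)
```

This bias toward observation matches the behaviour reported in Table 4.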
Table 4. Classification accuracy of the CDSAs (costs considered)

CDSA             Overall   Discharge   Observation   Consult
Naive Bayesian   56.0%     59.6%       46.7%         54.6%
Case-based       49.0%     42.3%       60.0%         54.6%
Tree-based       55.0%     59.6%       40.0%         54.6%
Comparison of the performance of the cost-based CDSAs with the performance of the rule-based CDSA from the second phase favors the latter as the preferable solution for MET-AP. This CDSA offers high classification accuracies for the crucial decision classes (observation and consult) without decreasing the accuracy for the discharge class. It is important to note that a rule-based CDSA represents clinical knowledge similarly to practice guidelines, and is thus easy for clinicians to understand. Taking all of the above into account, we have chosen to implement the rule-based CDSA in MET-AP.
4 Discussion
Our experiment showed the better performance of the rule-based CDSA in terms of classification accuracy for the crucial decision classes (observation and consult), even when misclassification costs were introduced. This observation demonstrates the viability and robustness of the rule-based CDSA implemented in the MET-AP system. A more detailed evaluation of the results suggested that the learning data did not contain obvious classification patterns. This was confirmed by the large number of rules composing the rule-based CDSA (165 rules) and their relatively weak support – for each class there were only a few stronger rules. This might explain why the rule-based CDSA and the case-based CDSA, which consider cases in an explicit manner (a rule can be viewed as a generalized case), were overall superior to the other algorithms. In further studies we plan to investigate possible dependencies between the characteristics of the analyzed data and the data mining methods used to build CDSAs. This should enable us to select the most appropriate CDSA in advance, without the need for computational experiments. Moreover, it will be worth checking how the integration of different CDSAs impacts the classification accuracy.
Acknowledgments Research reported in this paper was supported by the NSERC-CHRP grant. It was conducted when J. Blaszczynski, Sz. Wilk, and D. Weiss were part of the MET Research at the University of Ottawa.
References
1. Musen, M., Shahar, Y., Shortliffe, E.: Clinical decision support systems. In Shortliffe, E., Perreault, L., Wiederhold, G., Fagan, L., eds.: Medical Informatics. Computer Applications in Health Care and Biomedicine. Springer-Verlag (2001) 573–609
2. Michalowski, W., Slowinski, R., Wilk, S., Farion, K., Pike, J., Rubin, S.: Design and development of a mobile system for supporting emergency triage. Methods of Information in Medicine 44 (2005) 14–24
3. Hanson, III, C.W., Marshall, B.E.: Artificial intelligence applications in the intensive care unit. Critical Care Medicine 29 (2001) 427–435
4. Kononenko, I.: Inductive and Bayesian learning in medical diagnosis. Applied Artificial Intelligence 7 (1993) 317–377
5. de Dombal, F.T., Leaper, D.J., Staniland, J.R., McCann, A.P., Horrocks, J.C.: Computer-aided diagnosis of acute abdominal pain. British Medical Journal 2 (1972) 9–13
6. Schmidt, R., Montani, S., Bellazzi, R., Portinale, L., Gierl, L.: Case-based reasoning for medical knowledge-based systems. International Journal of Medical Informatics 64 (2001) 355–367
7. Bichindaritz, I., Kansu, E., Sullivan, K.: Case-based reasoning in CAREPARTNER: Gathering evidence for evidence-based medical practice. In Smyth, B., Cunningham, P., eds.: Advances in Case-Based Reasoning: 4th European Workshop, Proceedings EWCBR-98, Berlin, Springer-Verlag (1998) 334–345
8. Krstacic, G., Gamberger, D., Smuc, T.: Coronary heart disease patient models based on inductive machine learning. In Quaglini, S., Barahona, P., Andreassen, S., eds.: Proceedings of 8th Conference on Artificial Intelligence in Medicine in Europe (AIME 2001), Berlin, Springer-Verlag (2001) 113–116
9. Grzymala-Busse, J.: LERS - a system for learning from examples based on rough sets. In Slowinski, R., ed.: Intelligent Decision Support - Handbook of Applications and Advances of the Rough Set Theory. Kluwer Academic Publishers, Dordrecht/Boston (1992) 3–18
10. John, G.H., Langley, P.: Estimating continuous distributions in Bayesian classifiers. In: Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, San Mateo (1995) 338–345
11. Aha, D., Kibler, D., Albert, M.: Instance-based learning algorithms. Machine Learning 6 (1991) 37–66
12. Quinlan, R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA (1993)
A Data Pre-processing Method to Increase Efficiency and Accuracy in Data Mining

Amir R. Razavi1, Hans Gill1, Hans Åhlfeldt1, and Nosrat Shahsavar1,2

1 Department of Biomedical Engineering, Division of Medical Informatics, Linköping University, Sweden
2 Regional Oncology Centre, University Hospital, Linköping, Sweden
{amirreza.razavi, hans.gill, hans.ahlfeldt, nosrat.shahsavar}@imt.liu.se
http://www.imt.liu.se
Abstract. In medicine, data mining methods such as Decision Tree Induction (DTI) can be trained for extracting rules to predict the outcomes of new patients. However, incompleteness and high dimensionality of stored data are a problem. Canonical Correlation Analysis (CCA) can be used prior to DTI as a dimension reduction technique to preserve the character of the original data by omitting non-essential data. In this study, data from 3949 breast cancer patients were analysed. Raw data were cleaned by running a set of logical rules. Missing values were replaced using the Expectation Maximization algorithm. After dimension reduction with CCA, DTI was employed to analyse the resulting dataset. The validity of the predictive model was confirmed by ten-fold cross validation and the effect of pre-processing was analysed by applying DTI to data without pre-processing. Replacing missing values and using CCA for data reduction dramatically reduced the size of the resulting tree and increased the accuracy of the prediction of breast cancer recurrence.
1 Introduction

In recent years, huge amounts of medical information have been saved every day in different electronic forms, such as Electronic Health Records (EHRs) [1] and registers. These data are collected and used for different purposes. Data stored in registers are used mainly for monitoring and analysing health and social conditions in the population. In Sweden, the long tradition of collecting information on the health and social conditions of the population has provided an excellent base for monitoring disease and social problems. The unique personal identification number of every inhabitant enables the linkage of exposure and outcome data spanning several decades and obtained from different sources [2]. The existence of accurate epidemiological registers is a basic prerequisite for monitoring and analysing health and social conditions in the population. Some registers are nation-wide, cover the whole Swedish population, and have been collecting data for decades. They are frequently used for research, evaluation, planning and other purposes by a variety of users [3]. Given the wide availability and comprehensiveness of registers, they can be used as a good source for knowledge extraction. Data mining methods can be applied to them
in order to extract rules for predicting the outcomes of new patients. The extraction of hidden predictive information from large databases is a potent method with an extensive capability to help physicians with their decisions [4]. A Decision Tree is a classifier in the form of a tree structure and is used to classify cases in a dataset [5]. A set of training cases with their correct classifications is used to generate a decision tree that, ideally, classifies each case in the test set correctly. The resulting tree is a representation that can be verified by humans and can be used by either humans or computer programs [6]. Decision Tree Induction (DTI) has been used in different areas of medicine, including oncology [7] and respiratory diseases [8]. Before the data undergo data mining, they must be prepared in a pre-processing step that removes or reduces noise and handles missing values. Relevance analyses for omitting unnecessary and redundant data, as well as data transformation, are needed for generalising the data to higher-level concepts [9]. Pre-processing techniques take the most effort and time, i.e. almost 80% of the whole project time for knowledge discovery in databases (KDD) [10]. However, replacing the missing values and finding a proper method for selecting important variables prior to data mining can make the mining process faster and even more stable and accurate. In this study the Expectation Maximization (EM) method was used for replacing the missing values. This algorithm is a parameter estimation method that falls within the general framework of maximum likelihood estimation and is an iterative optimisation algorithm [11]. Hotelling (1936) developed Canonical Correlation Analysis (CCA) as a method for evaluating the linear correlation between sets of variables [12]. The method allows investigation of the relationship between two sets of variables and identification of the important ones. It can be used as a dimension reduction technique to preserve the character of the original data stored in the registers by omitting data that are non-essential. The objective is to find a subset of variables with a predictive performance comparable to the full set of variables [13]. CCA was applied in the pre-processing step of Decision Tree Induction. DTI was trained with the cases with known outcomes and the resulting model was validated with ten-fold cross validation. In order to show the benefits of pre-processing, DTI was also applied to the same database without pre-processing, as well as after just handling missing values. For performance evaluation, the accuracy, sensitivity and specificity of the three models were compared.
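A compact sketch of EM imputation under a multivariate normal model (conditional-expectation E-step, re-estimated mean and covariance in the M-step); this is a generic illustration of the algorithm, not the exact implementation used in the study:

```python
import numpy as np

def em_impute(X, iters=50):
    """Fill NaN entries by EM under a multivariate normal model."""
    X = np.array(X, dtype=float)
    miss = np.isnan(X)
    # initialize missing entries with column means
    X[miss] = np.take(np.nanmean(X, axis=0), np.nonzero(miss)[1])
    for _ in range(iters):
        mu = X.mean(axis=0)            # M-step: re-estimate the mean...
        cov = np.cov(X, rowvar=False)  # ...and the covariance
        for i in range(X.shape[0]):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            # E-step: conditional expectation of missing given observed
            coef = cov[np.ix_(m, o)] @ np.linalg.pinv(cov[np.ix_(o, o)])
            X[i, m] = mu[m] + coef @ (X[i, o] - mu[o])
    return X

print(em_impute([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [4.0, 8.0]]))
```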
2 Materials and Methods Preparing the data for mining consisted of several steps: choosing proper databases, integrating them into one dataset, cleaning and replacing missing values, data transformation, and dimension reduction by CCA. To see the effect of the pre-processing, DTI was also applied to the same dataset without prior handling of missing values and dimension reduction.
2.1 Dataset In this study, data from 3949 female patients, mean age 62.7 years, were analysed. The earliest patient was diagnosed in January 1986 and the last one in September 1995, and the last follow-up was performed in June 2003. The data were retrieved from a regional breast cancer register operating in south-east Sweden. In order to cover more predictors and obtain a better assessment of the outcomes, data were retrieved and combined from two other registers, namely the tumour markers and cause of death registers. After combining the information from these registers, the data were anonymised for security reasons and to maintain patient confidentiality. If patients developed symptoms following treatment they were referred to the hospital; otherwise follow-up visits occurred at fixed time intervals for all patients. There were more than 150 variables stored in the resulting dataset after combining the databases. The first criterion for selecting appropriate variables for prediction was consulting domain experts. In this step, sets of suggested predictors and outcomes (Table 1) were selected. Age of the patient and variables regarding tumour specifications based on pathology reports, physical examination, and tumour markers were selected as predictors. There were two variables in the outcome set, distant metastasis and loco-regional recurrence, both observed at different time intervals after diagnosis indicating early and late recurrence. 2.2 Data Pre-processing After selecting appropriate variables, the raw data were cleaned and outliers were removed by running a set of logical rules. For handling missing values, the Expectation Maximization (EM) algorithm was used. The EM algorithm is a computational method for efficient estimation from incomplete data. In any incomplete dataset, the observed values provide indirect evidence about the likely values of the unobserved values. This evidence, when combined with some assumptions, comprises a predictive probability distribution for the missing values that should be averaged in the statistical analysis. The EM algorithm is a general technique for fitting models to incomplete data. EM capitalises on the relationship between missing data and the unknown parameters of a data model. When the parameters of the data model are known, then it is possible to obtain unbiased predictions for the missing values [14]. All continuous and ordinal variables except for age were transformed to dichotomised variables. The fundamental principle behind CCA is the creation of a number of canonical solutions [15], each consisting of a linear combination of one set of variables, Ui, and a linear combination of the other set of variables, Vi. The goal is to determine the coefficients that maximise the correlation between canonical variates Ui and Vi. The number of solutions is equal to the number of variables in the smaller set. The first canonical correlation is the highest possible correlation between any linear combination of the variables in the predictor set and any linear combination of the variables in the outcome set. Only the first CCA solution is used, since it describes most of the variations. The most important variables in each canonical variate were identified based on the magnitude of the structure coefficients (loadings). Using loadings as the criterion for
finding the important variables has some advantages [16]. As a rule of thumb for meaningful loadings, an absolute value equal to or greater than 0.3 is often used [17]. SPSS version 11 was used for transforming, handling missing values, and implementing CCA [18]. Table 1. List of variables in both sets
Predictor Set               Outcome Set ‡
Age                         DM, first five years
Quadrant                    DM, more than 5 years
Side                        LRR, first five years *
Tumor size                  LRR, more than 5 years *
LN involvement
LN involvement †
Periglandular growth *
Multiple tumors *
Estrogen receptor
Progesterone receptor
S-phase fraction
DNA index
DNA ploidy

Abbreviations: LN: lymph node, DM: Distant Metastasis, LRR: Loco-regional Recurrence. * from pathology report, † N0: Not palpable LN metastasis, ‡ all periods are time after diagnosis.
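A minimal sketch of the loading-based variable selection described above, assuming the predictor and outcome sets are available as numeric arrays; the use of scikit-learn's CCA, the programmatic application of the 0.3 cut-off, and all variable names are illustrative choices, not details taken from the study:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def select_by_loadings(X_pred, Y_out, threshold=0.3):
    """Keep variables whose structure coefficients (loadings) on the first
    canonical variate have an absolute value >= threshold."""
    cca = CCA(n_components=1)
    U, V = cca.fit_transform(X_pred, Y_out)          # first canonical variates
    # loadings = correlations between the original variables and their variate
    pred_load = np.array([np.corrcoef(X_pred[:, j], U[:, 0])[0, 1]
                          for j in range(X_pred.shape[1])])
    out_load = np.array([np.corrcoef(Y_out[:, j], V[:, 0])[0, 1]
                         for j in range(Y_out.shape[1])])
    rc = np.corrcoef(U[:, 0], V[:, 0])[0, 1]         # canonical correlation
    return np.abs(pred_load) >= threshold, np.abs(out_load) >= threshold, rc
```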
2.3 Decision Tree Induction One of the classification methods in data mining is decision tree induction (DTI). In a decision tree, each internal node denotes a test on variables, each branch stands for an outcome of the test, leaf nodes represent an outcome, and the uppermost node in a tree is the root node [19]. The algorithm uses information gain as a heuristic for selecting the variable that will best separate the cases into each outcome [20]. Understandable results of acquired knowledge and fast processing make decision trees one of the most frequently used data mining techniques [21]. Because of the approach used in constructing decision trees, there is a problem of overfitting the training data, which leads to poor accuracy in future predictions. The solution is pruning of the tree, and the most common method is post-pruning. In this method, the tree grows from a dataset until all possible leaf nodes have been reached,
and then particular subtrees are removed. Post-pruning yields smaller and more accurate trees [22]. In this study, DTI was applied to the reduced model resulting from CCA after handling missing values to achieve a predictive model for new cases. For the training phase of DTI, the predictors with an absolute loading ≥ 0.3 in CCA were used as input. As the outcome in DTI, the outcome variable with an absolute loading ≥ 0.3 in CCA was used, i.e. the occurrence of distant metastasis during the five-year period after diagnosis.
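For readers without access to WEKA, an analogous tree can be grown in Python; scikit-learn's cost-complexity pruning is used here as a stand-in for J48's post-pruning, and the pruning strength is an arbitrary illustrative value, not a setting reported in the paper:

```python
from sklearn.tree import DecisionTreeClassifier

def induce_tree(X_train, y_train, ccp_alpha=0.01):
    """Grow a tree on the CCA-selected predictors and post-prune it."""
    tree = DecisionTreeClassifier(criterion="entropy",   # information gain
                                  ccp_alpha=ccp_alpha,   # post-pruning strength
                                  random_state=0)
    tree.fit(X_train, y_train)
    print("leaves:", tree.get_n_leaves(), "tree size:", tree.tree_.node_count)
    return tree
```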
Fig. 1. Canonical structure matrix and loadings for the first solution
As a comparison, DTI was also applied to the dataset without handling the missing values and without dimension reduction by CCA (Table 1). In this analysis, any recurrence (either distant or loco-regional) at any time during the follow-up after diagnosis was used as the outcome. Another similar analysis was done after just handling missing values with the EM algorithm. DTI was carried out using the J48 algorithm in WEKA, which is a collection of machine learning algorithms for data mining tasks [23]. Post-pruning was done to trim the resulting tree. The J48 algorithm is the equivalent of the C4.5 algorithm written by Quinlan [5]. 2.4 Performance Comparison Ten-fold cross validation was done to confirm the performance of the predictive models. This method estimates the error that would be produced by a model. All cases were randomly re-ordered, and then the set of all cases was divided into ten mutually disjoint subsets of approximately equal size. The model was then trained and tested ten times. Each time it was trained on all but one subset, and tested on the remaining single subset. The estimate of the overall accuracy was the average of the ten individual accuracy measures [24]. The accuracy, sensitivity and specificity were used for
comparing the performance of the two DTIs, i.e. the one following pre-processing and the other without any dimension reduction procedure [25]. In addition, the size of the tree and the number of leaf nodes in each tree were also compared.
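The evaluation loop can be written in a few lines; the sketch assumes a binary 0/1 outcome vector and uses stratified folds to keep the class distribution stable, which is a simplifying assumption rather than the exact procedure of the paper:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def evaluate_ten_fold(model, X, y, n_splits=10, seed=0):
    """Average accuracy, sensitivity and specificity over ten folds."""
    acc, sens, spec = [], [], []
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        truth = y[test_idx]
        tp = np.sum((pred == 1) & (truth == 1))
        tn = np.sum((pred == 0) & (truth == 0))
        fp = np.sum((pred == 1) & (truth == 0))
        fn = np.sum((pred == 0) & (truth == 1))
        acc.append((tp + tn) / len(truth))
        sens.append(tp / (tp + fn))
        spec.append(tn / (tn + fp))
    return np.mean(acc), np.mean(sens), np.mean(spec)
```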
3 Result CCA was applied to the dataset and in each solution the loadings for predictor and outcome sets were calculated. The first solution and the loadings (in parentheses) for variables are illustrated in Figure 1. The canonical correlation coefficient (rc) is 0.49 (p ≤ .001). The important variables are shown in bold type.
Fig. 2. Extracted model from CCA in the pre-processing step
After the primary hypothetical model was refined by CCA and the important variables were found by considering their loadings, the selected variables were ready to be analysed by DTI in the mining step. The important outcome was distant metastasis during the first five years, and it was used as a dichotomous outcome in DTI. This model is illustrated in Fig. 2. DTI was also applied without the pre-processing step that included handling missing values and dimension reduction. Three different pre-processing approaches were used before applying DTI, and the accuracy, sensitivity, specificity, number of leaves in the decision tree and the tree size were calculated for the predictive models. The results are shown and compared in Table 2. Table 2. Results for the different approaches
DTI                              Accuracy   Sensitivity   Specificity   Number of Leaves   Tree Size
Without pre-processing           54%        83%           41%           137                273
With replacing missing values    57%        82%           46%           196                391
With pre-processing              67%        80%           63%           14                 27
In the analysis done after handling missing values and dimension reduction using CCA, the accuracy and specificity show improvement but the sensitivity is lower. The decision tree built after the pre-processing is also smaller, with fewer leaves and a smaller overall tree size.
4 Discussion An important step in discovering the hidden knowledge in databases is effective data pre-processing, since real-world data are usually incomplete, noisy and inconsistent [9]. Data pre-processing consists of several steps, i.e. data cleaning, data integration, data transformation and data reduction. It is often the case that a large amount of information is stored in databases, and the problem is how to analyse this massive quantity of data effectively, especially when the data were not collected for data mining purposes. For extracting specific knowledge from a database, such as predictions of the recurrence of a disease or of the risk of complications in certain patients, a set of important predictors is assumed to be of central importance, while other variables are not essential. These unimportant variables simply cause problems in the main analysis. They may contain noise, increase the prediction error and increase the need for computational resources [26]. Dimension reduction results in a subset of the original variables, which simplifies the model and reduces the risk of overfitting.

The first step in reducing the number of unimportant variables is discussion with domain experts who have been involved in collecting the data. Their involvement in the knowledge discovery and data mining process is essential [27]. In this study, the selection by domain experts was based on their knowledge and experience concerning variables related to the recurrence of breast cancer. In this step the number of variables was reduced from more than 150 to 21 (17 in the predictor set and 4 in the outcome set).

One of the problems in real-life databases is low-quality data, such as data with missing values. This is frequently encountered in medical databases, since most medical data are collected for purposes other than data mining [28]. One of the most common reasons for missing values is not a failure to measure or record them, but that the values were not necessary for the main purpose of the database. There are different methods for handling missing values, the most common of which is simply to omit any case or patient with missing values. However, this is not a good idea because it causes the loss of information. A case with just one missing value has other variables with valid values that can be useful in the analysis. Deleting a case is implemented as default in most statistical packages. The disadvantage is that this may result in a very small dataset and can lead to serious biases if the data are not missing at random. Methods such as imputations, weighting and model-based techniques can be used to replace the missing values [29]. In this study, replacing the missing values was done after the variables were selected by the domain experts. This results in fewer calculations and less effort, and a better analysis with CCA in the dimension reduction step.
In the CCA, unimportant variables were removed from the dataset by considering loadings as criteria. Loadings are not affected by the presence of strong correlations among variables and can also handle two sets of variables. After the dimension reduction by CCA, the number of variables suggested by the domain experts was reduced to 9 (8 + 1). The benefits can be more apparent when the number of variables is much higher. This small subset of the initial variables contains most of the information required for predicting recurrence of the cancer. DTI is a popular classification method because its results can be easily converted into rules or illustrated graphically. In the case of small and simple trees, even a paper copy of the tree can be used for prediction purposes. The results can be presented and explained to domain experts and can be easily modified by them for a higher degree of explainability. The pre-processing step is an important part of KDD and vital for successful data mining. In some papers, no specific pre-processing was noted before mining the data with DTI [8], and in others data discretization was performed [21]. We combined different techniques: substituting missing values, categorisation and dimension reduction. The results of the three DTIs that were performed (Table 2) show that the accuracy and specificity of the analysis with replacement of missing values and CCA prior to DTI are considerably better, but the sensitivity has decreased. The number of leaves and tree size show a considerable decrease. This simplifies the use of the tree by medical personnel and makes it easier for domain experts to study the tree for probable changes. This simplicity is gained along with an increase in the overall accuracy of the prediction, which shows the benefits of a well-studied pre-processing step. In this study, the role of dimension reduction by CCA in pre-processing is more apparent than that of handling missing values. This is shown by the better accuracy and reduced size of the tree after adding CCA to pre-processing. Combining proper methods for handling missing values and a dimension reduction technique before analysing the dataset with DTI is an effective approach for predicting the recurrence of breast cancer.
5 Conclusion High data quality is a primary factor for successful knowledge discovery. Analysing data containing missing values and redundant or irrelevant variables requires proper pre-processing before the application of data mining techniques. In this study, a pre-processing step including CCA is suggested. Data on breast cancer patients stored in the registers of south-east Sweden have been analysed. We have presented a method that consists of replacing missing values, consulting with domain experts and dimension reduction prior to applying DTI. The result shows an increased accuracy for predicting the recurrence of breast cancer. Using the described pre-processing method prior to DTI results in a simpler decision tree and increases the efficiency of predictions. This helps oncologists in identifying high-risk patients.
Acknowledgments This study was performed in the framework of Semantic Mining, a Network of Excellence funded by EC FP6. It was also supported by grant No. F2003-513 from FORSS, the Health Research Council in the South-East of Sweden. Special thanks to the South-East Swedish Breast Cancer Study Group for fruitful collaboration and support in this study.
References [1] Uckert, F., Ataian, M., Gorz, M., Prokosch, H. U.: Functions of an electronic health record. Int J Comput Dent 5 (2002) 125-32 [2] Sandblom, G., Dufmats, M., Nordenskjold, K., Varenhorst, E.: Prostate carcinoma trends in three counties in Sweden 1987-1996: results from a population-based national cancer register. South-East Region Prostate Cancer Group. Cancer 88 (2000) 1445-53 [3] Rosen, M.: National Health Data Registers: a Nordic heritage to public health. Scand J Public Health 30 (2002) 81-5 [4] Windle, P. E.: Data mining: an excellent research tool. J Perianesth Nurs 19 (2004) 355-6 [5] Quinlan, J. R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA (1993) [6] Podgorelec, V., Kokol, P., Stiglic, B., Rozman, I.: Decision trees: an overview and their use in medicine. J Med Syst 26 (2002) 445-63 [7] Vlahou, A., Schorge, J. O., Gregory, B. W., Coleman, R. L.: Diagnosis of Ovarian Cancer Using Decision Tree Classification of Mass Spectral Data. J Biomed Biotechnol 2003 (2003) 308-314 [8] Gerald, L. B., Tang, S., Bruce, F., Redden, D., Kimerling, M. E., Brook, N., Dunlap, N., Bailey, W. C.: A decision tree for tuberculosis contact investigation. Am J Respir Crit Care Med 166 (2002) 1122-7 [9] Han, J., Kamber, M.: Data Mining Concepts and Techniques. Morgan Kaufmann (2001) [10] Duhamel, A., Nuttens, M. C., Devos, P., Picavet, M., Beuscart, R.: A preprocessing method for improving data mining techniques. Application to a large medical diabetes database. Stud Health Technol Inform 95 (2003) 269-74 [11] McLachlan, G. J., Krishnan, T.: The EM algorithm and extensions. John Wiley & Sons (1997) [12] Silva Cardoso, E., Blalock, K., Allen, C. A., Chan, F., Rubin, S. E.: Life skills and subjective well-being of people with disabilities: a canonical correlation analysis. Int J Rehabil Res 27 (2004) 331-4 [13] Antoniadis, A., Lambert-Lacroix, S., Leblanc, F.: Effective dimension reduction methods for tumor classification using gene expression data. Bioinformatics 19 (2003) 563-70 [14] Dempster, A. P., Laird, N. M., Rubin, D. B.: Maximum Likelihood from Incomplete Data via the EM Algorithm. J R Stat Soc Ser B 39 (1977) 1-38 [15] Vogel, R. L., Ackermann, R. J.: Is primary care physician supply correlated with health outcomes? Int J Health Serv 28 (1998) 183-96 [16] Dunlap, W., Landis, R.: Interpretations of multiple regression borrowed from factor analysis and canonical correlation. J Gen Psychol 125 (1998) 397-407 [17] Thompson, B.: Canonical correlation analysis: Uses and interpretation. Sage, Thousand Oaks, CA (1984) [18] SPSS Inc.: SPSS for Windows. SPSS Inc. (2001)
[19] Pavlopoulos, S. A., Stasis, A. C., Loukis, E. N.: A decision tree--based method for the differential diagnosis of Aortic Stenosis from Mitral Regurgitation using heart sounds. Biomed Eng Online 3 (2004) 21 [20] Luo, Y., Lin, S.: Information gain for genetic parameter estimation with incorporation of marker data. Biometrics 59 (2003) 393-401 [21] Zorman, M., Eich, H. P., Stiglic, B., Ohmann, C., Lenic, M.: Does size really matter-using a decision tree approach for comparison of three different databases from the medical field of acute appendicitis. J Med Syst 26 (2002) 465-77 [22] Esposito, F., Malerba, D., Semeraro, G., Kay, J.: A comparative analysis of methods for pruning decision trees. IEEE Trans Pattern Anal Mach Intell 19 (1997) 476-491 [23] Witten, I. H., Frank, E.: Data Mining: Practical machine learning tools with Java implementations. Morgan Kaufmann, San Francisco (2000) [24] Kohavi, R.: A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In: Proc. International Joint Conference on Artificial Intelligence (1995) 1137-1145 [25] Delen, D., Walker, G., Kadam, A.: Predicting breast cancer survivability: a comparison of three data mining methods. Artif Intell Med In press (2004) [26] Pfaff, M., Weller, K., Woetzel, D., Guthke, R., Schroeder, K., Stein, G., Pohlmeier, R., Vienken, J.: Prediction of cardiovascular risk in hemodialysis patients by data mining. Methods Inf Med 43 (2004) 106-113 [27] Babic, A.: Knowledge discovery for advanced clinical data management and analysis. Stud Health Technol Inform 68 (1999) 409-13 [28] Cios, K. J., Moore, G. W.: Uniqueness of medical data mining. Artif Intell Med 26 (2002) 1-24 [29] Myrtveit, I., Stensrud, E., Olsson, U. H.: Analyzing data sets with missing data: an empirical evaluation of imputation methods and likelihood-based methods. IEEE Trans Softw Eng 27 (2001) 999-1013
Rule Discovery in Epidemiologic Surveillance Data Using EpiXCS: An Evolutionary Computation Approach

John H. Holmes1 and Jennifer A. Sager2

1 University of Pennsylvania School of Medicine, Philadelphia, PA, USA
[email protected]
2 University of New Mexico Department of Computer Science, Albuquerque, NM, USA
[email protected]
Abstract. This paper describes the architecture and application of EpiXCS, a learning classifier system that uses reinforcement learning and the genetic algorithm to discover rule-based knowledge in epidemiologic surveillance databases. EpiXCS implements several additional features that tailor the XCS paradigm to the demands of epidemiologic data and users who are not familiar with learning classifier systems. These include a workbench-style interface for visualization and parameterization and the use of clinically meaningful evaluation metrics. EpiXCS has been applied to a large surveillance database, and shown to discover classification rules similarly to See5, a well-known decision tree inducer.
1 Introduction Evolutionary computation encompasses a variety of machine learning paradigms that are inspired by Darwinian evolution. The first paradigm was the genetic algorithm, which is based on chromosomal knowledge representations and genetic operators to ensure "survival of the fittest" as a means of efficient search and optimization. A number of evolutionary computation paradigms have appeared since John Holland [3] first developed the genetic algorithm. Of particular interest to this investigation is the learning classifier system (LCS). Its natural, rule-based knowledge representation, accessible components, and easily understood output position the LCS above other evolutionary computation approaches as a robust knowledge discovery tool. LCSs are rule-based, message-passing systems that learn rules to guide their performance in a given environment. Numerous LCS architectures have been described, starting with CS-1 [3], the first system in the so-called Michigan approach. Another design, Smith's LS-1 system [9], is the first in a series of Pittsburgh approaches to learning classifier systems. Wilson's Animat [10] represents a departure from the Michigan approach. Since these early efforts, LCSs, and particularly XCS, have been a focus of a great deal of recent research activity [2, 6]. XCS is an LCS paradigm first developed by Wilson [11] and refined by others [2, 7] that diverges considerably from traditional Michigan-style LCS. First, classifier
fitness is based on the accuracy of the classifier's payoff prediction, rather than the prediction itself. Second, the genetic algorithm is restricted to niches in the action set, rather than applied to the classifier population as a whole. These two features are responsible for XCS's ability to evolve accurate, maximally general classifiers, much more so than earlier LCS paradigms. To date, no one has investigated XCS's possible use as a medical knowledge discovery tool, particularly in epidemiologic surveillance. As is typical of LCSs, the fundamental unit of knowledge representation in XCS is the classifier, which corresponds to an IF-THEN rule represented as a genotype-phenotype pair. The left-hand side, or genotype, is commonly referred to as a taxon, and the right-hand side is the action. Other components of the classifier include various indicators such as classifier fitness, prediction, birth date, numerosity (number of a classifier's instances in the classifier population), and classifier error. A more complete description of XCS is given in Butz and Wilson [1]. However, it is helpful to consider the schematic of the generic XCS architecture as shown in Figure 1.
Fig. 1. Schematic of generic XCS architecture (main components: Data Set, Input case, Detector, Classifier population [P], Match Set [M], Prediction Array, Action Set [A], Decision, Reward, Genetic algorithm)
At each time step, or iteration, a case is drawn from the data set (either training or testing) and passed through a detection algorithm that performs a matching function against the classifier population to create a Match Set [M]. Classifiers in [M] match the input case only on its taxon, so [M] may contain classifiers advocating any or all classes present in the problem domain. Using the classifiers in [M], a prediction array is created for each action. The values of the prediction array are based on fitnessweighted averages of the classifier predictions for each action; thus, for a two-class problem, there will be two elements in the prediction array, one for each possible action, or class. Actions can be selected from the array based on exploration (randomly) or exploitation (the action with the highest value). Classifiers in [M] with
actions similar to the action selected from the prediction array form the Action Set [A], and that action is used as the decision of the system. Classifiers in [A] are rewarded, using a variety of schemes. For single-step supervised learning problems, such as those found in data mining, the reward is simply some value applied to the prediction of the classifiers in [A]: correct decisions reward classifiers in [A] by adding to their prediction; incorrect decisions receive no reward. In addition to reward, classifiers in [A] are candidates for reproduction (with or without crossover and/or mutation) using the genetic algorithm (GA). Parents are selected by the GA based on their fitness, weighted by their numerosity. This niche-based approach represents a major departure from other LCS approaches, which apply the GA panmictically, or across the entire classifier population. The advantage of the XCS approach is that reproduction occurs within the subset of the classifier population that is relevant to an input case, and therefore it is faster, but also potentially more accurate. If the creation of a new classifier would result in exceeding the population size, a classifier is probabilistically deleted from the population, based on fitness. Thus, over time, selection pressure causes low-fitness classifiers to be removed from the population, in favor of high-fitness classifiers, and ultimately, an optimally generalized and accurate population of classifiers emerges, from which classification rules can be derived. During testing, the GA and reward subsystems are turned off, and the system then functions like a forward-chaining expert system.
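The matching and prediction-array mechanics just described can be condensed into a short sketch; the classifier encoding (a dict with a ternary condition string, an action, a payoff prediction and a fitness) and the '#' wildcard are simplifying assumptions, not the EpiXCS data structures:

```python
def matches(condition, case):
    """Ternary condition string; '#' acts as a 'don't care' symbol."""
    return all(c == '#' or c == bit for c, bit in zip(condition, case))

def classify(population, case):
    """One exploitation step: match set -> prediction array -> decision."""
    match_set = [cl for cl in population if matches(cl['condition'], case)]
    if not match_set:
        return None, []                       # indeterminate (unmatched) case
    prediction = {}                           # fitness-weighted payoff per action
    for action in {cl['action'] for cl in match_set}:
        advocates = [cl for cl in match_set if cl['action'] == action]
        fit = sum(cl['fitness'] for cl in advocates)
        prediction[action] = sum(cl['fitness'] * cl['prediction']
                                 for cl in advocates) / fit
    decision = max(prediction, key=prediction.get)
    action_set = [cl for cl in match_set if cl['action'] == decision]
    return decision, action_set

# illustrative call with a two-classifier population and a three-bit case
population = [{'condition': '1#0', 'action': 1, 'prediction': 900.0, 'fitness': 0.8},
              {'condition': '###', 'action': 0, 'prediction': 200.0, 'fitness': 0.4}]
print(classify(population, '110'))
```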
2 EpiXCS EpiXCS was developed by the authors to accommodate the needs of epidemiologists and other clinical researchers in performing knowledge discovery tasks on surveillance data [4, 5]. Typically, these data are large, in that they contain many features and/or observations, and in turn, they often contain numerous relationships between predictors and outcomes that can be expressed as rules. EpiXCS adheres to the architectural paradigm of XCS in general, but with several differences that make it more appropriate for epidemiologic research. These features include a workbench-style interface and the use of clinically relevant evaluation parameters. The EpiXCS Workbench. EpiXCS is intended to support the needs of LCS researchers as they experiment with such problems as parameterization, as well as epidemiologic researchers as they seek to discover knowledge in surveillance data. The EpiXCS Workbench is shown in Figure 2. The workbench supports multitasking to allow the user to run several experiments with different learning parameters, data input files, and output files. For example, this feature could be used to compare different learning parameter settings or to run different datasets at the same time. Visual cues convey the difference between experiments and runs to the user. Each simultaneous experiment is displayed in a separate child window. The learning parameters can be set for each individual experiment at run-time by means of a dialog box. Each run is displayed on a separate tab in the parent experiment's window. Each run tab displays the evaluation (obtained at each 100th iteration) of accuracy, sensitivity, specificity, area under the receiver operating characteristic curve, indeterminate rate (the proportion of training or testing cases that
can’t be matched), learning rate, and positive and negative predictive values in both graphical and textual form. These values help the user to determine how well the system is learning and classifying the data. Since the overall results of a batch run may be of interest to the user, additional tabs are provided for the training and testing evaluations. These tables include the mean and standard deviation of the aforementioned metrics calculated for all runs in the experiment.
Fig. 2. The EpiXCS Workbench. The left pane is an interface to the learning parameters which are adjustable by the user prior to runtime. The middle pane contains classification metrics, obtained on each 100th iteration and at testing, as well as a legend to the graph in the right pane, which displays classification performance in real time during training (here at iteration 18600). The tabs at the bottom of the figure are described in the text
Rule Visualization. EpiXCS affords the user two means of rule visualization. The first of these is a separate listing of rules predicting positive and negative outcomes, respectively, ranked by their predictive value. Each rule is displayed in easily read IF-THEN format, with an indication of the number of classifiers in the population which match the rule, as well as the number of those classifiers which predict the opposite outcome. This ratio provides an indication of the relative importance of the rule: even though a rule might have a high predictive value, if it matches only a few of the population classifiers, it is likely to be less general and potentially less important than if it matched multiple classifiers. On the other hand, it is entirely possible that such a rule
could be "interesting," in that it represents an anomaly or outlier. As a result, the researcher would do well not to ignore such low-match rules. The second means of rule visualization provides a graphical representation of each rule's conjuncts using an X-Y graph. The X-axis represents the relative value of a specific conjunct within the range, from minimum to maximum, of the feature represented by that conjunct. The Y-axis represents each feature in the dataset. The user can select any number of rules to visualize. This tool is most useful when visualizing all rules, to develop a sense of the features and their values that seem to cluster for a particular outcome (Figures 3 and 4). Evaluation Metrics. EpiXCS uses a variety of evaluation metrics, most of which are well-known to the medical decision making community. These include sensitivity, specificity, area under the receiver operating characteristic curve (AUC), and positive and negative predictive values. In addition, two other metrics are employed in EpiXCS: indeterminate rate, or IR, and learning rate, or λ. The IR indicates the proportion of input cases that can't be matched in the current population state. The IR will be relatively high in the early stages of learning, but should decrease as learning progresses and the classifier population becomes more generalized. While important for understanding the degree of generalization, the IR is also used for correcting the other evaluation metrics. The learning rate, λ, indicates the velocity at which 95% of the maximum AUC is attained, relative to the number of iterations.
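Apart from the AUC, the metrics listed above reduce to simple counts over the test cases; the sketch below assumes binary outcomes and treats unmatched cases as indeterminate, without attempting the IR-based correction mentioned in the text:

```python
def epi_metrics(pairs):
    """pairs: (truth, prediction) tuples; prediction is None when no classifier
    in the current population matched the case (an indeterminate case)."""
    matched = [(t, p) for t, p in pairs if p is not None]
    ir = 1 - len(matched) / len(pairs)           # indeterminate rate
    tp = sum(1 for t, p in matched if t == 1 and p == 1)
    tn = sum(1 for t, p in matched if t == 0 and p == 0)
    fp = sum(1 for t, p in matched if t == 0 and p == 1)
    fn = sum(1 for t, p in matched if t == 1 and p == 0)
    return {'indeterminate rate': ir,
            'sensitivity': tp / (tp + fn) if tp + fn else None,
            'specificity': tn / (tn + fp) if tn + fp else None,
            'PPV': tp / (tp + fp) if tp + fp else None,
            'NPV': tn / (tn + fn) if tn + fn else None}
```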
3 Applying EpiXCS to Epidemiologic Surveillance Research Questions. This investigation focuses on two research questions. The first is the epidemiologic question: What features are associated with fatality in teenaged (15 to 18 years of age) automobile drivers? To answer this question, EpiXCS was applied to the FARS database and compared with a well-known rule inducer, See5 [8].

Table 1. Data elements in the final FARS dataset

Age                   Age in years, 15 to 18
Sex                   Sex of driver
Restraint use         Type of restraint used
Air bag deployment    Air bag deployed in crash, and if so, type of airbag
Ejection              Driver ejection from vehicle: No ejection, partial, complete
Body type             Type of vehicle
Rollover              Vehicle rollover during crash: No; once, multiple rollovers
Impact1               Initial impact, based on clock face (12 = head-on)
Impact2               Principal impact, based on clock face
Fire                  Fire or explosion at time of crash
Road function         Major artery, minor artery, street, classed by urban/rural
Manner of collision   Front-to-front, front-to-side, sideswipe, front-to-rear
Fatal                 Driver died: Yes, No

Data: Fatality Analysis Reporting System. The Fatality Analysis Reporting System (FARS) is a census of all fatal motor vehicle crashes occurring in the United States
and Puerto Rico, and is available at http://www-fars.nhtsa.dot.gov/. Records in FARS represent crashes involving a motor vehicle with at least one death. In order to address the first research question, records were selected from the 2003 FARS Person File (the most recent version of the data available) using the following criteria: drivers of passenger automobiles, aged 15-18 years. Thus drawn, the dataset contained 1,997 fatalities and 2,427 non-fatalities. Additional criteria were applied to select data elements thought to be important for answering the first research question. The final dataset contained the data elements shown in Table 1, and was partitioned into training and testing sets, such that the original class distribution was maintained. Experimental Procedure. EpiXCS was trained over 30,000 iterations using a population size of 1,000. The values for all other parameters were those described in Butz and Wilson [1]. See5 was trained using 10 boosting trials. Rules were examined for clinical plausibility and accuracy.
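The record selection and the class-preserving partition correspond to a simple filter followed by a stratified split; the file name and column names below are hypothetical placeholders, since the real FARS field names differ:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

persons = pd.read_csv("fars_2003_person.csv")           # hypothetical extract
teen_drivers = persons[persons["Age"].between(15, 18)
                       & (persons["PersonType"] == "driver")
                       & (persons["BodyType"] == "passenger car")]
# stratify on the outcome so the fatal/non-fatal ratio is kept in both partitions
train, test = train_test_split(teen_drivers, test_size=0.5,
                               stratify=teen_drivers["Fatal"], random_state=0)
```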
4 Results 4.1 Rule Discovery Rules Discovered by EpiXCS. A total of 30 rules advocating the positive class and 21 rules advocating the negative class were discovered. The graphic representations
Fig. 3. Graphical representation of all 30 class-positive (fatality) rules discovered by EpiXCS on FARS data restricted to teenaged passenger car drivers. Three sample rules are shown in the bottom pane. Feature values are shown for all 30 rules in the top pane, and the features are referenced in the left-hand pane
of the positive and negative rules are shown in Figures 3 and 4, respectively. Several differences stand out. First, the youngest (15 year-old) drivers do not appear in the non-fatal rules. Second, vehicle fire, and ejection from the vehicle, either partial or total, appear only in the fatal rules. Third, the initial and principal impact points cluster on the driver’s side of the vehicle in the fatal class.
Fig. 4. Graphical representation of all 21 class-negative (non-fatality) rules discovered by EpiXCS on FARS data restricted to teenaged passenger car drivers
Comparison with Rules Discovered by See5. See5 discovered 415 rules, 262 in the fatal class, and 138 in the non-fatal class. Nearly all of the rules (97%) demonstrated predictive values of 1.0, but applied to very small numbers of training cases (<15), indicating that even with rule pruning, See5 required many more rules than EpiXCS to cover the problem domain. This was also evident in the large number of rules that contained conjuncts repeated from rule to rule. Even though the final See5 rule set was large, an experienced automotive injury epidemiologist would have considerable difficulty identifying rules of importance to the clinical research question. In this rule set, several features emerged as potential associations with the outcome, as shown in Table 2, which compares the prominent features found by the two rule discovery methods, and demonstrates considerable agreement except for driver restraint and the youngest drivers.
Table 2. Key features associated with the classes, as identified by rule set inspection. Discrepancies are shown in boldface
Features                  Non-fatal cases        Fatal cases
                          EpiXCS     See5        EpiXCS     See5
Driver restrained         Yes        Yes         No         Yes
Age=15                    Yes        Yes         Yes        No
Road type: urban          Yes        Yes         No         Yes
Road type: rural          Yes        Yes         Yes        Yes
Fire or explosion         No         No          Yes        Yes
Ejection from vehicle     No         No          Yes        Yes
Driver's side impact      No         No          Yes        Yes
4.2 Classification of Novel Cases The classification accuracy of EpiXCS and See5 is comparable, as shown in Table 3. The differences in AUC and NPV are not significant, although PPVs are (p=0.04). Table 3. Classification accuracy metrics for EpiXCS and See5 on the FARS testing set
          AUC    PPV    NPV
EpiXCS    .66    .77    .66
See5      .70    .70    .64
5 Discussion and Conclusion Both EpiXCS and See5 were able to discover clinically meaningful classification rules in the 2003 FARS dataset. However, there were qualitative differences in the rules, as well as the number of rules discovered. For example, EpiXCS discovered several important feature-outcome relationships that were missed by See5. These included driver restraint status, young driver age, and type of roadway. EpiXCS discovered a total of 51 rules, comprising 30 fatality rules and 21 non-fatality rules. This is a manageable number of rules for qualitative analysis, but in order to reach this small rule set, a small tradeoff had to be considered in terms of classification accuracy when applying the rules to novel FARS data. Given that the tradeoff was primarily seen in the positive predictive value, and that the dataset had an unbalanced class distribution in favor of the non-fatal class, opting for a smaller rule set may result in even more compromised predictive values with a larger or different dataset. This issue is clearly one demanding further attention in future work. In comparison, See5 discovered over 400 rules over 10 boosts, after pruning. While the classification accuracy at the end of the 10th boost is marginally better than that of EpiXCS, one is left with sifting through a very large rule set, the members of which, like those in the EpiXCS rule set, apply to small numbers of cases. This is not to say that See5's rules are not useful. Rather, they would be more useful if
visualized through the same type of tool used in EpiXCS, and this too is a line of research for future work. Finally, there is the fundamental issue of whether or not the rules from EpiXCS or See5 are useful in a clinical sense. Clearly, both systems induced rules that were clinically plausible. For example, it is reasonable to suggest that teen drivers who are not properly restrained are more likely to die in a crash where at least one fatality occurred. In addition, it is reasonable to posit that crashes where the teen driver's vehicle was hit from the driver's side are likely to result in that driver's demise. Finally, fires and ejection of the driver from the vehicle would be plausibly associated with teen driver fatality. However, some of the rules appear to be equivocal, even unnervingly so. For example, one would assume that an airbag exposure would protect the driver, yet there is little difference on this among the EpiXCS rules in either class. The See5 rules on airbag deployment are just as equivocal, although the deployment of side air bags occurred more frequently in these rules, albeit equally in both classes. In conclusion, EpiXCS and See5 clearly discover rules in epidemiologic surveillance data, and most of these are clinically plausible. However, EpiXCS provides a workbench-style interface with a rule visualization tool that may provide a degree of marginal utility for LCS and epidemiologic researchers alike. Characterizing this utility is a focus of ongoing research by the authors.
References 1. Butz MV and Wilson SW: An algorithmic description of XCS. Soft Computing 6:144-52, 2002. 2. Butz MV, Kovacs T, Lanzi PL, and Wilson SW: Toward a theory of generalization and learning in XCS. IEEE Transactions on Evolutionary Computation, 8(1):28-46, 2004. 3. Holland, JH; Reitman, J: Cognitive systems based in adaptive algorithms. Waterman, D; Hayes-Roth, F (eds.). Pattern-directed inference systems. New York: Academic Press, 1978. 4. Holmes, JH, Lanzi, PL, Stolzmann W, and Wilson SW: Learning classifier systems: new models, successful applications. Information Processing Letters, 82(1), 23-30, 2002. 5. Holmes JH and Bilker WB: The effect of missing data on learning classifier system classification and prediction performance. 6. Advances in Learning Classifier Systems. Lecture Notes in Artificial Intelligence. Lanzi PL, Stolzmann W, and Wilson SW (eds.). Berlin, Springer Verlag, Vol. 2661: 46-60, 2003. 7. Lanzi PL: Learning classifier systems from a reinforcement learning perspective. Soft Computing, 6(3-4):162-70, 2002. 8. Rulequest Systems, www.rulequest.com 9. Smith S: A learning system based on genetic algorithms. Ph.D. dissertation, University of Pittsburgh, 1980. 10. Wilson, SW: Knowledge growth in an artificial animal. Grefenstette, JJ. Proceedings of the First International Conference on Genetic Algorithms, Lawrence Erlbaum Associates; 16-23, 1985. 11. Wilson SW: Classifier fitness based on accuracy. Evolutionary Computation. 3(2):149175, 1995.
Subgroup Mining for Interactive Knowledge Refinement

Martin Atzmueller1, Joachim Baumeister1, Achim Hemsing2, Ernst-Jürgen Richter2, and Frank Puppe1

1 Department of Computer Science, University of Würzburg, 97074 Würzburg, Germany
{atzmueller, baumeister, puppe}@informatik.uni-wuerzburg.de
2 Department of Prosthodontics, University of Würzburg, 97070 Würzburg, Germany
{hemsing_a, richter_e}@klinik.uni-wuerzburg.de
Abstract. When knowledge systems are deployed into a real-world application, then the maintenance of the knowledge is a crucial success factor. In the past, some approaches for the automatic refinement of knowledge bases have been proposed. Many only provide limited control during the modification and refinement process, and often assumptions about the correctness of the knowledge base and case base are made. However, such assumptions do not necessarily hold for real-world applications. In this paper, we present a novel interactive approach for the user-guided refinement of knowledge bases. Subgroup mining methods are used to discover local patterns that describe factors potentially causing incorrect behavior of the knowledge system. We provide a case study of the presented approach with a fielded system in the medical domain.
1 Introduction
In the medical domain knowledge systems are commonly built manually by domain specialists. When such systems are deployed into a real-world application, then often the correctness needs to be improved according to the practical requirements. In the past, many approaches for the automatic refinement of knowledge bases have been proposed [1, 2, 3, 4]. However, such methods make two important assumptions that do not necessarily hold in a real-world setting. The first assumption states that the considered knowledge base is mainly correct and only requires minor modifications in the refinement step, i.e., the tweak assumption. This assumption does not hold if the development of the knowledge base is in an earlier stage, and if corrections or extensions are still necessary. As the second assumption a collection of correctly solved test cases is expected. These cases are used by the methods for identifying guilty (faulty) elements in the knowledge base, that are the target for refinement in a subsequent step. Unfortunately, this assumption is not valid in our setting since the available cases were manually entered. Although the user is guided by an adaptive dialog during the case acquisition phase, and consistency checks are applied, we frequently experienced falsely entered findings in our case study. In this paper, we present a novel approach for the user-guided refinement of knowledge bases. The proposed method supports the user to perform the correct refinements
in an interactive process. This is especially important if the formalized knowledge is still incomplete, i.e., no tweak assumption for the underlying knowledge base can be made. In such circumstances, extensions and not only modifications of the knowledge base are necessary. Furthermore, if manually acquired case bases are used to refine knowledge systems, then the applied case base may contain incorrectly solved cases, e.g., due to incorrectly entered findings or solutions. Additionally, it is possible that automatic methods overfit the learned (refinement) knowledge by over-generalization or over-specialization. This problem is increased by the presence of incorrectly solved cases. Then, automatic refinements may not be acceptable for the expert. In the presented approach subgroup mining methods are used to discover local patterns that describe factors potentially causing incorrect behavior of the knowledge system. It is important that no global refinement model of the knowledge base is generated but refinement operators are proposed based on a local model. The proposed method keeps the domain specialist in control of all steps of the refinement process. The user is supported by visualization techniques to easily interpret the (intermediate) results. The rest of the paper is organized as follows: In Section 2 we introduce subgroup mining and its application for the refinement task. In Section 3, we present the subgroup driven interactive refinement process: we discuss the refinement steps, a visualization technique and related work. Finally, we provide a case study of the presented approach with a fielded system in the medical domain in Section 4. A summary of the paper is given in Section 5.
2 Subgroup Mining
In this section, we first introduce our knowledge representation, and we describe the basics of the subgroup mining approach. After that, we introduce the adaptation of subgroup mining to the interactive refinement process. General Definitions. Let ΩD be the set of all diagnoses and ΩA the set of all attributes. For each attribute a ∈ ΩA a range dom(a) of attribute values is defined. Furthermore, we assume VA to be the (universal) set of attribute values (findings, observations) of the form (a : v), where a ∈ ΩA is an attribute and v ∈ dom(a) is an assignable value. For each diagnosis d ∈ ΩD we define a (boolean) range dom(d): ∀d ∈ ΩD : dom(d) = {established, not established}. Let CB be the case base containing all available cases. A case c ∈ CB is defined as a tuple c = (Vc, Dc), where Vc ⊆ VA is the set of attribute values observed in the case c. The set Dc ⊆ ΩD is the set of diagnoses describing the solution of this case. The occurrence of a diagnosis d in a case c indicates the value established. The value not established does not occur in our case base. Thus, VF = VA ∪ ΩD denotes the (universal) set of all possible "generalized" attribute values of the case base CB. A diagnosis d ∈ ΩD is derived using (heuristic) rules. A rule r can be considered as a triple ⟨cond(r), conf(r), d⟩, where cond(r) is the condition of the rule, conf(r) is the confirmation strength (points), and d ∈ ΩD is a diagnosis. Thus a rule r = cond(r) → d, conf(r) is used to derive the diagnosis d, where the rule condition cond(r) contains conjunctions and/or disjunctions of (negated) generalized findings fi ∈ VF. The state of a diagnosis is gradually inferred by summing all the
confirmation strengths (points) of the rules that have fired; if the sum is greater than a specific threshold value, then the diagnosis is assumed to be established.
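A diagnosis score of this kind can be computed directly from the rule representation; the sketch below encodes rule conditions as plain Python predicates over the case findings, which is a simplification of the (negated) conjunctions and disjunctions allowed in the knowledge base, and the example rule and threshold are purely illustrative:

```python
def establish(diagnosis, findings, rules, threshold):
    """Sum the confirmation strengths (points) of all fired rules for a
    diagnosis and compare the total against the diagnosis threshold."""
    score = sum(r['points'] for r in rules
                if r['diagnosis'] == diagnosis and r['condition'](findings))
    return score > threshold

# illustrative rule: confirms 'extraction' when attachment loss is strong
rules = [{'diagnosis': 'extraction',
          'condition': lambda f: f.get('attachmentloss') == 'strong',
          'points': 40}]
print(establish('extraction', {'attachmentloss': 'strong'}, rules, threshold=30))
```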
2.1 Basic Subgroup Mining
Subgroup mining [5, 6] is a method to discover "interesting" subgroups of cases; e.g., in the domain of dental medicine the subgroup "teeth with a strong attachment loss and an increased degree of tooth laxity" has a significantly higher share of "extracted teeth" than the total population. The main application areas of subgroup mining are exploration and descriptive induction: subgroups are described by relations between independent (explaining) variables and a dependent (target) variable rated by a certain interestingness measure. A subgroup mining task mainly relies on four properties: the target variable, the subgroup description language, the quality function, and the search strategy. We will focus on binary target variables. The description language specifies the individuals from the reference population belonging to the subgroup. Definition 1 (Subgroup Description). A subgroup description sd = {ei} consists of a set of selection expressions (selectors) ei = (ai, Vi) that are selections on domains of attributes, where ai ∈ ΩA, Vi ⊆ dom(ai). A subgroup description is defined as the conjunction of its contained selection expressions. We define Ωsd as the set of all possible subgroup descriptions. A quality function measures the interestingness of the subgroup. Several quality functions were proposed, for example in [6, 7]. Definition 2 (Quality Function). A quality function q : Ωsd × VF → R evaluates a subgroup description sd ∈ Ωsd given a target variable t ∈ VF. It is used by the search method to rank the discovered subgroups during search. For binary target variables, examples for quality functions are given by

qBT = ((p − p0) / √(p0 · (1 − p0))) · √n · √(N / (N − n)),    qTP = (p · n) / ((1 − p) · n + g),

where p is the relative frequency of the target variable in the subgroup, p0 is the relative frequency of the target variable in the total population, N = |CB| is the size of the total population, and n denotes the size of the subgroup. For the quality function qTP the generalization parameter g trades off the number of true positives (p · n) vs. the number of false positives ((1 − p) · n). For a low value of g fewer false positives are tolerated. Considering an automatic subgroup mining approach an efficient search strategy is necessary, since the search space is exponential concerning all possible selection expressions. Commonly, a beam search strategy is used because of its efficiency. We use a modified beam search strategy, where a subgroup description can be selected as the initial value for the beam. Beam search adds selection expressions to the k best subgroup
descriptions in each iteration. Iteration stops, if the quality as evaluated by the quality function q does not improve any further. For the characterization of the discovered subgroups we have two alternatives: Besides the principal factors contained in the subgroup description there are also supporting factors. These are generalized findings supp ⊆ VF contained in the subgroup, which are characteristic for the subgroup, i.e., the value distributions of their corresponding attributes (supporting attributes) differ significantly comparing two populations: the true positive cases contained in the subgroup and non-target class cases contained in the total population. In addition to the principal factors the supporting factors can also be used to statistically characterize a discovered subgroup, as described, e.g. in [8].
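A compact version of the described beam search, using the binomial-test quality function qBT from Section 2.1; the case representation (a list of attribute-value dicts with a boolean target attribute), the beam width and the maximum description length are illustrative assumptions:

```python
import math

def q_bt(p, p0, n, N):
    if n == 0 or n == N or p0 <= 0.0 or p0 >= 1.0:
        return float('-inf')
    return ((p - p0) / math.sqrt(p0 * (1 - p0))) * math.sqrt(n) * math.sqrt(N / (N - n))

def quality(description, cases, target):
    covered = [c for c in cases if all(c[a] == v for a, v in description)]
    if not covered:
        return float('-inf')
    N, n = len(cases), len(covered)
    p0 = sum(1 for c in cases if c[target]) / N
    p = sum(1 for c in covered if c[target]) / n
    return q_bt(p, p0, n, N)

def beam_search(cases, target, selectors, k=5, max_len=3):
    """Greedy beam search over conjunctions of (attribute, value) selectors."""
    beam, best, best_q = [()], (), float('-inf')
    for _ in range(max_len):
        candidates = {tuple(sorted(d + (s,)))
                      for d in beam for s in selectors if s not in d}
        if not candidates:
            break
        beam = sorted(candidates, key=lambda d: quality(d, cases, target),
                      reverse=True)[:k]
        top_q = quality(beam[0], cases, target)
        if top_q <= best_q:        # quality does not improve any further
            break
        best, best_q = beam[0], top_q
    return best, best_q
```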
2.2 Subgroup Mining for the Refinement Task
For subgroup mining we consider a binary target variable corresponding to a diagnosis d, that is true (established) for incorrectly solved cases. Then, we try to identify subgroups with a high share of this "error" target variable. However, we need to distinguish different error analysis states relating to the measures false positives FPd(CB), false negatives FNd(CB), and the total error ERRd(CB):

FPd(CB) = {c | CDc ≠ ∅ ∧ d ∈ SDc ∧ d ∉ CDc},
FNd(CB) = {c | CDc ≠ ∅ ∧ d ∉ SDc ∧ d ∈ CDc},
ERRd(CB) = {c | CDc ≠ ∅ ∧ SDc ≠ CDc},

where CDc are the correct diagnoses of the case c, and SDc are the diagnoses derived by the system. It is easy to see that we want to minimize the measures for the (general) refinement task, while we want to maximize the measures for the discovered subgroups that are then used as candidates for refinement. To identify the "potential faulty factors" PFF we consider the subgroup descriptions of the discovered subgroups containing a high share of falsely solved cases. Then there are two options: the interesting factors are always the principal factors describing the subgroup, i.e., the attribute values contained in the subgroup description. Additionally, also the supporting factors of the subgroup can be faulty factors, since their distribution differs significantly considering the incorrectly and correctly solved cases. Then, the potential faulty factors PFF are defined as follows: PFF = { f | f is a principal or supporting factor }. For the refinement task we also apply static test knowledge, i.e., immutable validation constraints, to detect inconsistent behavior of the knowledge system. These constraints are provided by the domain specialist as subgroups for which a specific diagnosis should always be derived. Then, by assessing the distribution of the diagnoses contained in these subgroups, we can validate the state of the knowledge base or the case base directly. Furthermore, after a refinement step has been performed, the test knowledge is always checked again, in order to exclude modifications which degrade the performance of the system. Examples of the static test knowledge are given in the case study in Section 4.
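The three analysis measures translate into simple set comprehensions; the encoding of a case as a dict carrying the system-derived diagnoses SD and the correct diagnoses CD is an assumption made for illustration:

```python
def error_measures(case_base, d):
    """FPd, FNd and ERRd for diagnosis d, restricted to cases whose correct
    solution is known (CD non-empty)."""
    solved = [c for c in case_base if c['CD']]
    fp = [c for c in solved if d in c['SD'] and d not in c['CD']]
    fn = [c for c in solved if d not in c['SD'] and d in c['CD']]
    err = [c for c in solved if set(c['SD']) != set(c['CD'])]
    return fp, fn, err
```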
3 The Subgroup-Driven Interactive Refinement Process
In this section, we introduce the process for interactive knowledge refinement and its characteristics. We present the visualization method which provides an easy interpretation of intermediate results. Finally, we discuss related work.

3.1 Subgroup Mining for Interactive Knowledge Refinement – Process Model
For the interactive refinement process we apply the subgroup mining method to discover local patterns describing a set of incorrectly solved cases. We aim to discover subgroups with a high share of incorrectly solved cases. The incremental process depicted in Figure 1 mainly consists of six steps:

1. Consider a diagnosis d ∈ ΩD, and select an analysis state e ∈ {FPd, FNd, ERRd}.
2. A set of subgroups SGSe is mined, either interactively by the domain specialist, or automatically by the system. Then, for each subgroup SGi ∈ SGSe a set of potential faulty factors PFFi contained in SGi is retrieved.
3. The subgroup descriptions/factors are interpreted by the domain specialist.
4. Based on the analysis of the potential faulty factors, guilty (faulty) elements in the knowledge base or the case base are identified, and appropriate modification steps are applied. Then, the solutions of each case in the case base are recomputed.
5. The (changed) state of the system is assessed: the analysis measure e is checked for improvements; similarly, the immutable validation constraints, if available, are tested to check whether they still indicate a valid state.
6. If necessary, restart the process.
Fig. 1. Process Model: Subgroup Mining for Interactive Knowledge Refinement (the cycle connects the Case Base and Knowledge Base with Subgroup Mining, the resulting Subgroups/Potential Faulty Factors, the step Analyze Patterns/Factors, and the Refinement Operators that are applied back to the knowledge base and case base)
Refinement operators can either modify the knowledge base or the applied case base. The knowledge base is usually adapted to fit the available correct cases. The case base is adapted, if particular cases are either wrong or they denote an extraordinary, exceptional state, which should not be modelled in the knowledge base. For the different refinement operators we need to distinguish two cases: if the expert decides that the subgroup descriptions are valid, i.e., they are reasonable, then probably the knowledge base needs to be corrected. Otherwise, if the subgroup descriptions, i.e., the combination of factors are not meaningful, then this can imply that the contained cases need corrections. However, these "doubtful" subgroups could also be caused by random correlations in
the case base. In this case, the expert needs to manually assess the subgroups and cases in detail. In summary, the following refinements can be performed (a minimal sketch of these operators follows the list):

– Adapt/modify rules: generalize or specialize conditions and/or actions. This action is often appropriate if only one selector is contained in the subgroup, and if the subgroup is assessed to be valid.
– Extend knowledge: add missing relations to the knowledge base. This operator is often applicable when the subgroup description consists of more than one selector, and if the dependencies between the selectors are meaningful.
– Fix case: correct the solution of a single case, or correct the findings of a case, if the domain specialist concludes in a detailed case analysis that the case has been labeled with the wrong solution.
– Exclude case: exclude a case completely from the analysis. If the behavior modeled by the case cannot be explained by factors inherent in the knowledge base, e.g., by external decisions, then the case should be removed. This happened in our case study only for a small number of cases.

Examples of the application of the refinement operators are given in Section 4.
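As a rough illustration only, the four operators can be thought of as simple edits on a rule base and a case base. The data structures below are assumptions chosen for the example and do not mirror the knowledge representation of the actual system.

```python
# Hedged sketch of the four refinement operators as functions over toy dicts.
def modify_rule(knowledge_base, rule_id, new_condition=None, new_action=None):
    """Adapt/modify rules: generalize or specialize condition and/or action."""
    rule = knowledge_base[rule_id]
    if new_condition is not None:
        rule["condition"] = new_condition
    if new_action is not None:
        rule["action"] = new_action

def extend_knowledge(knowledge_base, rule_id, condition, action):
    """Extend knowledge: add a missing relation as a new rule."""
    knowledge_base[rule_id] = {"condition": condition, "action": action}

def fix_case(case_base, case_id, solution=None, findings=None):
    """Fix case: correct the stored solution and/or findings of a case."""
    case = case_base[case_id]
    if solution is not None:
        case["solution"] = solution
    if findings is not None:
        case["findings"].update(findings)

def exclude_case(case_base, case_id):
    """Exclude case: remove a case that cannot be explained by the model."""
    del case_base[case_id]
```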
3.2 Visualizing Subgroups and Interesting Factors
If the user is not supported by visualization techniques, then an interactive refinement approach typically is not tractable, since the refinement space is usually large. Therefore, we provide visualization methods that enable the user to browse the space of subgroup hypotheses, while testing the hypotheses interactively. This process can also be supported by automatic subgroup mining methods that provide an initial starting point for further interaction. Additionally, visualization techniques simplify the interpretation of the subgroup mining results. Furthermore, they should guide the user in the right direction for refinement. An exemplary visualization is shown in Figure 2, where the distributions of several factors are given.
Fig. 2. Visualizing Subgroups and Interesting Factors (in German)
The subgroup toothlax = minor (Lockerungsgrad = Grad I) (Annotation 1) is shown with 39 incorrectly solved cases and 152 correctly solved cases; the general population contains 84 incorrectly and 694 correctly solved cases (Annotation 3). The rows in the table below the subgroup show the value distributions of the other attributes. Labels with a large "dark-gray" (green) sub-label, or a (red) horizontal bar that is close to the top, indicate "interesting" attribute values: the width/height, respectively, of these sub-bars indicates the improvement of the target share if the respective attribute value (selector) were added to the current subgroup, resulting in a virtual future subgroup. The horizontal line indicates the relative improvement of the target share: if the line is in the middle of the attribute value bar, then the target share in the future subgroup improves by 50%. In addition to the improvement of the target share of the current subgroup, the (potential) reduction of the subgroup size, shown by the width of an attribute value cell, also needs to be taken into account. In the example visualization the cell attachmentloss = strong (Attachmentloss = gravierend, 31-50%) is the best one considering its size, and also the target share (Annotation 2). Another interesting factor could be Position = 4 which, however, can be regarded as a random finding. In this visualization the user is able to inspect different subgroups directly by one click on the corresponding cells. All elements, i.e., subgroups, rules, and cases, can be browsed directly by one click, and changes can be assessed immediately. The changes are also intuitively reflected by the size of the bars (Annotation 3). Therefore, the user-guided integrated method provides direct interaction and instant feedback to the user.
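The quantities behind the bars can be reconstructed as follows. This is an assumed re-implementation of the displayed metric (absolute improvement of the target share plus the size of the virtual future subgroup), building on the toy case encoding used in the earlier sketch; it is not the tool's actual code.

```python
# For each candidate selector not yet in the current subgroup, estimate the
# target-share improvement and the size of the resulting future subgroup.
def candidate_improvements(cases, current_description, attributes):
    subgroup = [c for c in cases
                if all(c.get(a) == v for a, v in current_description.items())]
    if not subgroup:
        return []
    base_share = sum(c["incorrect"] for c in subgroup) / len(subgroup)
    results = []
    for attr in attributes:
        for value in {c.get(attr) for c in subgroup}:
            refined = [c for c in subgroup if c.get(attr) == value]
            share = sum(c["incorrect"] for c in refined) / len(refined)
            results.append({"selector": (attr, value),
                            "size": len(refined),
                            "improvement": share - base_share})
    # a large improvement and a large remaining subgroup are both desirable
    return sorted(results, key=lambda r: (-r["improvement"], -r["size"]))
```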
3.3 Discussion
In the past, various approaches for knowledge refinement were proposed, e.g. [1]. More recently, Knauf et al. [4] presented a refinement approach embedded in a complete validation methodology. Carbonara and Sleeman [2] describe an efficient method for selecting effective refinements, and Boswell and Craw [3] introduce a set of general refinement operators that are applicable in various application domains and that can be used within different problem-solving tasks. All these approaches are classified as automatic refinement techniques modifying rule-based knowledge. The modifications are motivated by a previous analysis step performing a blame allocation, i.e., identifying faulty knowledge. Then, alternative strategies are applied in order to automatically generate possible refinements of the knowledge base and select suitable ones. However, all automatic methods make the tweak assumption [2], which implies that the knowledge base is almost valid and only small improvements need to be performed. In our application scenario the validity of the knowledge base was quite poor (about 86% accuracy) and therefore no tweak assumption could be made. In contrast, we expected that important rules were missing and that we would have to acquire additional knowledge during the process. For this reason, we decided to choose a mixed refinement/elicitation process, which emphasizes the interactive analysis and modification of the implemented rules based on found subgroup patterns. Similarly, Carbonara and Sleeman [2] use an inductive approach for generating new rules using the available cases. Additionally, in our application we cannot expect that all cases contain the correct solution, and thus a thorough analysis of the cases within the process was also necessary. In contrast, automatic approaches mainly assume a correct case base.
4 Case Study
In this section, we introduce the applied medical domain and we present practical experiences with the described approach.

4.1 The Prosthetic System
The case study was implemented with a consultation and documentation system for dental findings regarding any kind of prosthetic appliance. The system was developed by the department of prosthodontics at the Würzburg University Hospital in cooperation with the department of computer science VI of the University of Würzburg. The domain specialists used the knowledge system D3 [9] to implement the knowledge base. The system aims to decide about a diagnostic plan, using the clinical findings, to accommodate the patient with a denture. In the first level the system proposes the teeth that could be conserved and the teeth that should be extracted. The cases always contain the standard findings acquired in the first consultation with the patient, and additional findings from x-ray examinations, e.g., abnormal x-ray findings (apical, periradicular), grade of tooth lax, endodontic state (root filling, pulp vitality), root quantity, root length, crown length, level of attachment loss, root caries, tooth angulation and elongation/extrusion. The cases are manually entered by the examiners using an interactive dialog. For a given tooth all findings are stored in a single case in a database. In the knowledge base each finding obtains a point score depending on its quality. The sum of the single scores is the dental score of the examined tooth. If the total dental score is less than or equal to 40 points, then the tooth should be conserved. If the dental score is greater than 40 points, then the tooth has to be extracted (EX). The system tries to support the dentist with time-efficient planning of the patients' dentures. Additionally, it should increase the efficiency of clinical work by chairside taking of findings that are immediately translated to a prosthodontic therapy decision. In the future, it is also envisioned to use the diagnostic decision tool as a knowledge-based system for dental student education, in order to train the ensuing diagnostic work-up. Then, students can learn recognition and interpretation of symptoms and clinical findings by comparing their diagnosed solutions with the derived solutions of the system.
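The scoring scheme itself is simple to state in code. In the sketch below only the 40-point threshold is taken from the text; the individual point scores are invented for illustration and do not correspond to the real knowledge base.

```python
# Minimal sketch of the additive scoring scheme with a 40-point extraction threshold.
EXAMPLE_SCORES = {
    ("attachmentloss", "very strong"): 45,   # hypothetical values
    ("attachmentloss", "strong"): 25,
    ("tooth lax", "medium"): 20,
    ("abnormal x-ray", "only apical"): 5,
}

def dental_score(findings, scores=EXAMPLE_SCORES):
    """Sum the point scores of all findings of one tooth."""
    return sum(scores.get((attr, value), 0) for attr, value in findings.items())

def therapy_decision(findings, threshold=40):
    """'EX' (extract) if the summed score exceeds the threshold, else conserve."""
    return "EX" if dental_score(findings) > threshold else "conserve"

print(therapy_decision({"attachmentloss": "strong", "tooth lax": "medium"}))
# -> 'EX' with the invented scores (25 + 20 = 45 > 40)
```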
4.2 Results
To assess the quality of the system we compared the results of the system with the solutions of a domain specialist, both using the same set of findings. The initial case base contained 802 cases. 24 cases were removed from the case base, because the corresponding teeth had been extracted for prosthodontic reasons during denture planning. Although these teeth had a better dental score, their extraction (EX) was decided, e.g., to prevent an irregular construction of the denture, which could cause problems in the future. Finally, the applied case base contained 778 cases corresponding to 778 examined teeth. We investigated the diagnosis corresponding to tooth extraction/non-extraction. Considering this diagnosis, the case base contained 108 false positive and 670 correct cases without any refinement of the knowledge base. First subgroup mining efforts turned up unexpected subgroups with a very high share of incorrectly solved cases. However, some subgroup descriptions were very difficult for the domain specialist to interpret, since they contained finding combinations
that should establish the diagnosis EX categorically. Therefore, the domain specialist provided immutable test knowledge represented as a set of synthetically generated test cases. Examples are shown in Table 1. The given subgroup descriptions indicate certain knowledge about when the diagnosis EX should be categorically established.

Table 1. Examples for immutable test knowledge

No.  Subgroup Description (findings)                    Diagnosis
1    tooth lax = medium ∧ attachmentloss = strong       EX
2    attachmentloss = very strong                       EX
3    tooth lax = strong ∧ root quantity = 3             EX
4    tooth lax = strong ∧ root caries = deep caries     EX
Using these subgroups the domain specialist was immediately able to locate incorrect cases, due to problems concerning data acquisition, i.e., noise in the cases. Either the cases contained a false solution or incorrect findings. In total, 19 cases were corrected: 16 contained false diagnoses, and 3 contained incorrect case descriptions (findings). Using the refined case base, further analysis by automatic subgroup mining turned up several subgroups which were assessed as dubious by the expert, e.g., a subgroup described by tooth lax = medium ∧ tooth position = 2 ∧ tooth quadrant = 2. However, these subgroups had a high share of incorrectly solved cases. Since the combined potential faulty factors did not indicate anything particular, the domain specialist checked the contained cases. It turned out that the false positives and false negatives had a high share of incorrect case descriptions and incorrectly assigned solutions. In total, a further 12 cases were fixed.

Table 2. Discovered subgroups indicating a knowledge base refinement

No.  Subgroup Description                                           Diagnosis  Points
1    abnormal x-ray = only apical                                   EX         10 → 5
2    tooth lax = medium ∧ root length = longer than crown length    EX         -20
3    tooth lax = minor ∧ attachmentloss = strong                    EX         -20
Using subgroup mining for the refinement of the knowledge system, we managed to improve the knowledge base by reducing the number of incorrectly solved cases to 54: the domain specialist assessed several subgroups mined by the system as significant, and these were then used for knowledge base refinement. We modified and added several rules; examples are given in Table 2. Subgroup description #1 is an example of a simple modification: for abnormal x-ray = only apical we modified the score, such that the rule contributes only 5 points. For the last two subgroup descriptions, the corresponding rules exemplify two general mechanisms: in rule #2 the condition root length = longer than crown length counts as negative for extraction, and relativizes the factor tooth lax = medium, which is positive for extraction. Such an interaction can also work the other way round, i.e., when a positive factor influences a negative one. Then, for extraction, we would have to add points, e.g., for tooth lax = medium and attachmentloss = minor. For subgroup description #3 the selectors tooth lax = minor
and attachmentloss = strong are both positive for extraction, but since they are assessed independently in the rule base, they should not be over-emphasized by being counted twice. Therefore, the score points of the corresponding rules were decreased. The system also discovered some subgroups that could not be interpreted by the experts afterwards and were thus ignored, e.g., tooth lax = medium ∧ root quantity = 1. In summary, we managed to reduce the number of incorrectly solved cases from 108 to 54, i.e., by 50%. In consequence, we increased the accuracy of the knowledge base from 86% to 93%. The method was very well accepted by the domain specialist, who was able to directly inspect and change the subgroups and cases himself.
5 Summary and Future Work
In this paper, we presented a novel method using subgroup mining for interactive knowledge refinement. We introduced the application of the mining method, and proposed a process model for the refinement task. Furthermore, we described the identification of potential faulty factors and discussed the applicable refinement operators. We also motivated how the user of the process is supported by visualization techniques that guide the interactive refinement process. In a case study using cases from a fielded medical system we demonstrated the application and the benefit of the proposed techniques. In the future, we plan to investigate further visualization techniques to support the user during the refinement process. Additionally, a semi-automatic refinement method that is adapted to the used knowledge representation (rules with point scores) is another interesting issue to consider.
References

1. Ginsberg, A.: Automatic Refinement of Expert System Knowledge Bases. Morgan Kaufmann (1988)
2. Carbonara, L., Sleeman, D.: Effective and Efficient Knowledge Base Refinement. Machine Learning 37 (1999) 143–181
3. Boswell, R., Craw, S.: Organizing Knowledge Refinement Operators. In: Validation and Verification of Knowledge Based Systems. Kluwer, Oslo, Norway (1999) 149–161
4. Knauf, R., Philippow, I., Jantke, K.P., Gonzalez, A., Salecker, D.: System Refinement in Practice – Using a Formal Method to Modify Real-Life Knowledge. In: Proceedings of the 15th International Florida Artificial Intelligence Research Society Conference (FLAIRS-2002), AAAI Press (2002)
5. Wrobel, S.: An Algorithm for Multi-Relational Discovery of Subgroups. In: Komorowski, J., Zytkow, J. (eds.): Proc. First European Symposion on Principles of Data Mining and Knowledge Discovery (PKDD-97), Springer Verlag, Berlin (1997) 78–87
6. Klösgen, W.: Subgroup Discovery. Chapter 16.3. In: Handbook of Data Mining and Knowledge Discovery. Oxford University Press, New York (2002)
7. Gamberger, D., Lavrac, N., Krstacic, G.: Active Subgroup Mining: A Case Study in Coronary Heart Disease Risk Group Detection. Artificial Intelligence in Medicine 28 (2003) 27–57
8. Gamberger, D., Lavrac, N.: Expert-Guided Subgroup Discovery: Methodology and Application. Journal of Artificial Intelligence Research 17 (2002) 501–527
9. Puppe, F.: Knowledge Reuse Among Diagnostic Problem-Solving Methods in the Shell-Kit D3. Intl. Journal of Human-Computer Studies 49 (1998) 627–649
Evidence Accumulation to Identify Discriminatory Signatures in Biomedical Spectra

A. Bamgbade¹,², R. Somorjai¹, B. Dolenko¹, E. Pranckeviciene¹, A. Nikulin¹, and R. Baumgartner¹

¹ Institute for Biodiagnostics, National Research Council Canada, 435 Ellice Avenue, Winnipeg, MB, Canada, R3B 1Y6
² Department of Computer Science, University of Manitoba, 545 Machray Hall, Winnipeg, MB, Canada, R3T 2N2
{Nike.Bamgbade, Ray.Somorjai, Brion.Dolenko, Erinija.Pranckevie, Alexander.Nikulin, Richard.Baumgartner}@nrc-cnrc.gc.ca
Abstract. Extraction of meaningful spectral signatures (sets of features) from high-dimensional biomedical datasets is an important stage of biomarker discovery. We present a novel feature extraction algorithm for supervised classification, based on the evidence accumulation framework, originally proposed by Fred and Jain for unsupervised clustering. By taking advantage of the randomness in genetic-algorithm-based feature extraction, we generate interpretable spectral signatures, which serve as hypotheses for corroboration by further research. As a benchmark, we used the state-of-the-art support vector machine classifier. Using external crossvalidation, we were able to obtain candidate biomarkers without sacrificing prediction accuracy.
1 Introduction

Biomedical data (e.g. spectra) are characterized by large numbers of features (e.g., spectral intensities), but by considerably fewer samples. However, neighboring spectral features are strongly correlated, hence highly redundant. It is therefore reasonable to assume that the data do not span the entire high-dimensional input space, but lie near some low-dimensional manifold, representable by contiguous spectral regions that form a spectral signature [1,2]. Obtaining a spectral signature (a panel of biomarkers) consisting of a small number of spectral regions is an important step in biomarker discovery and provides the domain expert with interpretable information (hypothesis), usable for further experimental corroboration. Fred and Jain [3] proposed an evidence accumulation framework (EAF) for unsupervised clustering. In this framework, results of several clusterings are combined to form a final data partition such that the most frequently clustered data points are assigned to the same cluster. A possible way to obtain multiple clusterings is to use clustering algorithms that depend on random initialization. Inspired by this EAF, we propose, in the supervised classification setting, a genetic-algorithm-driven feature extraction procedure for identifying discriminatory signatures in biomedical
spectra, in our case magnetic resonance (MR) spectra. We used external crossvalidation to avoid overoptimistic assessment of the feature extraction procedure due to selection bias [4,5]. Our gold standard is a state-of-the-art support vector machine (SVM) classifier [6].
2 Datasets and Methods
We studied two 2-class MR spectral datasets (Dataset 1 and Dataset 2). We used the amplitude spectra. The properties of the datasets, and their original partition into training and independent test sets, are presented in Table 1 below.

Table 1. Properties of the datasets

           Dimensionality   Training Samples          Test Samples
                            (Class 1)+(Class 2)       (Class 1)+(Class 2)
Dataset 1  1500             (62) + (62)               (42) + (31)
Dataset 2  1500             (70) + (70)               (105) + (59)
2.1 Genetic Algorithm

GA_ORS is a Genetic-Algorithm-based Optimal Region Selector, developed at the Institute for Biodiagnostics (IBD) [1]. In GA_ORS, the fitness function to be optimized for guiding the search for discriminatory spectral regions is the mean square error of the training set classification, using linear discriminant analysis (LDA) with internal leave-one-out (LOO) crossvalidation. The details of the parameters controlling the evolution of the genetic algorithm (GA), and its implementation, are given in [1]. In this experiment, three (3) spectral regions are requested, and the probabilities of crossover and mutation of GA_ORS are set at 0.66 and 0.001, respectively. The spectral signature found by the GA depends on the initial random seeds. We use this instability property of the GA in our evidence-accumulation-based feature extraction procedure. Note that any unstable feature selection/extraction procedure can be used in the evidence accumulation framework. For a discussion of various feature selection algorithms in high-dimensional gene microarray profiling, see for example Ref. [7].

2.2 Evidence-Accumulation-Based Feature Extraction

The experimental procedure is illustrated in the pseudo code in Figure 1. We used N = 100 different random seeds to carry out N feature extractions via the GA. We count the frequency of occurrence of each of the extracted features in the N runs. The more frequently a feature is identified, the more likely that it is important. The final spectral signature (group of features) is obtained at a chosen frequency fraction
threshold that we call the occurrence fraction (OF) of the feature groups. A 0.25 OF identifies features occurring with a frequency of 0.25N in N selections. We assessed feature occurrences at 0.25, 0.50 and 0.75 OFs. Figure 2 shows the spectral signature obtained from one of K random partitions of one of the datasets using a 0.25 OF. We also investigated the effect of varying the number of generations of the GA on the accuracy of the evidence-accumulation-based feature extraction.
Input:
  R  = Number of regions to be selected
  N  = Maximum number of feature extractions
  G  = Number of generations
  OF = Occurrence fraction

1. Generate K random splits of the dataset.
2. For each random split
   2.1. Do N times
        2.1.1. Select initial random seed r
        2.1.2. Initialize GA_ORS with seed r and G, then select R feature regions from the split
        2.1.3. Increment frequency count of selected features
3. Plot histogram of feature frequency counts
4. Select features with OF occurrence fraction
5. Compute predictive accuracy for the threshold-OF features, using external crossvalidation
Fig. 1. The evidence-accumulation-based feature extraction pseudo code
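The accumulation step (2.1–4 of the pseudo code) can be sketched as follows. GA_ORS itself is not publicly available, so a deliberately unstable stand-in selector (feature ranking on a random subsample) plays its role here; the counting and the occurrence-fraction threshold follow the pseudo code, but everything GA-specific is an assumption.

```python
# Evidence-accumulation feature extraction with a stand-in unstable selector.
import numpy as np

def unstable_selector(X, y, n_features, rng):
    """Assumed stand-in for GA_ORS: rank features by a class-mean difference
    computed on a random 70% subsample, so the result varies with the seed."""
    idx = rng.choice(len(y), size=int(0.7 * len(y)), replace=False)
    Xs, ys = X[idx], y[idx]
    score = np.abs(Xs[ys == 0].mean(axis=0) - Xs[ys == 1].mean(axis=0))
    return np.argsort(score)[-n_features:]

def accumulate_evidence(X, y, n_features=10, n_runs=100, of=0.25, seed=0):
    counts = np.zeros(X.shape[1])
    rng = np.random.default_rng(seed)
    for _ in range(n_runs):                       # N feature extractions
        counts[unstable_selector(X, y, n_features, rng)] += 1
    return np.where(counts >= of * n_runs)[0]     # final spectral signature
```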
2.3 Support Vector Machine Classifier

Support Vector Machines (SVMs) are currently considered state-of-the-art off-the-shelf classifiers. To deal with the overfitting problem, SVMs use L2-norm regularization during training. Typically, SVMs give sparse solutions, using only a small number of support vectors for classification. We applied the SVM to the full p-dimensional data. For the classification experiments, we used LIBSVM [6].

2.4 External Crossvalidation

We partition the dataset K times (K = 10) into training and test sets. The proportions of the samples in the training and test sets were identical to those given in Table 1 (stratification). For each split, we carried out the feature extraction procedure (or SVM) on the training set and then obtained the classification accuracy on the corresponding test set. Note that each particular test set was used only once, after the feature extraction procedure / SVM. We computed the average accuracy and standard deviation over the K splits.
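A hedged sketch of this external crossvalidation protocol is given below. scikit-learn's SVC (which wraps LIBSVM) stands in for the LIBSVM tools, and the split proportions are illustrative defaults rather than the fixed Table 1 partitions.

```python
# External crossvalidation: K stratified splits, the model is fitted on the
# training part only, and each test part is used exactly once.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import SVC

def external_cv_svm(X, y, n_splits=10, test_size=0.4, seed=0):
    splitter = StratifiedShuffleSplit(n_splits=n_splits, test_size=test_size,
                                      random_state=seed)
    errors = []
    for train_idx, test_idx in splitter.split(X, y):
        clf = SVC(kernel="linear").fit(X[train_idx], y[train_idx])
        errors.append(1.0 - clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(errors)), float(np.std(errors))
```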
3 Results

The results of the external crossvalidation, using the evidence accumulation framework for Dataset 1 and Dataset 2, are summarized in Table 2. They show that the misclassification error increases for features with higher OFs. We also observed that from 10 to 20 generations of GA_ORS, the misclassification error provided by the extracted features decreased. However, as the number of generations increased from 30 to 100, the misclassification error increased. This observation suggests that as the number of generations increases, the feature selection procedure starts to adapt to the peculiarities of the training set used in the feature selection, and that overfitting commences beyond 30 generations.

Table 2. External crossvalidation for Dataset 1 and Dataset 2 with the evidence-accumulation-based feature extraction method

No. of        Occurrence   Average Misclassification Error
Generations   Fraction     Dataset 1     Dataset 2
20            0.25         0.06          0.14
              0.50         0.08          0.15
              0.75         0.13          0.16
40            0.25         0.08          0.15
              0.50         0.11          0.15
              0.75         0.15          0.15
60            0.25         0.09          0.15
              0.50         0.10          0.15
              0.75         0.19          0.15
100           0.25         0.10          0.15
              0.50         0.08          0.15
              0.75         0.13          0.15

Fig. 2. Spectral signature (shaded regions) with the centroids (means) of Class 1 and Class 2 for Dataset 1 from one of the random splits, using a 0.25 OF threshold

The results (average misclassification error over the 10 splits ± standard deviation) of using the SVM benchmark, evaluated by external crossvalidation, were 0.053 ± 0.028 and 0.176 ± 0.326 for Dataset 1 and Dataset 2, respectively. Comparing the classification accuracy of the SVM with that of the evidence-accumulation feature extraction indicates that the latter was able to obtain interpretable candidate biomarkers without sacrificing classification accuracy.
4 Conclusions

We proposed a novel feature extraction method based on an evidence accumulation framework. We took advantage of the inherent randomness of the genetic-algorithm-based feature extraction to generate multiple feature subsets. We used MR spectra, but our method is directly applicable to other spectral modalities as well. We benchmarked our algorithm with the state-of-the-art support vector machine classifier. We were able to find discriminatory spectral signatures, or panels of biomarkers, with high predictive capability.

Acknowledgements. We acknowledge partial support of this work by NSERC. The first author acknowledges the support of a University of Manitoba graduate fellowship.
References

1. Nikulin, A., Dolenko, B., Bezabeh, T., Somorjai, R.: Near-optimal Region Selection for Feature Space Reduction: Novel Preprocessing Methods for Classifying MR Spectra. NMR Biomed. 11 (1998) 209–216
2. Pranckeviciene, E., Baumgartner, R., Somorjai, R.: Consensus-based Identification of Spectral Signatures for Classification of High-dimensional Biomedical Spectra. Proc. ICPR Conf. (2004) 319–322
3. Fred, A., Jain, A.K.: Data Clustering Using Evidence Accumulation. Proc. ICPR Conf. (2002) 276–280
4. Ambroise, C., McLachlan, G.: Selection Bias in Gene Extraction on the Basis of Microarray Gene-expression Data. Proc. Natl. Acad. Sci. USA 99 (2002) 6562–6566
5. Simon, R., Radmacher, M., Dobbin, K., McShane, L.M.: Pitfalls in the Use of DNA Microarray Data for Diagnostic and Prognostic Classification. J. Natl. Cancer Inst. 95 (2003) 14–18
6. Chang, C., Lin, C.J.: LIBSVM: A Library for Support Vector Machines, 2001. http://www.csie.ntu.edu.tw/~cjlin/libsvm
7. Inza, I., Larrañaga, P., Blanco, R., Cerrolaza, A.J.: Filter versus Wrapper Gene Selection Approaches in DNA Microarray Domains. Artif. Intell. Med. 31 (2004) 91–102
On Understanding and Assessing Feature Selection Bias

Šarunas Raudys¹, Richard Baumgartner², and Ray Somorjai²

¹ Vilnius Gediminas Technical University, Sauletekio 11, Vilnius LT-2100, Lithuania
  [email protected]
² Institute for Biodiagnostics, National Research Council Canada, 435 Ellice Avenue, Winnipeg, MB, Canada R3B 1Y6
  {Richard.Baumgartner, Ray.Somorjai}@nrc-cnrc.gc.ca
Abstract. Feature selection in high-dimensional biomedical data, such as gene expression arrays or biomedical spectra, constitutes an important step towards biomarker discovery. Controlling feature selection bias is considered a major issue for a realistic assessment of the feature selection process. We propose a theoretical, probabilistic framework for the analysis of selection bias. In particular, we derive the means of calculating the true selection error when the performance estimates of the feature subsets are mutually dependent and the distribution density of the true error is arbitrary. We demonstrate in an extensive series of experiments the utility of the theoretical derivations with real-world datasets. We discuss the importance of understanding feature selection bias for the small sample size (n) / high dimensionality (p) situation, typical for biomedical data (genomics, proteomics, spectroscopy). Keywords: Machine learning, pattern recognition, feature selection bias, biomedical data.
1 Introduction

Feature selection in high-dimensional biomedical data is an important step towards biomarker discovery for disease profiling. The clinician is provided with a practical hypothesis, i.e. a panel of biomarkers (particular genes or spectral peaks) that can be validated by further biomedical tests. Ideally, three datasets are required for a feature selection procedure: a training set, a validation set and an independent test set. The training set is used to estimate the parameters of a classification rule. The validation set aids in evaluating the performance of the classifier obtained, and the selection of the best classifier. The independent test set should be used only after the best features and the optimal classifier have been identified. Biomedical data (microarrays from genomics, mass spectra from proteomics, magnetic resonance, Raman and infrared spectra of tissues and biofluids), however, are characterized by very few samples, n = O(10)–O(100), and numerous features, p = O(1000)–O(10000). As a consequence, we are confronted with the twin curses of dimensionality and dataset sparsity [3]. Specific two-class examples include gene microarray data [7] (n/p = 72/7129), mass spectra for proteomics [12] (n/p = 100/15154), and MR spectra for biofluids [13] (n/p = 78/3500).
When training the classifier, we inevitably adapt to the training set. The degree of adaptation depends on the size of this set, and on the complexity (number of features and number and type of classifier parameters) of the classification algorithm. When selecting the feature subset, we adapt to the validation set. An objective of the present paper is to evaluate how this adaptation occurs when imprecise criteria are used to compare the feature subsets or classification methods during feature selection. Known as the (model) feature selection bias (FSB) problem [1,4,5], its control is a major issue for any feature selection procedure. We propose a theoretical analysis of FSB when the performance estimates of the feature subsets are mutually dependent and the distribution density of true error rates in a pool of possible feature subsets is arbitrary. Our purpose is to provide theoretical scaffolding, and to assess its validity and limitations by computer simulation on artificial and real-life biomedical data.
2 Theoretical Background

We assume m different models (e.g., feature subsets). For each of the models, one obtains validation error estimates ε̂_val1, ε̂_val2, …, ε̂_valm. The selection of the best model is based on the minimal error estimate. We call the minimal value, ε̂_apparentS = min_j(ε̂_valj) = ε̂_valz, j = 1, …, m, the apparent selection error. An important aspect of the subsequent analysis is the assumption that there exist true error rates ε_true1, ε_true2, ε_true3, …, ε_truem. Although these are unknown, assuming their existence helps understand the feature selection problem and its analysis. For the analysis of adaptation to the validation set, we also need to define the true selection error. Given the estimates ε̂_val1, ε̂_val2, …, ε̂_valm, we consider the error of the zth feature subset (best according to the above estimates) as the true selection error, denoted by ε_trueS = ε_truez. Assuming further that the m feature subsets were selected randomly, and both sets of estimates ε̂_val1, ε̂_val2, …, ε̂_valm and ε_true1, ε_true2, ε_true3, …, ε_truem are independent random variables, one can obtain theoretically the expected values of ε̂_apparentS and ε_trueS. Raudys [9] considered the case when the vectors to be classified derive from spherical Gaussian distributions. He showed that the increase in generalization error due to suboptimal feature selection is larger than that due to imperfect training of the classifier. Using the statistics of extremes [14], a method was developed [10] to analyze the apparent and true errors in model (feature) selection when considering m randomly selected models. It was assumed that (ε_true, ε̂_val) is a bivariate random vector with known (joint) distribution density function, f3(ε̂, ε) = f2(ε̂|ε) f1(ε). The analysis assumed either Binomial distributions for both f1(ε) and f2(ε̂|ε) [6], or Generalized Beta (Pearson type I) for f1(ε) and Binomial for f2(ε̂|ε) [8], where f1(ε) is the probability density function of the true error ε and f2(ε̂|ε) is the conditional probability density function of the validation error ε̂, given the true error ε. It was shown that for the given distributions f1(ε) and f2(ε̂|ε), the expectation E(ε_trueS) first decreases with increasing m, the number of feature subsets (types of classifier models) considered. However, the decrease slows down, and later stops altogether.
One of our objectives is to extend the analysis to more realistic situations, for which the generalization error increases when m is sufficiently large. Here, we consider the situation when the estimates ε̂_val1, ε̂_val2, …, ε̂_valm may be mutually dependent, and the distribution density f1(ε) is arbitrary. To formalize the feature selection process, we follow the framework outlined in Refs. [8,9]. We assume that we are evaluating m competing models (feature subsets), using a validation set. We are particularly interested in the difference between the expectations of ε_trueS and ε̂_apparentS; this can be interpreted as a measure of the selection bias of the feature selection process. We assume that the validation error ε̂ is in the interval [0,1]. We model the true, validation, apparent selection, and true selection errors (ε, ε̂, ε̂_apparentS, ε_trueS) as random variables. If the estimates ε̂_val1, ε̂_val2, …, ε̂_valm are obtained using a single validation set, they are statistically dependent, and the standard theory of extremes cannot be used. To introduce the dependency between the m validation error estimates, we assume that each estimate is obtained as a weighted sum of two binomially distributed random variables, ε̂_validj^common and ε̂_validj^unique. We assume that ε̂_validj^common is obtained while classifying a set of n_valid1 validation vectors that are common to all m feature subsets. It has a unique binomial distribution. ε̂_validj^unique is assumed to be unique to each subset, and it is also distributed binomially. With these assumptions, the expectations E(ε̂_apparentS) and E(ε_trueS) for the new generalized model can be found. E(ε̂_apparentS) and E(ε_trueS) can be calculated numerically, provided the parameters of the distributions f2(ε̂|ε) and f1(ε) are known a priori. This result is the main theoretical contribution of the paper.
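The qualitative effect can also be checked by simulation. The sketch below is a Monte Carlo illustration rather than the paper's numerical calculation: the true errors are drawn from an assumed Beta density, the common and unique binomial components are weighted 50/50, and the dependence induced by the shared validation set is only crudely mimicked by a shift applied to all common-set estimates in a trial.

```python
# Monte Carlo illustration of apparent vs. true selection error (all
# distributional choices are assumptions made for this example).
import numpy as np

def selection_bias(m=100, n_common=50, n_unique=50, trials=2000, seed=0):
    rng = np.random.default_rng(seed)
    apparent, true_selected = [], []
    for _ in range(trials):
        true_err = rng.beta(2, 8, size=m)                  # f1: true error rates
        shift = rng.normal(0.0, 0.03)                      # crude shared-set effect
        p_common = np.clip(true_err + shift, 0.0, 1.0)
        common = rng.binomial(n_common, p_common) / n_common
        unique = rng.binomial(n_unique, true_err) / n_unique
        val_err = 0.5 * common + 0.5 * unique              # dependent estimates
        z = np.argmin(val_err)                             # best-looking subset
        apparent.append(val_err[z])                        # apparent selection error
        true_selected.append(true_err[z])                  # true selection error
    return float(np.mean(apparent)), float(np.mean(true_selected))

print(selection_bias())  # the apparent error is optimistically biased vs. the true one
```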
3 Feature Selection Bias: Numerical Analysis

For a rough practical assessment of the accuracy of the above theoretical analysis of feature selection bias, we performed experiments with simulated data, and with three real-world biomedical datasets [7,11] for which the dimensionality of the feature space exceeds the number of samples a hundred-fold. Due to space limitations we present here only results for the leukemia dataset [7]. For training, we used 35 samples, and the remaining 37 vectors constituted the test set. Then 500,000 8-dimensional (8d) feature subsets were generated randomly, and Fisher linear classifiers were trained and subsequently validated on the 8d subsets. The training errors (resubstitution estimates) were used for feature selection. Since only eighteen 8d vectors per class were available in the training set, the resubstitution error estimate of the Fisher classifier is very optimistically biased. To reduce this bias, we used a surrogate error estimate for each error value ε̂_resubstitutionS in the graph to interpolate and evaluate the Mahalanobis distance. This correction for resubstitution error estimation was developed in [2]. Fig. 1a displays the histogram that characterizes the distribution of the test errors. Graph N1 (dots) in Fig. 1b shows the apparent mean feature selection error ε̂_apparentS as a function of m. The mean, ε̂_apparentS, was calculated as the average of the validation errors for each value of m, using a combinatorial algorithm developed in Refs. [2,10]. Graph N2 (bold) shows the mean of the "true" selection error ε̂_resubstitutionS (in this case, the average of the resubstitution errors for the C(500000, m), i.e., 500000-choose-m, subsets). The upper graph N3 (dashed line) in Fig. 1b shows the reduced model selection bias for the true error. The minima in graphs N2 and N3 are approximately at the same location. However, due to the random character of the training and test datasets, the random bias between the two graphs is, in principle, unavoidable. Similar results were obtained using two other real-world biomedical examples (classification of pathogenic fungi [11]).
Fig. 1. a) Distribution of the validation error ε̂_validation in the experiments with the 7129-dimensional leukemia data. b) Means of the apparent selection error (N1), true selection error (N2) and corrected selection error (N3) for the leukemia dataset
4 Concluding Remarks

We presented a theoretical analysis of feature selection bias when the performance estimates of the feature subsets are mutually dependent, and the distribution density of the true error rates, in a pool of possible feature subsets, is arbitrary. Despite the fact that the true performance of a model (e.g., a particular feature subset) is unknown, and has to be evaluated by some "surrogate" estimate, the new approach readily helps appreciate important aspects of feature selection bias, and allows us to quantify the selection bias problem more objectively. This is particularly important in the small sample size / high feature space dimension situation, for which the selection bias may be considerable. Based on our results, for a given classification problem, e.g. for classification of high-dimensional biomedical data such as gene microarrays or biomedical spectra, we recommend the following three-step procedure to assess feature selection bias: First, estimate the probability density function of ε̂ using m feature subsets. Then compute the ε̂_valj. Finally, calculate the expectation of the true selection error E(ε_trueS).
Theory, numerical study and the experiments carried out with simulated as well as real-world data show that: 1) feature selection bias is very large if the sample size used to evaluate feature subsets is insufficient, and 2) the theoretical analysis presented in Section 2 can be used for a rough estimation of feature selection bias. The assumptions made in our theoretical analysis, i.e., partitioning the validation error into common and unique components, are somewhat restrictive. The development of more general models and the constructive estimation of the parameters of the distributions f1(ε) and f2(ε̂|ε) is a topic for further research.

Acknowledgement. The first author was supported by NATO Expert Visit Grant SST.EAP.EV 980950.
References

1. Raudys, S.J., Jain, A.K.: Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners. IEEE Transactions on Pattern Analysis and Machine Intelligence 13(3) (1991) 242–254
2. Raudys, S.: Statistical and Neural Classifiers – An Integrated Approach to Design. Springer-Verlag London Ltd. (2001)
3. Somorjai, R.L., Dolenko, B., Baumgartner, R.: Class Prediction and Discovery Using Gene Microarray and Proteomics Mass Spectroscopy Data: Curses, Caveats, Cautions. Bioinformatics 19(12) (2003) 1484–1491
4. Estes, S.E.: Measurement Selection for Linear Discriminants Used in Pattern Classification. PhD Thesis, Stanford University (1965)
5. Ambroise, C., McLachlan, G.J.: Selection Bias in Gene Extraction on the Basis of Microarray Gene-expression Data. Proc. Natl. Acad. Sci. USA 99(10) (2002) 6562–6566
6. Raudys, S.: Influence of Sample Size on the Accuracy of Model Selection in Pattern Recognition. In: Raudys, S. (ed.): Statistical Problems of Control 50, Institute of Mathematics and Informatics, Vilnius (1981) 9–30 (in Russian)
7. Golub, T.R., et al.: Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science 286 (1999) 531–537
8. Raudys, S., Pikelis, V.: Collective Selection of the Best Version of a Pattern Recognition System. Pattern Recognition Letters 1(1) (1982) 7–13
9. Raudys, S.: Classification Errors When Features Are Selected. In: Raudys, S. (ed.): Statistical Problems of Control 38, Institute of Mathematics and Informatics, Vilnius (1979) 9–26 (in Russian)
10. Pikelis, V.: Calculating Statistical Characteristics of the Experimental Process for Selecting the Best Version. In: Raudys, S. (ed.): Statistical Problems of Control 93, Institute of Mathematics and Informatics, Vilnius (1991) 46–56 (in Russian)
11. Himmelreich, U., et al.: Rapid Identification of Candida Species by Using Nuclear Magnetic Resonance Spectroscopy and a Statistical Classification Strategy. Appl. Environ. Microbiol. 69(8) (2003) 4566–4574
12. Petricoin, E., et al.: Use of Proteomics Patterns in Serum to Identify Ovarian Cancer. Lancet 359 (2002) 572–577
13. Lean, C., et al.: Accurate Diagnosis and Prognosis of Human Cancers by Proton MRS and a Three-stage Classification Strategy. Annual Reports on NMR Spectroscopy 48 (2002) 71–111
14. Gumbel, E.: Statistics of Extremes. Dover Publications (2004)
A Model-Based Approach to Visualizing Classification Decisions for Patient Diagnosis

Keith Marsolo¹, Srinivasan Parthasarathy¹, Michael Twa², and Mark Bullimore²

¹ Department of Computer Science and Engineering
² College of Optometry, The Ohio State University, Columbus, OH, USA
[email protected]
Abstract. Automated classification systems are often used for patient diagnosis. In many cases, the rationale behind a decision is as important as the decision itself. Here we detail a method of visualizing the criteria used by a decision tree classifier to provide support for clinicians interested in diagnosing corneal disease. We leverage properties of our data transformation to create surfaces highlighting the details deemed important in classification. Preliminary results indicate that the features illustrated by our visualization method are indeed the criteria that often lead to a correct diagnosis and that our system also seems to find favor with practicing clinicians.
1 Introduction
The field of medicine has benefited greatly from advances in computing. Medical imaging, genetic screening and large-scale clinical trials to determine drug efficacy are just three areas where increased computational power is fueling massive growth in the testing and collection of patient data. With the ability to collect and store large amounts of patient data comes the desire to discover the underlying relationships within it. The fields of data mining and bioinformatics have arisen to handle this task. When applied to medical data, traditional data mining applications like classification and clustering can yield commonalities among patients that might be an indicator of a given condition. If such commonalities are found, it is possible to create a type of "signature," where patients who are found to have that signature could be diagnosed with, or classified as potentially having, that condition. Ideally, one would like a system that can provide more information than a simple yes or no decision. The criteria used to reach a decision can be just as
This work was partially supported by the following grants: NIH-NEI T32-EY13359, NIH-NEI K23-EY16225 and American Optometric Foundation William Ezell Fellowship, Ocular Sciences Inc. (MT); NSF Career Grant IIS-0347662 (KM, SP); NIH-NEI R01-EY12952 (MB).
important as the decision itself. In some cases, the basis for a diagnosis may be as simple as the presence of a particular antibody in the blood. In diseases that are manifested by the existence or absence of certain physical characteristics, the decision might not be so clear-cut. For instance, keratoconus, a disease affecting the shape of the cornea, is often characterized by a large cone-shaped protrusion rising from the corneal surface. How irregular the cornea must be before the diagnosis of keratoconus is made is arguable. Having a way to visualize the criteria used in classification could provide a great benefit to clinicians, providing a safety check that could limit some of the liability that might arise from relying purely on a single answer from a computer. It is in that spirit that we present our work on developing a system for classifying corneal shape, using decision trees to distinguish between several classes of corneas while providing a visualization of the classification criteria to aid in clinical decision support.

Decision tree classifiers are widely used in many applications. The principle behind a decision tree is to select an attribute at each node and divide the feature space into increasingly fine-grained regions. The class label of an object is based on the final region in which it lies. One benefit of decision trees is that they are transparent, meaning the user can inspect the actual tree produced by the classifier to examine the features used in classification. While accuracy is an important factor in choosing a classification strategy, another attribute that should not be ignored is the interpretability of the final results. Decision trees may be favored over (possibly) more accurate "black-box" classifiers because they provide more understandable results. In medical image interpretation, decision support for a domain expert is preferable to an automated classification made by an expert system. While it is important for a system to provide a decision, it is often equally important for clinicians to know the basis for an assignment.

Given a dataset containing corneas labeled as diseased, post-operative or non-diseased, we use Zernike polynomials to model the corneal surface. We then use the polynomial coefficients for each geometric mode as an input feature for a decision tree. Part of the rationale behind using Zernike polynomials as a transformation method over other alternatives is that there is a direct correlation between the geometric modes of the polynomials and the surface features of the cornea. Zernike polynomials have been well studied by the medical community, and vision scientists and clinicians are familiar with their use [1]. An added benefit of Zernike polynomials is that the orthogonality of the series allows us to treat each term independently. Since the polynomial coefficients used as splitting attributes represent the proportional contributions of specific geometric modes, we are able to create a surface representation that reflects the spatial features deemed "important" in classification. These features discriminate between the patient classes and give an indication as to the specific reasons for a decision. We have designed and developed a method of visualizing our decision tree results. We partition our dataset based on the path taken through the decision tree and create a surface from the mean values of the polynomial coefficients of all the patients falling into that partition. For each patient, we create an individual
polynomial surface from the patient’s coefficient values that correspond to the splitting attributes of the decision tree. We contrast this surface against a similar surface containing the mean partition values for the same splitting coefficients and provide a measure to quantify how “close” a patient lies to the mean. We have shown our results to practicing clinicians and preliminary validation from them indicates there is an anatomic correspondence between the features of these surfaces and corneal aberrations of patients with keratoconus.
2 Biological Background
Image formation begins when light entering the eye is focused by the cornea, passes through the pupil, and is further focused by the lens, forming an image on the retina at the back of the eye. Since light must pass through the cornea, clear vision depends very much on the optical quality of the corneal surface, which is responsible for nearly 75% of the total optical power of the eye. The normal cornea has an aspheric profile that is more steeply curved in the center relative to the periphery. Subtle distortions in the shape of the cornea can have a dramatic effect on vision. For this reason, the shape of the corneal surface is measured clinically for the treatment and diagnosis of eye disease. The process of mapping the surface features of the cornea is known as corneal topography.

2.1 Clinical Use of Corneal Topography
The use of corneal topography has rapidly increased in recent years because of the popularity of refractive surgery and the decreased cost of powerful personal computers. Refractive surgery is an elective surgical treatment intended to reduce one's dependence on glasses or contact lenses. The most common surgical treatment performed for this purpose is laser assisted in-situ keratomileusis (LASIK). Corneal topography is used to screen patients for corneal disease prior to this surgery and to monitor the effects of treatment after surgery. Another clinical application of corneal topography is the diagnosis and management of keratoconus. Keratoconus, a progressive, non-inflammatory corneal disease, distorts corneal shape and results in poor vision that cannot be corrected with ordinary glasses or contact lenses. Patients with keratoconus frequently seek refractive surgery due to their poor vision. However, such treatment exacerbates their corneal distortion and frequently leads to corneal transplant. Thus, corneal topography is a valuable tool for diagnosis and management of keratoconus as well as for the prevention of inappropriate refractive surgery in this patient group.

2.2 Determination of Corneal Shape
The most common method of determining corneal shape is to record an image of the reflection of a series of concentric rings from the corneal surface. Any distortion in corneal shape will cause a distortion of the concentric rings. By comparing the size and shape of the imaged rings with their known dimensions, it is possible to mathematically derive the topography of the corneal surface.
Fig. 1. Example of characteristic corneal shapes for each of the three patient groups. From left to right: Keratoconus, Normal, and post-LASIK. The top image shows a picture of the cornea and reflected concentric rings. The bottom image shows the topographical map representing corneal shape
Figure 1 shows an example of the output produced by a corneal topographer for a member of each patient class. These represent extreme examples of each class, i.e., they are designed to clearly illustrate the differences between the classes. The top portion of the figure shows the imaged concentric rings. The bottom portion of the image shows a false color map representing the surface curvature of the cornea. This color map is intended to aid clinicians and the appearance of each map is largely instrument-dependent. Many methods of interpretation exist, but a lack of accepted standards among domain experts and instrument manufacturers usually results in qualitative pattern recognition from the false color maps. The data files from the corneal topographer consist of a three-dimensional matrix of approximately 7000 spatial coordinates arrayed in a polar grid. The height of each point z is specified by the relation z = f(ρ, θ), where the height relative to the corneal apex is a function of radial distance from the origin (ρ) and the counter-clockwise angular deviation from the horizontal meridian (θ). The inner and outer borders of each concentric ring consist of a discrete set of 256 data points taken at a known angle θ, but a variable distance ρ, from the origin. Using the data obtained from a corneal topographer, we can construct a surface to represent the shape of the cornea. It is not possible, however, to use this surface directly in classification. Each surface is represented by thousands of data points, and trying to use these points as a feature vector in classification will result in poor performance. Alternatively, using the raw data results in a classifier that is too granular and therefore too specific to provide useful clinical decision support. We transform the raw data to address this problem.
3 Zernike Polynomials
Zernike polynomials are a family of circular polynomial functions that are orthogonal in x and y over a normalized unit circle. Each polynomial term consists of three elements [1]. The first element is a normalization coefficient. The second element is a radial polynomial component and the third, a sinusoidal angular component. The general form for Zernike polynomials is given by:
Z_n^±m(ρ, θ) = √(2(n+1)) R_n^m(ρ) cos(mθ)    for m > 0
Z_n^±m(ρ, θ) = √(2(n+1)) R_n^m(ρ) sin(|m|θ)   for m < 0
Z_n^±m(ρ, θ) = √(n+1) R_n^m(ρ)                for m = 0        (1)

Fig. 2. (a) Combinations of radial polynomial order (n) and azimuthal frequency (m) that yield the 15 terms comprising a 4th order Zernike polynomial (the single-index labels j are shown in the body of the table). (b) Graphical representation of the modes

Order (n) \ Frequency (m)   -4   -3   -2   -1    0    1    2    3    4
0                                                 0
1                                       1              2
2                                  3          4              5
3                             6         7              8          9
4                           10        11         12        13         14
where n is the radial polynomial order and m represents azimuthal frequency. The normalization coefficient is given by the square root term preceding the radial and azimuthal components. The radial component of the Zernike polynomial is defined as:
R_n^m(ρ) = Σ_{s=0}^{(n−|m|)/2} [ (−1)^s (n−s)! / ( s! ((n+|m|)/2 − s)! ((n−|m|)/2 − s)! ) ] ρ^(n−2s)        (2)
Please note that the value of n is a positive integer or zero. Only certain combinations of n and m will yield valid polynomials. Any combination that is not valid simply results in a radial component of zero. The table shown in Fig. 2 (a) lists the combinations of n and m that generate the 15 terms comprising a 4th order Zernike polynomial. Rather than refer to each term with the double-index scheme using n and m, each term is labeled with a number j that is consistent with the single indexing scheme created by the Optical Society of America [1]. Polynomials that result from fitting our raw data with these functions are a collection of orthogonal circular geometric modes. Examples of the modes representing a 4th order Zernike polynomial can be seen in Fig. 2 (b). The label underneath each mode corresponds to the jth term listed in the table of Fig. 2 (a). The coefficients of each mode are proportional to their contribution to the overall topography of the original data. As a result, we can effectively reduce the dimensionality of the data to a subset of polynomial coefficients that describe spatial features, or a proportion of specific geometric modes present in the original data.
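Equations (1) and (2) translate directly into code. The sketch below evaluates a single Zernike term and then fits measured surface heights to the 15 terms of a 4th order expansion with an ordinary least-squares solve; it is a minimal illustration assuming the OSA normalization stated above, not the authors' implementation (which follows Iskander et al. [2]).

```python
# Evaluate Zernike terms (Eqs. (1)-(2)) and fit a 4th order expansion.
import math
import numpy as np

def radial(n, m, rho):
    """Radial component R_n^m(rho), Eq. (2)."""
    m = abs(m)
    total = np.zeros_like(rho, dtype=float)
    for s in range((n - m) // 2 + 1):
        coeff = ((-1) ** s * math.factorial(n - s)
                 / (math.factorial(s)
                    * math.factorial((n + m) // 2 - s)
                    * math.factorial((n - m) // 2 - s)))
        total += coeff * rho ** (n - 2 * s)
    return total

def zernike(n, m, rho, theta):
    """Zernike term Z_n^m(rho, theta), Eq. (1), with OSA normalization."""
    if m > 0:
        return math.sqrt(2 * (n + 1)) * radial(n, m, rho) * np.cos(m * theta)
    if m < 0:
        return math.sqrt(2 * (n + 1)) * radial(n, m, rho) * np.sin(abs(m) * theta)
    return math.sqrt(n + 1) * radial(n, 0, rho)

# The 15 (n, m) modes of a 4th order expansion, ordered as in Fig. 2 (a).
MODES = [(n, m) for n in range(5) for m in range(-n, n + 1, 2)]

def fit_coefficients(rho, theta, z):
    """Least-squares fit of surface heights z measured at (rho, theta)."""
    design = np.column_stack([zernike(n, m, rho, theta) for n, m in MODES])
    coeffs, *_ = np.linalg.lstsq(design, z, rcond=None)
    return coeffs
```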
4 Classification of Corneal Topography
Iskander et al. examined a method of modeling corneal shape using Zernike polynomials with a goal of determining the optimal number of coefficients to use in constructing the model [2]. In a follow-up study, the same group concluded that a 4th order Zernike polynomial was adequate in modeling the majority of corneal surfaces [3]. Twa et al. showed that with Zernike polynomials, lower order transformations were able to capture the general shape of the cornea and were unaffected by noise [4]. They also found that higher order transformations were able to provide a better model fit, but were also more susceptible to noise. A number of studies on the classification of corneal shape have been produced that use statistical summaries to represent the data [5, 6]. These studies used multi-layer perceptrons or linear discriminant functions in classification. Smolek and Klyce have classified normal and post-operative corneal shapes using wavelets and neural networks with impressive results, but use very few data in their experiments [7, 8]. Twa et al. were the first to use decision trees for corneal classification, reporting classification accuracy around 85% with standard C4.5 and in the mid-90% range with C4.5 and various meta-learning techniques [4]. In our experiments, we achieve roughly the same classification accuracy even though we use a more challenging, multi-class dataset.
5 Visualization of Classification Decisions
Our visualization surfaces are calculated by transforming the patient data using an algorithm detailed by Iskander et al. [2]. It involves calculating a Zernike polynomial for every data point and then computing a least-squares fit between the polynomials and the actual surface data. This results in a set of coefficients which we then use for classification. Figure 3 shows an example decision tree.

Fig. 3. Decision tree for a 4th order Zernike polynomial

In this case, the Z value in each node (diamond) refers to the jth term of a Zernike polynomial. Classification occurs by comparing a patient's Zernike coefficient values against those in the tree. For instance, the value of Z7 for a patient is compared against the value listed in the top node. If the patient's value is less than or equal to 2.88, we traverse the right branch. If not, we take the left (in all cases, we take the right branch with a yes decision and the left with a no). With this example, a no decision would lead to a classification of K, or Keratoconus. If the
A Model-Based Approach to Visualizing Classification Decisions
479
value of Z7 was less than or equal to 2.88, we would proceed to the next node and continue the process until we reached a leaf and the corresponding classification label (K = Keratoconus, N = Normal, L = LASIK).
Fig. 4. Example surfaces for the strongest rule in each patient class: (a) represents a Keratoconic eye (Rule 1), (b) a Normal cornea (Rule 8), and (c) a post-operative LASIK eye (Rule 4). The top panel contains the Zernike representation of the patient. The next panel illustrates the Zernike representation using the rule values. The third and fourth panels show the rule surfaces, using the rule and patient coefficients, respectively. The bottom panel consists of a bar chart showing the deviation between the patient's rule surface coefficients and the rule values.
We can treat each possible path through a decision tree as a rule for classifying an object. Using the tree in Fig. 3, we construct rules for the first four leaves, with the leaf number (arbitrarily assigned) listed below the class label:

1. Z7 > 2.88 → Keratoconus
2. Z7 ≤ 2.88 ∧ Z12 > −4.04 ∧ Z5 > 9.34 → Keratoconus
3. Z7 ≤ 2.88 ∧ Z12 ≤ −4.04 ∧ Z0 ≤ −401.83 → Keratoconus
4. Z7 ≤ 2.88 ∧ Z12 ≤ −4.04 ∧ Z0 > −401.83 → LASIK
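Written as code, these four paths form a simple cascade. The sketch below is our hypothetical rendering (the function and the z mapping are ours, not the authors' implementation); z[j] is the patient's coefficient for the jth Zernike term:

```python
# Sketch: the first four leaves of the decision tree in Fig. 3 as an
# explicit rule cascade over a patient's Zernike coefficients.

def classify_first_four(z: dict[int, float]) -> tuple[int, str]:
    """Return (rule number, class label) for the first matching rule."""
    if z[7] > 2.88:
        return 1, "Keratoconus"
    if z[12] > -4.04:
        if z[5] > 9.34:
            return 2, "Keratoconus"
        raise ValueError("patient falls under a later leaf of the full tree")
    if z[0] <= -401.83:
        return 3, "Keratoconus"
    return 4, "LASIK"
```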
In each rule, the "∧" symbol is the standard AND operator and the "→" sign is an implication stating that if the values of the coefficients for a patient satisfy the expression on the left, the patient will be assigned the label on the right. For a given dataset, a certain number of patients will be classified by each rule. These patients will share similar surface features. As a result, one can compare a patient against the mean attribute values of all the other patients who were classified using the same rule. This comparison will give clinicians some indication of how "close" a patient is to the rule average. To compute these "rule mean" coefficients, we partition the training data by rule and take the average of each coefficient over all the records in that particular partition. For a new patient, we compute the Zernike transformation and classify the record using the decision tree to determine the rule for that patient. Once this step has been completed, we apply our visualization algorithm to produce five separate images (illustrated in Fig. 4). The first panel is a topographical surface representing the Zernike model for the patient. It is constructed by plotting the 3-D transformation surface as a 2-D topographical map, with elevation denoted by color. The second section contains a topographical surface created in a similar manner using the rule coefficients. These surfaces are intended to give an overall picture of how the patient's cornea compares to the average cornea of all similarly-classified patients. The next two panels in Fig. 4 (rows 3 and 4) are intended to highlight the features used in classification, i.e. the distinguishing surface details. We call these surfaces the rule surfaces. They are constructed from the values of the coefficients that were part of the classification rule (the rest are zero). For instance, if a patient were classified using the second rule in the list above, we would keep the values of Z7, Z12 and Z5. The first rule surface (third panel of Fig. 4) is created by using the relevant coefficients, but instead of using the patient-specific values, we use the values of the rule coefficients. This surface represents the mean values of the distinguishing features for that rule. The second rule surface (row 4, Fig. 4) is created in the same fashion, but with the coefficient values from the patient transformation, not the average values. The actual construction of the surfaces is achieved by running our transformation algorithm in reverse, updating the values in the coefficient matrix depending on the surface being generated. Once the values have been computed, we are left with several 3-D polar surfaces. In the surfaces, the base elevation is green; increasing elevation is represented by red-shifted colors and decreased elevation by blue-shifted colors. Finally, we provide a measure to show how close the patient lies to those falling in the same rule partition. For each of the distinguishing coefficients, we compute a relative error between the patient and the rule: we take the absolute value of the difference between the coefficient value of the patient and the value of the rule and divide the result by the rule value. We provide a bar chart of these errors for each coefficient (the error values of the coefficients not used in classification are set to zero). This plot is intended to provide a clinician with an idea of the influence of the specific geometric modes in classification and the
degree to which the patient deviates from the mean. To provide further detail, if the patient's coefficient value is less than the rule coefficient, we color the bar blue. If it is greater, we color it red.
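A compact sketch of these two computations (ours, assuming NumPy arrays of Zernike coefficients indexed by OSA term; all names are hypothetical):

```python
# Sketch: "rule mean" coefficients from the training partition of one rule,
# and the relative errors plotted in the bottom panel of Fig. 4 (errors of
# coefficients not used in the rule are left at zero).
import numpy as np

def rule_mean(train_coeffs: np.ndarray, assigned_rule: np.ndarray,
              rule_id: int) -> np.ndarray:
    """Average each coefficient over the records classified by one rule."""
    return train_coeffs[assigned_rule == rule_id].mean(axis=0)

def rule_relative_errors(patient: np.ndarray, rule: np.ndarray,
                         rule_terms: list[int]) -> np.ndarray:
    """|patient - rule| / |rule| for the distinguishing coefficients."""
    err = np.zeros_like(patient)
    for j in rule_terms:                  # e.g. [7, 12, 5] for Rule 2
        err[j] = abs(patient[j] - rule[j]) / abs(rule[j])
    return err
```

The sign information (whether the patient lies below or above the rule value) is what the blue/red bar colouring encodes.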
6 Dataset
The dataset for these experiments consists of the examination data of 254 eyes obtained from a clinical corneal topography system. The data were examined by a clinician and divided into three groups based on corneal shape. The divisions are given below, with the number of patients in each group listed in parentheses:

1. Normal (n = 119)
2. Post-operative myopic LASIK (n = 36)
3. Keratoconus (n = 99)

Relative to the corneal shape of normal eyes, post-operative myopic LASIK corneas generally have a flatter center curvature with an annulus of greater curvature in the periphery. Corneas that are diagnosed as having keratoconus are also more distorted than normal corneas. Unlike the LASIK corneas, they generally have localized regions of steeper curvature at the site of tissue degeneration, but often have normal or flatter than normal curvature elsewhere.
7 Visualization of Classification Results
Using a C4.5 decision tree generated from the classification of the dataset on a 4th order Zernike transformation (Fig. 3), we provide an example of our visualization technique on a record from each patient class.

Table 1. Strongest Decision Tree Rules and Support

Class  Rule Label  Expression                                            Support  Support Percentage
K      1           Z7 > 2.88                                             47       66.2%
N      8           Z7 ≤ 2.88 ∧ Z12 > −4.04 ∧ Z5 ≤ 9.34 ∧ Z9 ≤ 1.42 ∧     74       89.2%
                   −0.07 > Z8 ≤ 1.00 ∧ Z13 > −0.31
L      4           Z7 ≤ 2.88 ∧ Z12 ≤ −4.04 ∧ Z0 > −401.83                19       79.2%
As stated previously, we must reclassify our training data after generating the decision tree in order to partition the data by rule. We expect that certain rules will be stronger than others, i.e. a larger portion of the class will fall into that rule's partition. Table 1 provides a list of the strongest rule for each class. Given in the table are the rule number, its class label and its support, or the number of patients that were classified using that rule. We also provide the support percentage of the rule, or the number of patients classified by the rule divided by
the total number of patients in that class. These numbers reflect the classification of the training data only, not the entire dataset (the data was divided into a 70%/30% training/testing split). Table 1 also provides the expression for each rule. Using a patient that was classified by each strong rule, we execute our visualization algorithm and create a set of surfaces. These surfaces are shown in Fig. 4. Figure 4 (a) shows a Keratoconic patient classified by Rule 1. Figure 4 (b) gives an example of a Normal cornea classified under Rule 8. Finally, Fig. 4 (c) illustrates a LASIK eye falling under Rule 4. Since these are the strongest, or most discriminating, rules for each class, we would expect the rule surfaces to exhibit surface features commonly associated with corneas of that type. Our results are in agreement with the expectations of domain experts. For instance, corneas that are diagnosed as Keratoconic often have a cone shape protruding from the surface. This aberration, referred to as "coma," is modeled by terms Z7 and Z8 (vertical and horizontal coma, respectively). As a result, we would expect one of these terms to play some role in any Keratoconus diagnosis. Rule 1, the strongest Keratoconus rule, is characterized by the presence of a large Z7 value. The rule surfaces (rows 3 and 4) of Fig. 4 (a) clearly show such a feature. Corneas that have undergone LASIK surgery are characterized by a generally flat, plateau-like center. They often have a very negative spherical aberration mode (Z12). A negative Z12 value is part of Rule 4, the strongest LASIK rule. The elevated ring structure shown in the rule surfaces (rows 3 and 4) of Fig. 4 (c) is an example of that mode.
8 Conclusions
In order to address malpractice concerns and be trusted by clinicians, a classification system should consistently provide high accuracy (at least 90%, preferably higher). When coupled with meta-learning techniques, decision trees can provide such accuracy. The motivation behind the application presented here was to provide an additional tool that can be used by clinicians to aid in patient diagnosis and decision support. By taking advantage of the unique properties of Zernike polynomials, we are able to create several decision surfaces which reflect the geometric features that discriminate between normal, keratoconic and post-LASIK corneas. We have shown our results to several practicing clinicians, who indicated that the important corneal features found in patients with keratoconus were indeed being used in classification and can be seen in the decision surfaces. This creates a degree of trust between the clinician and the system. In addition, the clinicians stated a preference for this technique over several state-of-the-art systems available to them. Most current classification systems diagnose a patient as either having keratoconus or not. An off-center post-LASIK cornea can present itself as one that has keratoconus and vice versa. Misclassifying the post-LASIK cornea is bad, but misclassifying a keratoconic cornea as normal (in a binary classifier) can be catastrophic: performing LASIK surgery on a cornea suffering from keratoconus can lead to a corneal transplant. Our method can be leveraged
by clinicians to serve as a safety check, limiting the liability that results from relying on an automated decision from an expert system.
References

1. Thibos, L., Applegate, R., Schwiegerling, J., Webb, R.: Standards for reporting the optical aberrations of eyes. In: TOPS, Santa Fe, NM, OSA (1999)
2. Iskander, D., Collins, M., Davis, B.: Optimal modeling of corneal surfaces with Zernike polynomials. IEEE Trans Biomed Eng 48 (2001) 87–95
3. Iskander, D.R., Collins, M.J., Davis, B., Franklin, R.: Corneal surface characterization: How many Zernike terms should be used? (ARVO abstract). Invest Ophthalmol Vis Sci 42 (2001) S896
4. Twa, M., Parthasarathy, S., Raasch, T., Bullimore, M.: Automated classification of keratoconus: A case study in analyzing clinical data. In: SIAM Intl. Conference on Data Mining, San Francisco, CA (2003)
5. Maeda, N., Klyce, S.D., Smolek, M.K., Thompson, H.W.: Automated keratoconus screening with corneal topography analysis. Invest Ophthalmol Vis Sci 35 (1994) 2749–57
6. Rabinowitz, Y.S., Rasheed, K.: KISA% index: a quantitative videokeratography algorithm embodying minimal topographic criteria for diagnosing keratoconus. J Cataract Refract Surg 25 (1999) 1327–35
7. Smolek, M.K., Klyce, S.D.: Screening of prior refractive surgery by a wavelet-based neural network. J Cataract Refract Surg 27 (2001) 1926–31
8. Maeda, N., Klyce, S.D., Smolek, M.K.: Neural network classification of corneal topography. Preliminary demonstration. Invest Ophthalmol Vis Sci 36 (1995) 1327–35
Learning Rules from Multisource Data for Cardiac Monitoring

Élisa Fromont, René Quiniou, and Marie-Odile Cordier

IRISA, Campus de Beaulieu, 35000 Rennes, France
{efromont, quiniou, cordier}@irisa.fr

Abstract. This paper aims at formalizing the concept of learning rules from multisource data in a cardiac monitoring context. Our method has been implemented and evaluated on learning from data describing cardiac behaviors from different viewpoints, here electrocardiograms and arterial blood pressure measures. In order to cope with the dimensionality problems of multisource learning, we propose an Inductive Logic Programming method using a two-step strategy. Firstly, rules are learned independently from each source. Secondly, the learned rules are used to bias a new learning process from the aggregated data. The results show that the proposed method is much more efficient than learning directly from the aggregated data. Furthermore, it yields rules having better or equal accuracy than rules obtained by monosource learning.
1 Introduction

Monitoring devices in Cardiac Intensive Care Units (CICU) use only data from electrocardiogram (ECG) channels to diagnose cardiac arrhythmias automatically. However, data from other sources like arterial pressure, phonocardiograms, ventilation, etc. are often available. This additional information could also be used to improve the diagnosis and, consequently, to reduce the number of false alarms emitted by monitoring devices. From a practical point of view, only severe arrhythmias (considered as red alarms) are diagnosed automatically, and in a conservative manner to avoid missing a problem. The aim of the work that has begun in the Calicot project [1] is to improve the diagnosis of cardiac rhythm disorders in a monitoring context and to extend the set of recognized arrhythmias to non-lethal ones, if they are detected early enough (considered as orange alarms). To achieve this goal, we want to combine information coming from several sources, such as ECG and arterial blood pressure (ABP) channels. We are particularly interested in learning temporal rules that could enable such a multisource detection scheme. To learn this kind of rule, a relational learning system that uses Inductive Logic Programming (ILP) is well adapted. ILP not only enables learning relations between characteristic events occurring on the different channels, but also provides rules that are understandable by doctors, since the representation method relies on first order logic. One possible way to combine information coming from different sources is simply to aggregate all the learning data and then to learn as in the monosource
(i.e. one data source) case. However, in a multisource learning problem, the amount of data and the expressiveness of the language can increase dramatically and, with them, the computation time of ILP algorithms and the size of the hypothesis search space. Many methods have been proposed in ILP to cope with the search space dimensions; one of them is using a declarative bias [2]. This bias aims either at narrowing the search space or at ranking hypotheses to consider the better ones first for a given problem. Designing an efficient bias for a multisource problem is a difficult task. In [3], we sketched a divide-and-conquer strategy (called biased multisource learning) where symbolic rules are learned independently from each source and then the learned rules are used to bias, automatically and efficiently, a new learning process on the aggregated dataset. This proposal is developed here and applied to cardiac monitoring data. In the first section we give a brief introduction to inductive logic programming. In the second section we expose the proposed method. In the third section we describe the experiments we have done to compare, on learning from cardiac data, the monosource, naive multisource and biased multisource methods. The last section gives conclusions and perspectives.
2 Multisource Learning with ILP
In this section, we make a brief introduction to ILP (see [4] for more details) and we give a formalization of this paradigm applied to multisource learning. We assume familiarity with first order logic (see [5] for an introduction).

2.1 Introduction to ILP
Inductive Logic Programming (ILP) is a supervised machine learning method. Given a set of examples E and a set of general rules B representing the background knowledge, it builds a set of hypotheses H, in the form of classification rules for a set of classes C. B and H are logic programs, i.e. sets of rules (also called definite clauses) having the form h :- b1, b2, ..., bn. When n = 0 such a rule is called a fact. E is a labeled set of ground facts. In a multi-class problem, each example labeled by c is a positive example for the class c and a negative example for every class c′ ∈ C − {c}. The following definition for a multi-class ILP problem is inspired by Blockeel et al. [6].

Definition 1. A multi-class ILP problem is described by a tuple <L, E, B, C> such that:
– E = {(ek, c) | k = 1, m; c ∈ C} is the set of examples, where each ek is a set of facts expressed in the language LE.
– B is a set of rules expressed in the language L. L = LE ∪ LH where LH is the language of hypotheses.

The ILP algorithm has to find a set of rules H such that for each (e, c) ∈ E: H ∧ e ∧ B ⊨ c and ∀ c′ ∈ C − {c}, H ∧ e ∧ B ⊭ c′.
The hypotheses in H are searched for in a so-called hypothesis space. A generalization relation, usually θ-subsumption [7], can be defined on hypotheses. This relation induces a lattice structure on LH which enables an efficient exploration of the search space. Different strategies can be used to explore the hypothesis search space. For example, ICL [8], the ILP system we used, explores the search space from the most general clause to more specific clauses. The search stops when a clause is reached that covers no negative example while covering some positive examples. At each step, the best clause is refined by adding new literals to its body, applying variable substitutions, etc. The search space, initially defined by LH, can be restricted by a so-called language bias. ICL uses a declarative bias (DLAB [9]) which allows the subset of clauses from LH belonging to the search space to be defined syntactically. A DLAB bias is a grammar which defines exactly which literals are allowed in hypotheses, in which order literals are added to hypotheses, and the search depth limit (the clause size). The most specific clauses of the search space that can be generated from a bias specification are called bottom clauses. Conversely, a DLAB bias can be constructed from a set of clauses (the method will not be explained in this article).
2.2 Multisource Learning
In a multisource learning problem, examples are bi-dimensional: the first dimension, i ∈ [1, s], refers to a source; the second one, k ∈ [1, m], refers to a situation. Examples indexed by the same situation correspond to contemporaneous views of the same phenomenon. Aggregation is the operation consisting in merging examples from different views of the same situations. The aggregation function Fagg depends on the learning data type and can be different from one multisource learning problem to another. Here, the aggregation function is simply set union associated with inconsistency elimination: inconsistent aggregated examples are eliminated from the multisource learning problem. The aggregation knowledge, such as the correspondence between example attributes on different channels and temporal constraints, is entirely described in the background knowledge B. Multisource learning for a multi-class ILP problem is then defined as follows:

Definition 2. Let <Li, Ei, Bi, C>, i = 1, s, be ILP problems such that Li describes the data from source i and Ei = {(ei,k, c) | k = 1, m; c ∈ C}. A multisource ILP problem is defined by a tuple <L, E, B, C> such that:
– E = Fagg(E1, E2, ..., Es) = {(ek, c) | ek = e1,k ∪ e2,k ∪ ... ∪ es,k, k = 1, m},
– L = LE ∪ LH is the multisource language, where LE = Fagg(LE1, LE2, ..., LEs) and LH ⊇ LH1 ∪ ... ∪ LHs,
– B is a set of rules in the language L.

The ILP algorithm has to find a set of rules H such that for each (e, c) ∈ E: H ∧ e ∧ B ⊨ c and ∀ c′ ∈ C − {c}, H ∧ e ∧ B ⊭ c′.

A naive multisource approach consists in learning directly from the aggregated examples and with a global bias that covers the whole search space related
to the aggregated language L. The main drawback of this approach is the size of the resulting search space. In many situations the learning algorithm is not able to cope with it or takes too much computation time. The only solution is to specify an efficient language bias, but this is often a difficult task especially when no information describing the relations between sources is provided. In the following section, we propose a new method to create such a bias.
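Before describing that method, the aggregation step itself can be pictured with a small sketch (ours; the actual system works on Prolog fact bases, not Python sets): Fagg unions the fact sets recorded for the same situation on every source and drops inconsistent aggregated examples.

```python
# Sketch: aggregation as set union over sources, per situation, with
# inconsistency elimination.  Facts are represented as Prolog-style strings.

def aggregate(sources, n_situations, consistent=lambda facts: True):
    """Merge the facts of every source for each situation index k."""
    merged = {}
    for k in range(n_situations):
        facts = set().union(*(src.get(k, set()) for src in sources))
        if consistent(facts):             # drop inconsistent examples
            merged[k] = facts
    return merged

ecg = {0: {"qrs(r0,normal)", "qrs(r1,normal)", "suc(r1,r0)"}}
abp = {0: {"systole(ps0,3558,-279)"}}
print(aggregate([ecg, abp], 1))           # one aggregated example
```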
3 Reducing the Multisource Learning Search Space

We propose a multisource learning method that consists in learning rules independently from each source. The resulting clauses, considered as being bottom clauses, are then merged and used to build a bias that will be used for a new learning process on the aggregated data. Algorithm 1 shows the different steps of the method on two-source learning. It can be straightforwardly extended to n-source learning. We assume that the situations are described using a common reference time. This is seldom the case for raw data, so we assume that the dataset has been preprocessed to ensure this property. A literal that describes an event occurring on some data source, such as qrs(R0,normal), is called an event literal. Literals of predicates common to the two sources and describing relations between two events, such as suc(R0,R1) or rr1(R0,R1,normal), are called relational literals.

Algorithm 1
1. Learn with bias Bias1 on the ILP problem <L1, E1, B1, C>. Let Hc1 be the set of rules learned for a given class c ∈ C (rules with head c).
2. Learn with bias Bias2 on the ILP problem <L2, E2, B2, C>. Let Hc2 be the set of rules learned for the class c (rules with head c).
3. Aggregate the sets of examples E1 and E2, giving E3.
4. Generate from all pairs (h1j, h2k) ∈ Hc1 × Hc2 a set of bottom clauses BT such that each bti ∈ BT built from h1j and h2k is more specific than both h1j and h2k. The literals of bti are all the literals of h1j and h2k plus new relational literals that synchronize events in h1j and h2k.
5. For each c ∈ C, build bias Biasc3 from BT. Let Bias3 = {Biasc3 | c ∈ C}.
6. Learn with Bias3 on the problem <L, E3, B3, C> where:
   – L is the multisource language as defined in Section 2,
   – B3 is a set of rules expressed in the language L.

One goal of multisource learning is to make relationships between events occurring on different sources explicit. For each pair (h1j, h2k), there are as many bottom clauses as ways to intertwine events from the two sources. A new relational predicate, suci, is used to specify this temporal information. The number of bottom clauses generated for one pair (h1j, h2k) is C(n+p, n), where n is the number of event predicates belonging to h1j and p is the number of event predicates belonging to h2k. The number of clauses in BT is the total number of bottom clauses generated for all possible pairs. This number may be very high if Hc1
and Hc2 contain more than one rule and if there are several event predicates in each rule. However, in practice, many bottom clauses can be eliminated because the related event sequences do not make sense for the application. The bias can then be generated automatically from this set of bottom clauses. The multisource search space bounded by this bias has Properties 1, 2 and 3 below.

Property 1 (Correctness). There exist hypotheses with an equal or higher accuracy than the accuracy of Hc1 and Hc2 in the search space defined by Bias3 of Algorithm 1.

Intuitively, Property 1 states that, in the worst case, Hc1 and Hc2 can also be learned by the biased multisource algorithm. The accuracy¹ is defined as the rate of correctly classified examples.

Property 2 (Optimality). There is no guarantee of finding the multisource solution with the best accuracy in the search space defined by Bias3 in Algorithm 1.

Property 3 (Search space reduction). The search space defined by Bias3 in Algorithm 1 is smaller than the naive multisource search space.

The size of the search space specified by a DLAB bias can be computed by the method given in [9]. The biased multisource search space is smaller than the naive search space since the language used in the first case is a subset of the language used in the second case. In the next section, this biased multisource method is compared to monosource learning from cardiac data coming from an electrocardiogram for the first source and from measures of arterial blood pressure for the second source. The method is also compared to a naive multisource learning performed on the data aggregated from the two former sources.
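The binomial count C(n+p, n) above is simply the number of order-preserving interleavings of the two event sequences; the following sketch (ours, not the authors' ILP code) makes this concrete by enumerating the interleavings that step 4 of Algorithm 1 considers before pruning nonsensical sequences:

```python
# Sketch: every order-preserving interleaving of the n event literals of
# h1j with the p event literals of h2k; their number is C(n+p, n).
from itertools import combinations
from math import comb

def interleavings(a: list, b: list):
    """Yield every order-preserving interleaving of sequences a and b."""
    n, p = len(a), len(b)
    for pos in combinations(range(n + p), n):  # slots taken by events of a
        merged, ia, ib = [], iter(a), iter(b)
        for i in range(n + p):
            merged.append(next(ia) if i in pos else next(ib))
        yield merged

ecg_events = ["qrs(R0)", "qrs(R1)"]
abp_events = ["systole(S0)"]
assert sum(1 for _ in interleavings(ecg_events, abp_events)) == comb(3, 2)
```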
4 Experimental Results

4.1 Data
We use the MIMIC database (Multi-parameter Intelligent Monitoring for Intensive Care [10]), which contains 72 patient files recorded in the CICU of the Beth Israel Hospital Arrhythmia Laboratory. Raw data concerning the channel V1 of an ECG and an ABP signal are extracted from the MIMIC database and transformed into symbolic descriptions by signal processing tools. These descriptions are stored into a logical knowledge database as Prolog facts (cf. Figures 1 and 2). Figure 1 shows 7 facts in a ventricular doublet example: 1 P wave and 3 QRSs, the first one occurring at time 5026, as well as relations describing the order of these waves in the sequence. Additional information such as the wave shapes (normal/abnormal) is also provided.
¹ The accuracy is defined by the formula (TP + TN) / (TP + TN + FP + FN), where TP (true positive) is the number of positive examples classified as true, TN (true negative) the number of negative examples classified as false, FN (false negative) the number of positive examples classified as false, and FP (false positive) the number of negative examples classified as true.
begin(model).
doublet_3_I.
.....
p_wave(p7,4905,normal).
qrs(r7,5026,normal).   suc(r7,p7).
qrs(r8,5638,abnormal). suc(r8,r7).
qrs(r9,6448,abnormal). suc(r9,r8).
.....
end(model).

Fig. 1. Example representation of a ventricular doublet ECG

begin(model).
rs_3_ABP.
.......
diastole(pd4,3406,-882). suc(pd4,ps3).
systole(ps4,3558,-279).  suc(ps4,pd4).
......
end(model).

Fig. 2. Example representation of a normal rhythm pressure channel
Table 1. Number of nodes visited for learning and computation times

        monosource: ECG    monosource: ABP    naive multisource     biased multisource
        Nodes    Time *    Nodes    Time *    Nodes     Time        Nodes    Time (⊃ *)
sr      2544     176.64    2679     89.49     18789     3851.36     243      438.55
ves     2616     68.15     5467     68.04     29653     3100.00     657      363.86
bige    1063     26.99     1023     14.27     22735     3299.43     98       92.74
doub    2100     52.88     4593     64.11     22281     2417.77     1071     290.17
vt      999      26.40     3747     40.01     8442      724.69      30       70.84
svt     945      29.67     537      17.85     4218      1879.71     20       57.58
af      896      23.78     972      21.47     2319      550.63      19       63.92
TOT     11163    404.51    19018    315.24    108437    15823.59    2138     1377.66
Figure 2 provides a similar description for the pressure channel. Seven cardiac rhythms (corresponding to seven classes) are investigated in this work: normal rhythm (sr), ventricular extra-systole (ves), bigeminy (bige), ventricular doublet (doub), ventricular tachycardia (vt), which is considered a red alarm in the CICU, supra-ventricular tachycardia (svt) and atrial fibrillation (af). On average, 7 examples were built for each of the 7 classes.

4.2 Method
To verify empirically that the biased multisource learning method is efficient, we performed three kinds of learning experiments on the same learning data: monosource learning from each source, multisource learning from aggregated data using a global bias, and multisource learning using a bias constructed from the rules discovered by monosource learning (the first experiment). In order to assess the
Table 2. Results of cross-validation for monosource and multisource learnings

        monosource ECG        monosource ABP           naive multisource      biased multisource
        TrAcc  Acc   Comp     TrAcc  Acc   Comp        TrAcc  Acc   Comp      TrAcc  Acc   Comp
sr      0.84   0.84  9        1      0.98  5           0.98   0.98  3         0.98   0.98  6
ves     1      0.96  5/6      0.963  0.64  3/2/3/2     0.976  0.76  4/3       0.96   0.98  5
bige    1      1     5        0.998  0.84  3/2         0.916  0.7   4/2       1      1     5
doub    1      1     4/5      0.995  0.84  4           0.997  0.8   3/4       0.967  0.9   5
vt      1      1     3        0.981  0.78  3/3/5       0.981  0.94  4         1      1     3
svt     1      1     3        1      0.96  2           1      0.98  2         1      1     2
af      1      1     3        0.981  0.98  2/2         1      1     2         1      1     2
impact of the learning hardness, we performed two series of experiments: in the first series (Section 4.3) the representation language was expressive enough to give good results for the three kinds of experiments; in the second series (Section 4.4) the expressiveness of the representation language was reduced drastically. Three criteria are used to compare the learning results: computational load (CPU time), accuracy, and complexity (Comp) of the rules (each number in a Comp cell represents the number of cardiac cycles in one rule produced by the ILP system). As the number of examples is rather low, a leave-one-out cross-validation method is used to assess the different criteria. The average accuracy measures obtained during cross-validation training (TrAcc) and test (Acc) are provided.

4.3 Learning from the Whole Database
Table 1 gives an idea of the computational complexity of each learning method (monosource on ECG and ABP channels, naive and biased multisource on aggregated data). Nodes is the number of nodes explored in the search space and Time is the learning computation time in CPU seconds on a Sun Ultra-Sparc 5. Table 1 shows that, on average, from 5 to 10 times more nodes are explored during naive multisource learning than during monosource learning, and that about 500 to 1000 times fewer nodes are explored during biased multisource learning than during naive multisource learning. However, the computation time does not grow linearly with the number of explored nodes because the covering tests (determining whether a hypothesis is consistent with the examples) are more complex for multisource learning. Biased multisource learning computation times take into account the monosource learning computation times and are still very much smaller than for naive multisource learning (8 to 35 times less). Table 2 gives the average accuracy and complexity of the rules obtained during cross-validation for the monosource and the two multisource learning methods. The accuracy of the monosource rules is very good for ECG and somewhat lower for ABP, particularly the test accuracy. The naive multisource rules also have good results. Furthermore, for the seven arrhythmias, these rules combine events and relations occurring on both sources. Only sr, ves, svt and af yielded combined biased multisource rules; in the three other cases, the learned rules are the same as the ECG rules. The rules learned for svt in the four learning settings are given
class(svt) :-
    qrs(R0,normal), p(P1,normal), qrs(R1,normal),
    suc(P1,R0), suc(R1,P1), rr1(R0,R1,short),
    p(P2,normal), qrs(R2,normal), suc(P2,R1), suc(R2,P2),
    rythm(R0,R1,R2,regular).

Fig. 3. Example of rule learned for class svt from ECG data

class(svt) :-
    cycle_abp(Dias0,_,Sys0,normal),
    cycle_abp(Dias1,normal,Sys1,normal), suc(Dias1,Sys0),
    amp_dd(Dias0,Dias1,normal), ss1(Sys0,Sys1,short),
    ds1(Dias1,Sys1,long).

Fig. 4. Example of rule learned for class svt from ABP data
in Figures 3, 4, 5 and 6. All those rules are perfectly accurate. The predicate cycle_abp(D, ampsd, S, ampds) is a kind of macro predicate that expresses the succession of a diastole named D and a systole named S. ampsd (resp. ampds) expresses the symbolic pressure variation (∈ {short, normal, long}) between a systole and the following diastole D (resp. between the diastole D and the following systole S). The biased multisource rule and the naive multisource rule are very similar but specify different event orders (in the first one the diastole-systole specification occurs before two close-in-time QRS whereas in the second one the same specification occurs after two close-in-time QRS). As expected from the theory, the biased multisource rules' accuracy is better than or equal to the monosource rules' accuracy, except for the doublet. In this case, the difference between the two accuracy measures comes from a drawback of using cross-validation to evaluate the biased multisource rules with respect to the monosource rules. At each cross-validation step, one example is extracted for testing from the example database and learning is performed on the remaining database. Sometimes the learned rules differ from one step to another. Since this variation is very small, we have chosen to keep the same multisource bias for all the cross-validation steps, even if the monosource rules upon which it should be constructed may vary. Given this choice, the small variation between the biased multisource rules' accuracy and the monosource rules' accuracy is not significant. Table 2 also shows that when the monosource results are good, the biased multisource rules have a better accuracy than the naive multisource rules, and rules combining events from different sources can also be learned.

4.4 Learning from a Less Informative Database
The current medical data we are working on are very well known to cardiologists, so we have a lot of background information on them. For example, we know which events or which kinds of relations between events are interesting for the learning process, which kinds of constraints exist between events occurring on the different sources, etc. This knowledge is very useful for creating the learning bias and partly explains the very good accuracy results obtained in the learning experiments above. These good results can also be explained by the small number of examples available for each arrhythmia and the fact that our examples are not corrupted. In this context, it is very difficult to evaluate the usefulness of using two data sources to improve the learning performance.
class(svt) :-
    qrs(R0,normal), cycle_abp(Dias0,_,Sys0,normal),
    p(P1,normal), qrs(R1,normal),
    suci(P1,Sys0), suc(R1,P1),
    systole(Sys1), rr1(R0,R1,short).

Fig. 5. Example of rule learned for class svt by naive multisource learning

class(svt) :-
    qrs(R0,normal), p(P1,normal), qrs(R1,normal),
    suc(P1,R0), suc(R1,P1), rr1(R0,R1,short),
    cycle_abp(Dias0,_,Sys0,normal), suci(Dias0,R1).

Fig. 6. Example of rule learned for class svt by biased multisource learning
Table 3. Results of cross-validation for monosource and multisource learnings without knowledge of the P wave, the QRS shape or the diastole

        monosource ECG        monosource ABP          naive multisource       biased multisource
        TrAcc  Acc   Comp     TrAcc  Acc   Comp       TrAcc  Acc   Comp       TrAcc  Acc   Comp
sr      0.38   0.36  5        1      0.96  5/4        1      0.92  5          1      0.98  5/4
ves     0.42   0.4   5        0.938  0.76  4/3        0.945  0.64  4/4/6      0.94   0.9   4/3
bige    0.96   0.92  4        1      0.98  4/4        0.98   0.96  4          1      0.98  4/4
doub    0.881  0.78  4/4      0.973  0.86  4/4/5      1      0.92  4/4        0.941  0.9   3/4/5
vt      0.919  0.84  5/5      0.943  0.84  3/4/5      0.977  0.76  6/5        0.96   0.86  3/5/5
svt     0.96   0.94  5        0.962  0.86  4          0.76   0.76  2          0.96   0.94  4
af      0.945  0.86  4/4      1      0.9   3/4/5      0.962  0.82  2/3        1      0.98  3/4/5
We have thus decided to place ourselves in a more realistic situation where information about the sources is reduced. In this experiment we take into account neither the P waves nor the shape of the QRS on the ECG, nor the diastole on the ABP channel. This experiment makes sense as far as signal processing is concerned, since it is still difficult with current signal processing algorithms to detect a P wave on the ECG. Besides, in our symbolic description of the ABP channel, the diastole is simply the lowest point between two systoles; this specific point is also difficult to detect. Note that cardiologists view the diastole as the time period between two systoles (the moment during which the chambers fill with blood). The results of this second experiment are given in Table 3. This time again, the biased multisource rules have as good or better results than the monosource rules. For arrhythmias sr, bige and af the biased method acts like a voting method and learns the same rules with the same accuracy as the best monosource rules (small variations in accuracy come from the cross-validation drawback). For arrhythmias ves, doublet, vt and svt the biased multisource rules differ from both monosource rules corresponding to the same arrhythmia, and both training and test accuracy are better than for the monosource rules.
5 Conclusion
We have presented a technique to learn rules from multisource data with an inductive logic programming method, in order to improve the detection and recognition of cardiac arrhythmias in a monitoring context. To reduce the computation time of a straightforward multisource learning from aggregated examples, we propose a method to design an efficient bias for multisource learning. This bias is constructed from the results obtained by learning independently from the data associated with the different sources. We have shown that this technique provides rules which always have better or equal accuracy results than monosource rules, and that it is much more efficient than naive multisource learning. In future work, the method will be tested on corrupted data. Besides, since this article focuses only on accuracy and performance results, the impact of the multisource rules on the recognition and diagnosis of cardiac arrhythmias in a clinical context will be evaluated more deeply by experts.
Acknowledgments

Élisa Fromont is supported by the French National Network for Health Technologies as a member of the CEPICA project. This project is in collaboration with LTSI-Université de Rennes 1, ELA-Medical and Rennes University Hospital.
References

1. Carrault, G., Cordier, M., Quiniou, R., Wang, F.: Temporal abstraction and inductive logic programming for arrhythmia recognition from ECG. Artificial Intelligence in Medicine 28 (2003) 231–263
2. Nédellec, C., Rouveirol, C., Adé, H., Bergadano, F., Tausend, B.: Declarative bias in ILP. In De Raedt, L., ed.: Advances in Inductive Logic Programming. IOS Press (1996) 82–103
3. Fromont, E., Cordier, M.O., Quiniou, R.: Learning from multi source data. In: PKDD'04 (Knowledge Discovery in Databases), Pisa, Italy (2004)
4. Muggleton, S., De Raedt, L.: Inductive Logic Programming: Theory and methods. The Journal of Logic Programming 19 & 20 (1994) 629–680
5. Lloyd, J.: Foundations of Logic Programming. Springer-Verlag, Heidelberg (1987)
6. Blockeel, H., De Raedt, L., Jacobs, N., Demoen, B.: Scaling up inductive logic programming by learning from interpretations. Data Mining and Knowledge Discovery 3 (1999) 59–93
7. Plotkin, G.: A note on inductive generalisation. In Meltzer, B., Michie, D., eds.: Machine Intelligence 5. Elsevier North Holland, New York (1970) 153–163
8. De Raedt, L., Van Laer, W.: Inductive constraint logic. Lecture Notes in Computer Science 997 (1995) 80–94
9. De Raedt, L., Dehaspe, L.: Clausal discovery. Machine Learning 26 (1997) 99–146
10. Moody, G.B., Mark, R.G.: A database to support development and evaluation of intelligent intensive care monitoring. Computers in Cardiology 23 (1996) 657–660. http://ecg.mit.edu/mimic/mimic.html
Effective Confidence Region Prediction Using Probability Forecasters

David G. Lindsay¹ and Siân Cox²

¹ Computer Learning Research Centre, [email protected]
² School of Biological Sciences, [email protected]
Royal Holloway, University of London, Egham, Surrey, TW20 0EX, UK
Abstract. Confidence region prediction is a practically useful extension to the commonly studied pattern recognition problem. Instead of predicting a single label, the constraint is relaxed to allow prediction of a subset of labels given a desired confidence level 1 − δ. Ideally, effective region predictions should (1) be well-calibrated (predictive regions at confidence level 1 − δ should err with relative frequency at most δ) and (2) be as narrow (or certain) as possible. We present a simple technique to generate confidence region predictions from conditional probability estimates (probability forecasts). We use this 'conversion' technique to generate confidence region predictions from the probability forecasts output by standard machine learning algorithms when tested on 15 multi-class datasets. Our results show that approximately 44% of experiments demonstrate well-calibrated confidence region predictions, with the K-Nearest Neighbour algorithm tending to perform consistently well across all data. Our results illustrate the practical benefits of effective confidence region prediction with respect to medical diagnostics, where guarantees of capturing the true disease label can be given.
1 Introduction

Pattern recognition is a well studied area of machine learning; however, the equally useful extension of confidence region prediction [1] has to date received little attention from the machine learning community. Confidence region prediction relaxes the traditional constraint on the learner to predict a single label, allowing prediction of a subset of labels given a desired confidence level 1 − δ. The traditional approach to this problem was to construct region predictions from so-called p-values, often generated from purpose-built algorithms such as the Transductive Confidence Machine [1]. Whilst these elegant solutions offer non-asymptotic, provably valid properties, p-values have significant disadvantages: their interpretation is less direct than that of probability forecasts, and they are easy to confuse with probabilities. Probability forecasting is an increasingly popular generalisation of the standard pattern recognition problem; rather than attempting to find the "best" label, the aim is to estimate the conditional probability (otherwise known as a probability forecast) of a possible label given an observed object. Meaningful probability forecasts are vital when learning techniques are involved with
cost sensitive decision-making (as in medical diagnostics) [2], [3]. In this study we introduce a simple, yet provably valid technique to generate confidence region predictions from probability forecasts, in order to capitalise on the individual benefits of each classification methodology. There are two intuitive criteria for assessing the effectiveness of region predictions; ideally they should be:

1. Well calibrated - predictive regions at confidence level 1 − δ ∈ [0, 1] will be wrong (not capture the true label) with relative frequency at most δ.
2. Certain - the predictive regions should be as narrow as possible.

The first criterion is the priority: without it, the meaning of predictive regions is lost, and it becomes easy to achieve the best possible performance. We have applied our simple conversion technique to the probability forecasts generated by several commonly used machine learning algorithms on 15 standard multi-class datasets. Our results have demonstrated that 44% of experiments produced well-calibrated region predictions, and we found at least one well-calibrated region predictor for 12 of the 15 datasets tested. Our results have also demonstrated our main dual motivation: that the learners also provide useful probability forecasts for more direct decision making. We discuss the application of this research with respect to medical diagnostics, as several clinical studies have shown that computer-based diagnostic systems [4], [5] can provide valuable insights into patient problems, especially by helping to identify alternative diagnoses. We argue that with the use of effective probability forecasts and confidence region predictions on multi-class diagnostic problems, the end user can have probabilistic guarantees on narrowing down the true disease. Finally, we identify areas of future research.
2 Using Probability Forecasts for Region Prediction

Our notation extends the commonly used supervised learning approach to pattern recognition. Nature outputs information pairs called examples. Each example z_i = (x_i, y_i) ∈ X × Y = Z consists of an object x_i ∈ X and its label y_i ∈ Y = {1, 2, ..., |Y|}. Formally, our learner is a function Γ : Z^n × X → P(Y) mapping n training examples and a new test object onto a probability distribution over labels. Let P̂(y_i = j | x_i) represent the estimated conditional probability of the jth label matching the true label for the ith object tested. We will often consider a sequence of N probability forecasts for the |Y| possible labels output by a learner Γ, representing our predictions for the test set of objects, where the true label is withheld from the learner.
2.1 Probability Forecasts Versus p-Values
Previous studies [1] have shown that it is possible to create special purpose learning methods to produce valid and asymptotically optimal region predictions via the use of p-values. However these p-values have several disadvantages when compared to probability forecasts; their interpretation is less direct than that of probability forecasts, the p-values do not sum to one (although they often take values between 0 and 1) and they are easy to confuse with probabilities. This misleading nature of p-values and other factors have led some authors [6] to object to any use of p-values.
In contrast, although studies in machine learning tend to concentrate primarily on 'bare' predictions, probability forecasting has become an increasingly popular doctrine. Making effective probability forecasts is a well studied problem in statistics and weather forecasting [7], [8]. Dawid (1985) details two simple criteria for describing how effective probability forecasts are:

1. Reliability - the probability forecasts "should not lie". When a probability p̂ is assigned to a label, there should be roughly 1 − p̂ relative frequency of the label not occurring.
2. Resolution - the probability forecasts should be practically useful and enable the observer to easily rank the labels in order of their likelihood of occurring.

These criteria are analogous to those defined earlier for effective confidence region predictions; indeed, we argue that the first criterion of reliability should remain the main focus, as the second criterion, resolution, is naturally ensured by the classification accuracy of learning algorithms (which is a more common focus of empirical studies). At present the most popular techniques for assessing the quality of probability forecasts are square loss (a.k.a. Brier score) and ROC curves [9]. Recently we developed the Empirical Reliability Curve (ERC) as a visual interpretation of the theoretical definition of reliability [3]. Unlike square loss and ROC plots, the ERC allows visualisation of over- and under-estimation of probability forecasts.
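A rough sketch of checking reliability empirically (ours; a simple binning approach, not the ERC method of [3]):

```python
# Sketch: group forecasts into bins and compare each bin's mean forecast
# with the observed frequency of the event; reliable forecasts keep the
# two quantities close in every bin.

def reliability_table(forecasts, outcomes, bins=10):
    """forecasts: predicted probabilities; outcomes: 1 if the label occurred."""
    table = []
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        idx = [i for i, p in enumerate(forecasts)
               if lo <= p < hi or (b == bins - 1 and p == 1.0)]
        if idx:
            mean_p = sum(forecasts[i] for i in idx) / len(idx)
            freq = sum(outcomes[i] for i in idx) / len(idx)
            table.append((mean_p, freq, len(idx)))
    return table
```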
2.2 Converting Probability Forecasts into Region Predictions
The first step in converting the probability forecasts into region predictions is to sort the probability forecasts into increasing order. Let ŷ_i^(1) and ŷ_i^(|Y|) denote the labels corresponding to the smallest and largest probability forecasts respectively:

P̂(ŷ_i^(1) | x_i) ≤ P̂(ŷ_i^(2) | x_i) ≤ ... ≤ P̂(ŷ_i^(|Y|) | x_i).

Using these ordered probability forecasts we can generate the region prediction Γ^(1−δ) at a specific confidence level 1 − δ ∈ [0, 1] for each test example x_i:

Γ^(1−δ)(x_i) = {ŷ_i^(j), ..., ŷ_i^(|Y|)},  where  j = argmax_k { Σ_{l=1}^{k−1} P̂(ŷ_i^(l) | x_i) < δ }.   (1)
Intuitively we are using the fact that the probability forecasts are estimates of conditional probabilities and we assume that the labels are mutually exclusive. Therefore summing these forecasts gives a conditional probability of a disjunction of labels, which can be used to choose the labels to include in the confidence region prediction at the desired confidence level.
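A minimal sketch of this conversion (our code; names are ours, and the forecasts in the usage example are those reported for patient 2490 in Table 1 of Section 2.4):

```python
# Sketch of Equation 1: drop the largest prefix of small probabilities whose
# cumulative sum stays below delta; the remaining labels form the region
# prediction at confidence level 1 - delta.

def region_prediction(forecasts: dict[str, float], delta: float) -> set[str]:
    ranked = sorted(forecasts.items(), key=lambda kv: kv[1])  # increasing
    total, j = 0.0, 0
    for _, p in ranked:
        if total + p >= delta:       # this label can no longer be discarded
            break
        total, j = total + p, j + 1
    return {label for label, _ in ranked[j:]}

# Naive Bayes forecasts for patient 2490 (cf. Table 1 in Section 2.4):
nb_2490 = {"Appx": 9.36e-5, "Div": 0.01, "Perf": 0.17, "Non-spec": 2.26e-5,
           "Cholis": 0.16, "Intest": 0.46, "Pancr": 0.2, "Renal": 2.17e-7,
           "Dyspep": 2.2e-4}
print(region_prediction(nb_2490, 0.05))  # 95% region: Intest, Pancr, Perf, Cholis
print(region_prediction(nb_2490, 0.01))  # 99% region additionally keeps Div
```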
2.3 Assessing the Quality of Region Predictions
Now that we have specified how region predictions are created, how do we assess their quality? Taking inspiration from [1], informally we want two criteria to be satisfied: (1) that the region predictions are well-calibrated, i.e. the fraction of errors at confidence level 1 − δ should be at most δ, and (2) that the regions should be as narrow
as possible. More formally, let err_i(Γ^(1−δ)(x_i)) be the error incurred by the true label y_i not being in the region prediction of object x_i at confidence level 1 − δ:

err_i(Γ^(1−δ)(x_i)) = 1 if y_i ∉ Γ^(1−δ)(x_i), 0 otherwise.   (2)

To assess the second criterion of uncertainty we simply measure the fraction of total labels that are predicted at the desired confidence level 1 − δ:

unc_i(Γ^(1−δ)(x_i)) = |Γ^(1−δ)(x_i)| / |Y|.   (3)

These terms can then be averaged over all the N test examples:

Err_N^(1−δ) = (1/N) Σ_{i=1}^{N} err_i(Γ^(1−δ)(x_i)),   Unc_N^(1−δ) = (1/N) Σ_{i=1}^{N} unc_i(Γ^(1−δ)(x_i)).   (4)
It is possible to prove that this technique of converting probability forecasts into region predictions (given earlier in Equation 1) is well-calibrated for a learner that outputs Bayes optimal forecasts, as shown below.

Theorem 1. Let P be a probability distribution on Z = X × Y. If a learner's forecasts are Bayes optimal, i.e. P̂(y_i = j | x_i) = P(y_i = j | x_i), ∀j ∈ Y, 1 ≤ i ≤ N, then

P^N [ Err_N^(1−δ) ≥ δ + ε ] ≤ e^(−2ε²N).

This means that if the probability forecasts are Bayes optimal, then even for small finite sequences the confidence region predictor will be well-calibrated with high probability. The proof follows directly from the Hoeffding-Azuma inequality [10], and thus justifies our method for generating confidence region predictions.
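Computed over a test set, the two criteria of Equations 2-4 amount to a few lines; the following is our sketch (names are ours):

```python
# Sketch of Equations 2-4: average region error and average region width
# at a fixed confidence level 1 - delta.  Well-calibrated region
# predictions satisfy err <= delta.

def err_and_unc(regions, truths, n_labels):
    """regions: list of label sets; truths: true labels; n_labels = |Y|."""
    n = len(regions)
    err = sum(y not in r for r, y in zip(regions, truths)) / n
    unc = sum(len(r) for r in regions) / (n * n_labels)
    return err, unc
```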
2.4 An Illustrative Example: Diagnosis of Abdominal Pain
In this example we illustrate our belief that both probability forecasts and region predictions are practically useful. Table 1 shows probability forecasts for all possible disease labels made by the Naive Bayes and DW13-NN learners when tested on the Abdominal Pain dataset from Edinburgh Hospital, UK [5]. For comparison, the predictions made by both learners are given for the same patient examples in the Abdominal Pain dataset. At a glance it is obvious that the predicted probabilities output by the Naive Bayes learner are far more extreme (i.e. very close to 0 or 1) than those output by the DW13-NN learner. Example 1653 (Table 1, rows 1 and 4) shows a patient object which is predicted correctly by the Naive Bayes learner (p̂ = 0.99) and less emphatically by the DW13-NN learner (p̂ = 0.85). Example 2490 demonstrates the problem of over- and under-estimation by the Naive Bayes learner, where a patient is incorrectly diagnosed with Intestinal obstruction (overestimation), yet the true diagnosis of Dyspepsia is ranked 6th with a very low predicted probability of p̂ = 2.2 × 10⁻⁴ (underestimation). In contrast, the DW13-NN learner makes more reliable probability forecasts; for example 2490, the true class is
Table 1. Sample predictions output by the Naive Bayes and DW13-NN learners tested on the Abdominal Pain dataset. The 9 possible class labels for the data (left to right in the table) are: Appendicitis, Diverticulitis, Perforated peptic ulcer, Non-specific abdominal pain, Cholisistitis, Intestinal obstruction, Pancreatitis, Renal colic and Dyspepsia. The probability of the predicted label is marked with *, and the actual class label is marked with ✓.

DW13-NN
Example  Appx.   Div.     Perf.    Non-spec.  Cholis.   Intest.  Pancr.   Renal    Dyspep.
1653     0.0     0.0      0.0      0.03       0.85*✓    0.0      0.01     0.0      0.11
2490     0.0     0.0      0.25     0.0        0.0       0.22     0.04     0.09     0.4*✓
5831     0.53*   0.0      0.005    0.425✓     0.001     0.0      0.0      0.0      0.039

Naive Bayes
Example  Appx.    Div.     Perf.    Non-spec.  Cholis.   Intest.  Pancr.   Renal    Dyspep.
1653     3.08e-9  4.5e-6   3.27e-6  4.37e-5    0.99*✓    4.2e-3   3.38e-3  4.1e-10  1.33e-4
2490     9.36e-5  0.01     0.17     2.26e-5    0.16      0.46*    0.2      2.17e-7  2.2e-4✓
5831     0.969*   2.88e-4  1.7e-13  0.03✓      1.33e-9   2.2e-4   4.0e-11  6.3e-10  7.6e-9
correctly predicted, albeit with a lower predicted probability. Example 5831 demonstrates a situation where both learners encounter an error in their predictions. The Naive Bayes learner gives misleading predicted probabilities of p̂ = 0.969 for the incorrect diagnosis of Appendicitis, and a mere p̂ = 0.03 for the true class label of Non-specific abdominal pain. In contrast, even though the DW13-NN learner incorrectly predicts Appendicitis, it is with far less certainty (p̂ = 0.53), and if the user were to look at all the probability forecasts it would be clear that the true class label should not be ignored, with a predicted probability of p̂ = 0.425. The results below show region predictions output at 95% and 99% confidence levels by the Naive Bayes and DW13-NN learners for the same Abdominal Pain examples tested in Table 1, using the simple conversion technique (cf. Equation 1). As before, the true label is marked with the symbol ✓ if it is contained in the region prediction.
Cholis , Dyspep
Dyspep , Intest, Perf, Renal Appx, Non-spec
Cholis , Dyspep, Non-spec, Pancr
Dyspep , Intest, Perf, Renal, Pancr Appx, Non-spec , Dyspep
Conversion of Naive Bayes forecasts into region predictions: Example # Region at 95% Region at 99% Confidence Confidence 1653 2490 5831
Cholis
Intest, Pancr, Perf, Cholis Appx
Cholis
Intest, Pancr, Perf, Cholis, Div Appx, Non-spec
The DW13-NN learner’s region predictions are wider, with a total of 20 labels included in all region predictions as opposed to 14 for the Naive Bayes learner. However, despite
these narrow predictions, the Naive Bayes learner's region prediction fails to capture the true label for example 2490, and only captures the true label for example 5831 at the 99% confidence level. In contrast, the DW13-NN learner's region predictions successfully capture each true label, even at the 95% confidence level. This highlights what we believe to be the practical benefit of generating well-calibrated region predictions: the ability to capture or dismiss labels with a guarantee of being correct is highly desirable. In our example, narrowing down the possible diagnosis of abdominal pain could save money and patient anxiety in terms of unnecessary tests to clarify the condition of the patient. It is important to stress that when working with probabilistic classification, classification accuracy should not be the only mechanism of assessment. Table 2 shows that the Naive Bayes learner actually has a slightly lower classification error rate of 29.31%, compared with 30.86% for the DW13-NN learner. If other methods of learner assessment had been ignored, the Naive Bayes learner might have been used for prediction by the physician, even though Naive Bayes has less reliable probability forecasts and non-calibrated region predictions.
2.5 The Confidence Region Calibration (CRC) Plot Visualisation
The Confidence Region Calibration (CRC) plot is a simple visualisation of the performance criteria detailed earlier (cf. Equation 4). Continuing with our example, Figure 1 shows CRC plots for the Naive Bayes and DW13-NN learners' probability forecasts on the Abdominal Pain dataset. The CRC plot displays all possible confidence levels on the horizontal axis, versus the total fraction of objects or class labels on the vertical axis. The dashed line represents the number of errors Err_N^(1−δ), and the solid line represents the average region width Unc_N^(1−δ), at each confidence level 1 − δ for all N test examples. Informally, as we increase the confidence level, the errors should decrease and the width of region predictions should increase. The lines drawn on the CRC plot give the reader at-a-glance information about the learner's region predictions. If the error line never deviates above the upper left-to-bottom right diagonal (i.e. Err_N^(1−δ) ≤ δ for all δ ∈ [0, 1]), then the learner's region predictions are well-calibrated. The CRC plot also provides a useful measure of the quality of the probability forecasts: we can check whether regions are well-calibrated by computing the area between the main diagonal and the error line Err_N^(1−δ), and the area under the average region width line Unc_N^(1−δ), using simple trapezium and triangle rule estimation (see Table 2). These deviation areas (also seen beneath Figure 1) enable the user to check the performance of region predictions without any need to inspect the corresponding CRC plot. Figure 1 shows that, in contrast to the DW13-NN learner, the Naive Bayes learner is not well-calibrated, as demonstrated by its large deviation above the left-to-right diagonal. At a 95% confidence level, the Naive Bayes learner makes 18% errors on its confidence region predictions, and not ≤ 5% as is usually required. The second line, corresponding to the average region width Unc_N^(1−δ), shows how useful the learner's region predictions are. The Naive Bayes CRC plot indicates that at 95% confidence, roughly 15% of the 9 possible disease labels (≈ 1.4) are predicted. In contrast, the DW13-NN learner is less certain at this confidence level, predicting 23% of labels (≈ 2.1). However, although the DW13-NN learner's region predictions are slightly wider than those of Naive Bayes, the DW13-NN learner is far closer to being calibrated, which is our
[Figure 1 here: two CRC plots, Naive Bayes (left) and DW13-NN (right). Horizontal axis: Confidence Level 1 − δ; vertical axis: Fraction of Test Objects / Class Labels. Naive Bayes: CRC Err Above 0.0212, CRC Av. Region Width 0.112. DW13-NN: CRC Err Above 3.5E-4, CRC Av. Region Width 0.114.]
Fig. 1. Confidence Region Calibration (CRC) plots of probability forecasts output by Naive Bayes and DW13-NN learners on the Abdominal Pain dataset. The dashed line represents the fraction of errors made on the test set's region predictions at each confidence level. The solid line represents the average region width for each confidence level 1 − δ. If the dashed error line never deviates above the main left-right diagonal then the region predictions are well-calibrated. The deviation areas of both error and region width lines are given beneath each plot.
first priority. Indeed, narrowing down the true disease of a patient from 9 to 2 possible disease class labels, whilst guaranteeing less than a 5% chance of errors in that region prediction, is a desirable goal. In addition, further information could be provided by referring back to the original probability forecasts to achieve a more direct analysis of the likely disease.
3 Experimental Results
We tested the following algorithms as implemented by the WEKA Data Mining System [9]: Bayes Net, Neural Network, Distance Weighted K-Nearest Neighbours (DW K-NN), C4.5 Decision Tree (with Laplace smoothing), Naive Bayes and Pairwise Coupled + Logistic Regression Sequential Minimisation Optimisation (PC+LR SMO), a computationally efficient version of Support Vector Machines. We tested the following 15 multi-class datasets (number of examples, number of attributes, number of classes): Abdominal Pain (6387, 136, 9) from [5], and the other datasets from the UCI data repository [11]: Anneal (898, 39, 5), Glass (214, 10, 6), Hypothyroid (3772, 30, 4), Iris (150, 5, 3), Lymphography (148, 19, 4), Primary Tumour (339, 18, 21), Satellite Image (6435, 37, 6), Segment (2310, 20, 7), Soy Bean (683, 36, 19), Splice (3190, 62, 3), USPS Handwritten Digits (9792, 256, 10), Vehicle (846, 19, 4), Vowel (990, 14, 11) and Waveform (5000, 41, 3). Each experiment was carried out on 5 randomisations of a 66% Training / 34% Test split of the data. Data was normalised, and missing attributes were replaced with mean and mode values for numeric and nominal attributes respectively. Deviation areas for the CRC plots were estimated using 100 equally divided intervals. The standard inner product was used in all SMO experiments.
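A sketch of this preparation step, assuming missing values are encoded as NaN and nominal attributes are integer-coded (the exact normalisation used is not specified in the paper, so min-max scaling is an assumption):

import numpy as np

def preprocess(X, numeric):
    # Impute missing entries with the column mean (numeric attributes) or
    # mode (nominal attributes), then min-max normalise numeric columns.
    X = X.astype(float).copy()
    for j in range(X.shape[1]):
        col = X[:, j]                          # a view: edits write into X
        miss = np.isnan(col)
        if numeric[j]:
            col[miss] = np.nanmean(X[:, j])
            lo, hi = col.min(), col.max()
            if hi > lo:
                col[:] = (col - lo) / (hi - lo)
        else:
            vals, counts = np.unique(col[~miss], return_counts=True)
            col[miss] = vals[np.argmax(counts)]   # most frequent value
    return X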
Table 2. Results of CRC Error Above / CRC Av. Region Width (assessing confidence region predictions) and Classification Error Rate / Square Loss (assessing probability forecasts), computed from the probability forecasts of several learners on 15 multi-class datasets. Experiments that gave well-calibrated confidence region predictions (where CRC Error Above = 0) are highlighted in grey. The best assessment scores of the well-calibrated learners (where available) for each dataset are emboldened.

CRC Plot Error Above / CRC Plot Av. Region Width

Dataset           Naive Bayes      Bayes Net        Neural Net         C4.5              PC+LR SMO        DW K-NN
Abdominal Pain    0.021 / 0.112    - / -            0.027 / 0.112      0.024 / 0.155     0.002 / 0.113    3.5E-4 / 0.114
Anneal            0.004 / 0.167    0.0 / 0.167      0.0 / 0.169        0.0 / 0.176       0.0 / 0.168      0.0 / 0.169
Glass             0.098 / 0.144    0.005 / 0.145    0.025 / 0.145      0.017 / 0.202     0.02 / 0.144     0.004 / 0.146
Hypothyroid       0.0 / 0.250      0.0 / 0.251      1.6E-4 / 0.252     0.0 / 0.253       0.0 / 0.251      0.0 / 0.252
Iris              0.0 / 0.332      1.2E-4 / 0.335   1.6E-4 / 0.335     7.5E-4 / 0.339    0.0 / 0.334      0.0 / 0.335
Lymphography      0.002 / 0.251    2.8E-4 / 0.252   0.006 / 0.252      0.004 / 0.284     0.002 / 0.252    0.0 / 0.251
Primary Tumour    0.146 / 0.048    0.133 / 0.048    0.156 / 0.048      0.219 / 0.098     0.153 / 0.047    0.24 / 0.048
Satellite Image   0.018 / 0.166    - / -            0.002 / 0.168      0.0 / 0.189       0.0 / 0.169      0.0 / 0.169
Segment           0.012 / 0.143    0.0 / 0.143      0.0 / 0.145        0.0 / 0.154       0.0 / 0.145      0.0 / 0.146
Soy Bean          0.001 / 0.053    0.0 / 0.053      5.5E-4 / 0.056     1.7E-4 / 0.1055   2.9E-4 / 0.055   0.0 / 0.057
Splice            0.0 / 0.334      - / -            1.2E-4 / 0.333     0.0 / 0.344       0.001 / 0.334    0.0 / 0.334
USPS              0.018 / 0.099    - / -            0.0 / 0.102        0.0 / 0.121       0.0 / 0.102      0.0 / 0.104
Vehicle           0.083 / 0.251    0.0 / 0.251      0.004 / 0.25142    0.002 / 0.274     0.0 / 0.251      0.0 / 0.252
Vowel             0.006 / 0.093    2.7E-4 / 0.094   0.001 / 0.093      0.002 / 0.153     0.002 / 0.093    0.0 / 0.094
Waveform 5000     0.006 / 0.334    0.0 / 0.334      0.006 / 0.332      0.004 / 0.348     0.0 / 0.335      0.0 / 0.335

Classification Error Rate / Square Loss

Dataset           Naive Bayes       Bayes Net         Neural Net        C4.5              PC+LR SMO         DW K-NN
Abdominal Pain    29.309 / 0.494    - / -             30.484 / 0.528    38.779 / 0.597    26.886 / 0.388    30.859 / 0.438
Anneal            11.929 / 0.219    4.794 / 0.076     1.672 / 0.026     1.895 / 0.051     2.676 / 0.044     2.899 / 0.06
Glass             56.056 / 0.869    33.521 / 0.47     37.747 / 0.58     35.775 / 0.579    40.282 / 0.537    37.465 / 0.509
Hypothyroid       4.423 / 0.075     1.002 / 0.016     5.521 / 0.095     0.461 / 0.009     2.975 / 0.046     6.858 / 0.118
Iris              4.8 / 0.089       6.8 / 0.1         5.6 / 0.089       7.2 / 0.136       4.8 / 0.068       3.2 / 0.057
Lymphography      16.735 / 0.279    15.51 / 0.248     20.0 / 0.346      22.857 / 0.368    22.041 / 0.317    20.408 / 0.29
Primary Tumour    53.982 / 0.692    54.867 / 0.695    59.646 / 0.905    57.699 / 0.855    58.23 / 0.772     53.628 / 0.721
Satellite Image   20.793 / 0.4      - / -             10.839 / 0.19     14.965 / 0.249    14.382 / 0.194    9.604 / 0.138
Segment           19.403 / 0.357    4.364 / 0.071     3.688 / 0.061     4.156 / 0.078     7.273 / 0.107     3.61 / 0.066
Soy Bean          9.604 / 0.163     6.432 / 0.091     7.489 / 0.125     7.753 / 0.263     8.37 / 0.134      9.779 / 0.142
Splice            4.365 / 0.068     - / -             4.774 / 0.079     6.529 / 0.116     7.267 / 0.133     16.764 / 0.269
USPS              19.619 / 0.388    - / -             3.055 / 0.054     11.496 / 0.201    4.679 / 0.077     3.162 / 0.055
Vehicle           53.192 / 0.848    29.787 / 0.389    20.213 / 0.319    27.801 / 0.403    26.312 / 0.363    29.078 / 0.388
Vowel             40.606 / 0.538    21.455 / 0.308    9.152 / 0.155     24.303 / 0.492    34.909 / 0.452    1.697 / 0.034
Waveform 5000     20.12 / 0.342     18.619 / 0.268    16.627 / 0.297    24.658 / 0.407    13.974 / 0.199    19.328 / 0.266
All experiments were carried out on a Pentium 4 1.7GHz PC with 512MB RAM. The values given in Table 2 are the averages computed from each of the 5 repeats of differing randomisations of the data. The standard deviation of these values was negligible, of magnitude between 1E-8 and 1E-12. It is important to note that all results are given with the default parameter settings of each algorithm as specified by WEKA: there was no need for an exhaustive search of parameters to achieve well-calibrated region predictions. Results with the DW K-NN learners are given for the best value of K for each dataset, which are K = 13, 2, 9, 11, 5, 6, 12, 7, 2, 3, 6, 4, 9, 1, 10 from top to bottom as detailed in Table 2. These choices of K perhaps reflect some underlying noise level or complexity in the data, a higher value of K corresponding to a noisier, more complex dataset. Results with Bayesian Belief Networks were unavailable due to memory constraints with the following datasets: Abdominal Pain, Satellite Image, Splice and USPS.
Results which produced well-calibrated region predictions are highlighted in grey, where 38 out of 86 experiments (≈44%) were calibrated. Figure 1 illustrates that despite DW13-NN not being strictly well-calibrated on the Abdominal Pain dataset, with a CRC error above calibration of ≈3.5E-4, in practice this deviation appears negligible. Indeed, if we loosened our definition of well-calibratedness to allow small deviations of order ≈1E-4, then the fraction of well-calibrated experiments would increase to 49 out of 86 (≈57%). All datasets except Abdominal Pain, Glass and Primary Tumour were found to have at least one calibrated learner, and all learners were found to be well-calibrated for at least one dataset. Some datasets, such as Anneal and Segment, were found to have a choice of 5 calibrated learners. In this instance it is important to consult other performance criteria (i.e. region width or classification error rate) to assess the quality of the probability forecasts and decide which learner's predictions to use. Table 2 shows that the Naive Bayes and Bayes Net confidence region predictions are far narrower (≈4% narrower) than those of the other learners; however, this may explain the Naive Bayes learner's poor calibration performance across the data. In contrast, the C4.5 Decision Tree learner has the widest of all region predictions (≈13% wider). The Bayes Net performs best in terms of square loss, whilst the Naive Bayes performs the worst. This reiterates conclusions found in other studies [3], in addition to the results in Section 2.4 where the Naive Bayes learner's probability forecasts were found to be unreliable. Interestingly, the K-NN learners give consistently well-calibrated region predictions across datasets; perhaps this behaviour can be attributed to the K-NN learners' guarantee to converge to the Bayes optimal probability distribution [10].
4 Discussion
We believe that using both probability forecasts and region predictions can give the end-users of such learning systems a wealth of information with which to make effective decisions in cost-sensitive problem domains, such as medical diagnostics. Medical data often contain a lot of noise, and so high classification accuracy is not always possible (as seen here with the Abdominal Pain data; see Table 2). In this instance the use of probability forecasts and region predictions can enable the physician to filter out improbable diagnoses, and so choose the best course of action for each patient with safe knowledge of the likelihood of error. In this study we have presented a simple methodology to convert probability forecasts into confidence region predictions. We were surprised at how successful standard learning techniques were at solving this task. It is interesting to consider why it was not possible to obtain exactly calibrated results with the Abdominal Pain, Glass and Primary Tumour datasets. Further research is being conducted to see if meta-learners such as Bagging [12] and Boosting [13] could be used to improve on these initial results and obtain better calibration for these problem datasets. As previously stated, the Transductive Confidence Machine (TCM) framework has much empirical and theoretical evidence for strong region calibration properties; however, these properties are derived from p-values, which can be misleading if read on their own, as they do not have any direct probabilistic interpretation. Indeed, our results have shown that unreliable probability forecasts can be misleading enough, but with
p-values this confusion is compounded. Our current research is to see whether the exposition of this study could be done in reverse; namely, whether a method could be developed to convert p-values into effective probability forecasts. Another interesting extension of this research would be to investigate the problem of multi-label prediction, where class labels are not mutually exclusive [14]. Possibly an adaptation of the confidence region prediction method could be applied to this problem domain. Much of the current research in multi-label prediction is concentrated in document and image classification; however, we believe that multi-label prediction in medical diagnostics is an equally important avenue of research, particularly relevant in instances where a patient may suffer from more than one illness at once.
Acknowledgements We would like to acknowledge Alex Gammerman and Volodya Vovk for their helpful suggestions and for reviewing our work. This work was supported by EPSRC (grants GR46670, GR/P01311/01), BBSRC (grant B111/B1014428), and MRC (grant S505/65).
References
1. Vovk, V.: On-line Confidence Machines are well-calibrated. In: Proc. of the Forty Third Annual Symposium on Foundations of Computer Science, IEEE Computer Society (2002)
2. Zadrozny, B., Elkan, C.: Transforming Classifier Scores into Accurate Multiclass Probability Estimates. In: Proc. of the 8th ACM SIGKDD, ACM Press (2002) 694–699
3. Lindsay, D., Cox, S.: Improving the Reliability of Decision Tree and Naive Bayes Learners. In: Proc. of the 4th ICDM, IEEE (2004) 459–462
4. Berner, E.S., Webster, G.D., Shugerman, A.A., Jackson, J.R., Algina, J., Baker, A.L., Ball, E.V., Cobbs, C.G., Dennis, V.W., Frenkel, E.P., Hudson, L.D., Mancall, E.L., Rackley, C.E., Taunton, O.D.: Performance of four computer-based diagnostic systems. The New England Journal of Medicine 330 (1994) 1792–1796
5. Gammerman, A., Thatcher, A.R.: Bayesian diagnostic probabilities without assuming independence of symptoms. Yearbook of Medical Informatics (1992) 323–330
6. Berger, J.O., Delampady, M.: Testing precise hypotheses. Statistical Science 2 (1987) 317–335; David R. Cox's comment: pp. 335–336
7. Dawid, A.P.: Calibration-based empirical probability (with discussion). Annals of Statistics 13 (1985) 1251–1285
8. Murphy, A.H.: A New Vector Partition of the Probability Score. Journal of Applied Meteorology 12 (1973) 595–600
9. Witten, I., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco (2000)
10. Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition. Springer, New York (1996)
11. Blake, C., Merz, C.: UCI repository of machine learning databases (1998)
12. Breiman, L.: Bagging predictors. Machine Learning 24 (1996) 123–140
13. Freund, Y., Schapire, R.E.: Game theory, on-line prediction and boosting. In: Proc. 9th Ann. Conf. on Computational Learning Theory, Assoc. of Comp. Machinery (1996) 325–332
14. Boutell, M.R., Luo, J., Shen, X., Brown, C.M.: Learning multi-label scene classification. Pattern Recognition 37 (2004) 1757–1771
Signature Recognition Methods for Identifying Influenza Sequences

Jitimon Keinduangjun (1), Punpiti Piamsa-nga (1), and Yong Poovorawan (2)

(1) Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok, 10900, Thailand
{jitimon.k, punpiti.p}@ku.ac.th
(2) Department of Pediatrics, Faculty of Medicine, Chulalongkorn University, Bangkok, 10400, Thailand
[email protected]
Abstract. One of the most important issues in identifying biological sequences is accuracy; however, given the exponential growth and excessive diversity of biological data, the requirement to compute within a reasonable time usually forces a compromise with accuracy. We propose novel approaches for accurately identifying DNA sequences in shorter time by discovering sequence patterns, or signatures, which carry enough distinctive information for sequence identification. The approaches find the best combination of n-gram patterns and six statistical scoring algorithms, which are regularly used in Information Retrieval research, and then employ the signatures to create a similarity scoring model for identifying the DNA. We propose two approaches to discovering the signatures. In the first, we use only statistical information extracted directly from the sequences. In the second, we additionally use prior knowledge of the DNA in the signature discovery process. From our experiments on influenza virus, we found that: 1) our technique can identify the influenza virus with accuracy of up to 99.69% when 11-grams are used and the prior knowledge is applied; 2) the use of too short or too long signatures produces lower efficiency; and 3) most scoring algorithms are good for identification except the "Rocchio algorithm", whose results are approximately 9% lower than the others. Moreover, this technique can be applied to identifying other organisms.
1 Introduction
The rapid growth of genomic and sequencing technologies over the past decades has generated an incredibly large volume of diverse genome data, such as DNA and protein sequences. However, the biological sequences mostly carry very little known meaning, and the knowledge they contain still needs to be discovered. Techniques for knowledge acquisition from biological sequences are therefore becoming more important for transforming this diverse and huge mass of data into useful, concise and compact information. These techniques generally consume long computation times; their accuracy usually depends on data size; and there is no best solution known for
any particular circumstances. Many research projects on biological sequence processing still share some common stages in their experiments. One important research topic in computational biology is sequence identification. Ideally, the success of research in sequence identification depends on finding some short informative data (signatures) that can efficiently identify the type of a sequence. Signature recognition is useful for reducing computation time, since the data are more compact and more precise [4]. The existing techniques of biological identification, which require long computation time, can be categorized into three groups:
Inductive Learning: this technique takes learning tools, such as neural networks and Bayesian learners, to derive rules that are used to identify biological sequences. For instance, gene identification uses a neural network [14] to derive a rule that decides whether a sequence is a gene or not. The identification generally uses small contents, such as frequencies of two or three bases, instead of plain sequences as inputs to the learning process. Although applying the rule derived from the learning process takes little time, the learning procedure used as a pre-process of these systems still consumes too much computation time to succeed in all target concepts.
Sequence Alignment: this technique, exemplified by BLAST [7] and FASTA [10], aligns uncharacterized sequences with the existing sequences in a genome database and then assigns each uncharacterized sequence to the same class as the database sequence that yields the best alignment score. The processing time of this alignment technique is much higher than that of other techniques, since the process has to operate directly on the sequences in the database, whose sizes are usually huge.
Consensus Discovery: this technique takes multiple alignments [13] as a pre-process for finding a "consensus" sequence, which is used to derive a rule. The rule is then used to identify uncharacterized sequences. The consensus sequence is nearly a signature, but it is created by taking the majority bases in the multiple alignments of a sequence collection. Because of the very high computational cost of multiple alignment, the consensus sequence is not widely used.
Our approaches construct the identifiers in the same way as the pre-processing in inductive learning and multiple alignment. However, the proposed approaches have much lower pre-processing time than the learning technique and the multiple alignment technique. Firstly, we apply the n-gram method to transform DNA sequences into a pattern set. Then, signatures are discovered from the pattern set using statistical scoring algorithms and used to create identifiers. Finally, the performance of the identifiers is estimated by applying them to influenza virus sequences. Our identification framework is described in Section 2. Section 3 describes the pattern generation process for DNA sequences. Section 4 gives details of the scoring algorithms used to discover the signatures. Section 5 describes how to construct the identifiers from the discovered signatures. Section 6 gives details of how the performance of the identifiers is estimated. Finally, we summarize our techniques in Section 7.
2 Identification Framework
Identification is a challenge in Bioinformatics, since biological sequences are zero-knowledge data, unlike the text data of human languages. Biological data
are totally different from text data in that text data basically uses words as its data representation, while we do not know any "words" of the biological data. Therefore, a process for generating the data representation is needed. Our common identification framework for biological sequences is depicted in Fig. 1:
I: Pre-process transforms the training data into a pattern set that serves as the data representation of the biological sequences. In this research, we create the training data from influenza and other virus sequences and use them to generate n-gram patterns.
II: Construction is the process of creating identifiers using the n-gram patterns and scoring algorithms. The model construction is divided into two consecutive parts: Signature Discovery and Identifier Construction. Signature Discovery measures the significance of each pattern, called its "score", using various scoring functions, and selects the set of the k highest-score patterns as our "signatures". We then pass the signatures through Identifier Construction, which uses the signatures to formulate a sequence similarity scoring function. Given a query sequence, the similarity scoring function detects whether it is the target (influenza, for our experiments) or not.
III: Evaluation estimates the performance of the identifiers using a set of test data. The performance estimator compares the accuracy of all identifiers to find the best one for identifying unseen data, i.e., instances that are not members of the training or test sets.
IV: Usage applies the best identifier to the unseen data, predicting a value for an actual data item and thereby identifying a new object.
Pattern Generation Pattern Set
III: Evaluation
II: Construction Scoring Algorithms
Signature Discovery
Test Data
Identifier Construction Signatures
Identifiers
Performance Evaluation
IV: Usage Unseen Data
Identification
DNA Type
Identifier
Fig. 1. Common identification framework
3 Pattern Generation
Finding a data representation of DNA sequences is performed in a pre-process of identification using training data. The data representation, called a pattern, is a substring of the DNA sequences. DNA sequences are built from the four symbols {A, C, G, T}; therefore, members of the patterns are also drawn from these four symbols. Let Y be a sequence y_1 y_2 ... y_M of length M over the four DNA symbols. A substring t_1 t_2 ... t_n of length n is called an n-gram pattern [3]. The n-gram methodology generates the patterns for representing the DNA sequences using different values of n. There are M − n + 1 patterns in a sequence of length M when generating the n-gram patterns, but there are only 4^n possible distinct patterns for any value of n. For example, let Y be the 10-base-long sequence ATCGATCGAT. Then, the number of 6-gram patterns generated from the sequence of length 10 is 5 (10 − 6 + 1):
ATCGAT, TCGATC, CGATCG, GATCGA, ATCGAT. There are 4,096 (4^6) possible patterns; however, only 4 distinct patterns occur in this example: ATCGAT (2), TCGATC (1), CGATCG (1) and GATCGA (1), where one pattern occurs twice. Our experiments found that the use of too long patterns (high values of n) produces poor identifiers. Notice that 3-gram patterns have 64 (4^3) possible patterns, while 24-gram patterns have 2.8 × 10^14 (4^24) possible patterns; the number of possible patterns thus varies from 64 to 2.8 × 10^14. Very high values of n may not be useful because each generated pattern does not occur repeatedly, and the scoring algorithms then cannot discover signatures from the set of n-gram patterns. In our experiments, the pattern sets are generated solely from 3- to 24-grams, since higher values of n do not yield any improvement in performance. Discovering the signatures from these n-gram patterns uses the scoring algorithms discussed next.
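A minimal Python sketch of this pattern generation (function names are ours) reproduces the worked example above:

from collections import Counter

def ngram_patterns(sequence, n):
    # All M - n + 1 overlapping n-gram patterns of a DNA sequence.
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

counts = Counter(ngram_patterns("ATCGATCGAT", 6))
# Counter({'ATCGAT': 2, 'TCGATC': 1, 'CGATCG': 1, 'GATCGA': 1})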
4 Signature Discovery
The signature discovery is a process of finding signatures from the pattern sets generated by the n-gram method. We propose statistical scoring algorithms that evaluate the significance of each n-gram pattern, so as to select the best ones to be our signatures. The DNA signature discovery uses two different datasets: target and non-target DNA sequences. Section 4.1 gives details of the two datasets. The scoring algorithms are presented in Section 4.2. Section 4.3 describes the signature discovery process.
4.1 Datasets
Our datasets are selected from the Genbank genome database of the National Center for Biotechnology Information (NCBI) [2]. In this research, we use influenza virus as the target dataset of identification and other, non-influenza viruses as the non-target dataset. The structure of the influenza virus is distinctly separated into eight segments. Each segment has different genes. The approximate length of the complete sequences of each gene is between 800 and 2,400 bases. We randomly picked 120 complete sequences of each gene of the influenza virus and inserted them into the target dataset. The total number of sequences in the target dataset, which gathers the eight genes, is 960 (120 × 8). For the non-target dataset, we randomly picked 960 sequences from other virus sequences, which are not influenza. Each dataset is divided into two sets, a training set and a test set. One-third of each dataset (320 sequences) is used for the test set and the rest (640 sequences) for the training set. The training set is used for discovering signatures, and the test set is used for validation and performance evaluation of the discovered signatures.
4.2 Scoring Algorithms
The scoring algorithms are well-known metrics in Information Retrieval research [12]. We evaluated the statistical efficiency of six scoring algorithms in their discovery of signatures for identifying biological sequences. The six statistical scoring algorithms fall into two groups, as follows:
Group I: Based on common patterns, such as Term Frequency (TF) [1] and Rocchio (TF-IDF) [5].
Group II: Based on class distribution, such as the DIA association factor (Z) [11], Mutual Information (MI) [8], Cross Entropy (CE) [8] and Information Gain (IG) [11].
The scoring algorithms based on common patterns (Group I) are the simplest techniques for the pattern score computation. They compute the score of a pattern independently, from the frequencies of that pattern alone, whereas the scoring algorithms based on class distribution (Group II) compute the score of each pattern from the frequency of co-occurrence of a pattern and a class. Each class of identification comprises sequences that are unified to form the same type. Most scoring algorithms based on class distribution consider only the presences (p) of patterns in a class, except Information Gain, which considers both the presences and absences (p̄) of patterns in a class [9]. Like the scoring algorithms based on common patterns, they compute the score of each pattern independently. The scoring measures are formulated as follows:

TF(p) = Freq(p).    (1)

TF-IDF(p) = TF(p) · log(|D| / DF(p)).    (2)

Z(p) = P(C_i | p).    (3)

MI(p) = Σ_i P(C_i) log [P(p | C_i) / P(p)].    (4)

CE(p) = P(p) Σ_i P(C_i | p) log [P(C_i | p) / P(C_i)].    (5)

IG(p) = P(p) Σ_i P(C_i | p) log [P(C_i | p) / P(C_i)] + P(p̄) Σ_i P(C_i | p̄) log [P(C_i | p̄) / P(C_i)].    (6)
where Freq(p) is the Term Frequency (the number of times that pattern p occurred); IDF is the Inverse Document Frequency; DF(p) is the Document Frequency (the number of sequences in which pattern p occurs at least once); |D| is the total number of sequences; P(p) is the probability that pattern p occurred; p̄ means that pattern p does not occur; P(C_i) is the probability of the i-th class value; P(C_i | p) is the conditional probability of the i-th class value given that pattern p occurred; and P(p | C_i) is the conditional probability of pattern occurrence given the i-th class value.
4.3 Signature Discovery Process
The signature discovery process finds DNA signatures by measuring the significance (score) of each n-gram pattern of the DNA sequences, sorting the patterns by their scores, and selecting the first k highest-score patterns to be the "signatures". We use heuristics to select the optimal number of signatures; following our previous experiments [6], the identifiers achieve the highest efficiency when the number of signatures is ten. In this research, we discover the signatures using two different discovery methods, namely Discovery I and Discovery II; the frequency-based scores of Group I are illustrated in the sketch below.
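For concreteness, here is a minimal sketch of how the two Group I scores of equations (1) and (2) might be computed over a collection of pattern lists, one list per training sequence (names are ours; the Group II scores follow the same shape using the probabilities defined above):

import math
from collections import Counter

def tf_scores(pattern_lists):
    # Equation (1): TF(p) = Freq(p), total occurrences across all sequences.
    freq = Counter()
    for patterns in pattern_lists:
        freq.update(patterns)
    return freq

def tf_idf_scores(pattern_lists):
    # Equation (2): TF-IDF(p) = TF(p) * log(|D| / DF(p)).
    tf = tf_scores(pattern_lists)
    df = Counter()
    for patterns in pattern_lists:
        df.update(set(patterns))          # DF counts each sequence once
    D = len(pattern_lists)
    return {p: tf[p] * math.log(D / df[p]) for p in tf}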
Discovery I: we divided the training data into two classes, influenza virus and non-influenza virus. The score of each pattern of the influenza sequences is computed using the statistical scoring algorithms, and the first k highest-score patterns are selected to be the signatures. For instance, let Pattern P represent a set of m n-gram patterns p_1, p_2, ..., p_m generated from the influenza virus sequences. The score of each pattern p_l, denoted score(p_l), is the significance of the pattern as evaluated by a pattern scoring measure. Let Signature S be a set of signatures s_1, s_2, ..., s_k, where each signature s_j is selected from the first k highest-score patterns. The Pattern P and the Signature S are defined by the following equations:

P = {∀l p_l | score(p_{l-1}) ≤ score(p_l) ∧ l = 1, 2, ..., m}.    (7)

S = {∀j s_j | s_j = p_{m-k+j} ∧ 1 ≤ j ≤ k}.    (8)
Discovery II: we employed some prior knowledge of the characteristics of the target dataset for the signature discovery. In this research, we use the characteristic that the influenza virus is distinctly separated into eight segments. The training data of both influenza and non-influenza virus are divided into eight sub-classes, where each sub-class of influenza virus has a different type of segment. Each pair of influenza and non-influenza sub-classes generates a signature subset, and the eight signature subsets are gathered into the same set as the full set of signatures of the influenza virus. For instance, there are eight sub-classes of both the influenza and non-influenza datasets, generating eight subsets of separate signatures. Let Pattern P_i represent a set of m n-gram patterns p_1, p_2, ..., p_m generated from sub-class i of the influenza sequences. The score of each pattern p_l, score(p_l), is the significance of the pattern, evaluated by a scoring algorithm. Let Signature S_i be subset i of signatures s_1, s_2, ..., s_k, where each signature s_j is selected from the first k highest-score patterns. Each subset of Pattern P_i and Signature S_i is defined by

P_i = {∀l p_l | score(p_{l-1}) ≤ score(p_l) ∧ l = 1, 2, ..., m}.    (9)

S_i = {∀j s_j | s_j = p_{m-k+j} ∧ 1 ≤ j ≤ k}.    (10)

The DNA signatures of the influenza virus, gathering the eight signature subsets and denoted Signature S, are defined to be

S = {S_1, S_2, S_3, S_4, S_5, S_6, S_7, S_8}.    (11)
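Both discovery methods then reduce to ranking and truncation, as in this sketch (shown with a frequency-based scorer such as the one above; the class-distribution scorers would additionally take the non-influenza class into account):

def discover_signatures(pattern_lists, scorer, k=10):
    # Discovery I, equations (7)-(8): rank all patterns of the target class
    # by score and keep the k highest-scoring ones as the signatures.
    scores = scorer(pattern_lists)
    ranked = sorted(scores, key=scores.get)     # ascending order, as in (7)
    return set(ranked[-k:])                     # top-k tail, as in (8)

def discover_signatures_ii(segments, scorer, k=10):
    # Discovery II, equations (9)-(11): one signature subset per influenza
    # segment sub-class, gathered into a single signature set.
    signatures = set()
    for pattern_lists in segments:              # the eight sub-classes
        signatures |= discover_signatures(pattern_lists, scorer, k)
    return signatures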
5 Identifier Construction
The identifier construction process uses the signatures to formulate a similarity scoring function. When there is a query sequence, the scoring function is used to measure a similarity value, called the SimScore, between the query sequence and the target sequences. If the SimScore is higher than a threshold, the query is identified as a member of the target sequences. We use heuristics to select the optimal threshold; in our experiments, we found that the identifiers achieve the highest efficiency when the threshold is zero. Assume that we pass a query sequence through the identification system; the system then performs the following four steps:
Step I: Set the SimScore to zero.
Step II: Generate the patterns of the query sequence according to the n-gram method, using the same value of n as the one used in the signature generation process.
Step III: Match the patterns of the query sequence against the signatures of the target sequences. Each time a match occurs, one point is added to the SimScore.
Step IV: Identify the query sequence as the same type as the target sequences if the SimScore is more than the threshold.
For example, let X be a query sequence of length m, and let p_{x_1}, p_{x_2}, ..., p_{x_d} be the set of its d (= m − n + 1) n-gram patterns. Let Y be a type of target sequences and s_{y_1}, s_{y_2}, ..., s_{y_e} the set of its e signatures. Then sim(p_x, s_y) is the similarity score of a pattern p_x and a signature s_y, where sim(p_x, s_y) is 1 if p_x is identical to s_y and 0 otherwise. The similarity score between a query sequence X and a type of target sequences Y, denoted SimScore(X, Y), is the summation of the similarity scores over every pattern p_x and every signature s_y:

SimScore(X, Y) = Σ_{1 ≤ i ≤ d, 1 ≤ j ≤ e} sim(p_{x_i}, s_{y_j}).    (12)
If SimScore(X, Y) > threshold, the query sequence X is identified as the same type as the target sequences Y.
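The complete identification step then amounts to counting signature matches among the query's n-grams, e.g. (a sketch under the same conventions as above):

def sim_score(query, signatures, n):
    # Equation (12): number of matches between the query's n-gram patterns
    # and the target signatures.
    patterns = (query[i:i + n] for i in range(len(query) - n + 1))
    return sum(1 for p in patterns if p in signatures)

def identify(query, signatures, n, threshold=0):
    # Steps I-IV: label the query as the target type (influenza) when its
    # SimScore exceeds the threshold.
    return sim_score(query, signatures, n) > threshold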
6 Performance Evaluation
The evaluation process assesses the performance of the identifiers over the set of test data. We use Accuracy [12] as the performance estimator for evaluating the common strategies. Accuracy is the ratio of the number of target sequences correctly identified plus the number of non-target sequences correctly identified to the total number of test sequences. Accuracy is usually expressed as a percentage:

Accuracy = (TP + TN) / (TP + FP + TN + FN) × 100%.    (13)
where True Positive (TP) is the number of target sequences correctly identified; False Positive (FP) is the number of target sequences incorrectly identified; True Negative (TN) is the number of non-target sequences correctly identified; and False Negative (FN) is the number of non-target sequences incorrectly identified. In our experiments, we evaluate the performance of the two identifiers constructed by the two methods of signature discovery described earlier: Identifier I is constructed using signatures generated by Discovery I, and Identifier II by Discovery II. First, we evaluate the performance of Identifier I using different n-grams, where n ranges from 3 to 24, and the six different scoring algorithms. Each pair of an n-gram and a scoring algorithm generates different signatures for the identifier construction. The results for Identifier I are depicted in Fig. 2. The figure shows that most identifiers have low efficiency, with accuracy below 65%. Only a few identifiers are successful, with accuracy over 80%, namely the identifiers constructed using 7- to 9-gram signatures and almost all scoring algorithms, except the Rocchio algorithm (TF-IDF). The accuracy when using the Rocchio algorithm is approximately 9% lower than with the others.
[Figure 2 here: accuracy of Identifier I plotted against n-gram size (3 to 24), one line per scoring algorithm (TF, TF-IDF, Z, CE, MI, IG); vertical axis: Accuracy (40 to 100).]
Fig. 2. Accuracy of Identifier I using different n-grams and different scoring algorithms
In Fig. 3, we compare in the same way the performance of Identifier II for different n-grams and scoring algorithms. As in the experimental results for Identifier I, at too low and too high n-gram sizes the performance of Identifier II is at its lowest, with accuracy below 65%, and the Rocchio algorithm again produces 9% lower accuracy than the other algorithms. In contrast to Identifier I, however, most experiments achieve 95% accuracy on average. Moreover, the peak accuracy of Identifier II reaches 99.69%, at 11-grams, for several scoring algorithms.
[Figure 3 here: accuracy of Identifier II plotted against n-gram size (3 to 24), one line per scoring algorithm (TF, TF-IDF, Z, CE, MI, IG); vertical axis: Accuracy (40 to 100).]
Fig. 3. Accuracy of Identifier II using different n-grams and different scoring algorithms
Considering all the experimental results, the models based on class distribution, such as Z, CE, MI and IG, achieve good results, while of the models based on common patterns, TF and TF-IDF, only TF performs well. The high efficiency of TF indicates that the high-frequency patterns selected to be the signatures are seen only in the target class, the influenza dataset, and not in the non-influenza dataset. For the TF-IDF formula, TF(p) · log(|D| / DF(p)): although the frequency of a pattern, TF(p), is high, the score of the pattern computed by TF-IDF may be reduced if log(|D| / DF(p)) is low. The term log(|D| / DF(p)) is low when the number of sequences in which the pattern occurs at least once is high. As discussed above, the high-frequency patterns selected to be signatures are seen in sequences of the target class only; therefore, a large number of sequences in which a pattern occurs at least once should also produce a high score. Nevertheless, the Rocchio algorithm reduces the scores of these patterns, and the high-score patterns it selects are unlikely to be signatures good enough for DNA sequence identification. This is why the use of Rocchio achieves lower performance than the use of the other algorithms.
Finally, we discuss the four components of Accuracy: TP, FP, TN and FN. Table 1 shows that, between 7- and 21-grams, the accuracy of Identifiers I and II depends on two values only, TP and FP, because the other values are quite stable: TN is maximal and FN is zero. Notice that the identification of non-target sequences is successful, since no non-target sequences are identified as influenza virus. Identifier I produces 70% accuracy on average, whereas Identifier II is higher, with 90% accuracy on average. Next, we found that the performance of both identifiers is very low when the identifiers are constructed using too short or too long n-gram signatures, i.e., shorter than 7 or longer than 21. For n-grams shorter than 7, both identifiers achieve accuracy below 65%: both TP and FN are nearly maximal, while FP and TN are nearly zero. The shorter signatures of the influenza virus cannot distinguish any sequences because they are found in almost all sequences; almost all query sequences score higher than the threshold and are therefore identified as influenza virus, which is why both TP and FN are nearly maximal. For n-grams longer than 21, both identifiers also achieve accuracy below 65%. The longer signatures of the influenza virus cannot be found in almost any sequences, including the influenza sequences; most query sequences score lower than the threshold and are incorrectly identified. Therefore, TP and FN are nearly zero, and FP and TN are maximal.
Table 1. Comparing Accuracy, True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN) between Identifiers I and II using different n-grams
Identifiers   Grams   Accuracy   TP       FP       TN     FN
I             7-21    70% avg.   varies   varies   max    0
II            7-21    90% avg.   varies   varies   max    0
I & II        <7      <65%       ~max     ~0       ~0     ~max
I & II        >21     <65%       ~0       ~max     ~max   ~0
7 Conclusion
Devising an intelligent, low-computation method is one of the most significant tasks in identification, since existing research projects apply high-computation methods, such as those based on alignments, to accomplish the target concept of identification. We propose novel approaches for accurately identifying DNA sequences in shorter time. The approaches apply an n-gram method and statistical scoring algorithms to discover compact, informative data called "signatures", which serve as inputs to the identification system and reduce the heavy computation. The experimental results showed that identifiers constructed using almost all the scoring algorithms achieve good results, except for the Rocchio algorithm, which produces approximately 9% lower accuracy than the others. The use of either too short or too long n-gram signatures yields poor efficiency, while the other n-gram signatures yield better efficiency. Additionally, the peak accuracy reaches 99.69%. In conclusion, our approach based on the n-gram method, the statistical scoring algorithms and the addition of prior knowledge about the target DNA succeeds in the identifier construction.
Acknowledgements We would like to thank the precious support provided by Kasetsart University Research and Development Institute (KURDI). We also thank Thailand Research Fund (Senior Research Scholar (Y.P.)) and Center of Excellence, Viral Hepatitis Research Unit, Chulalongkorn University. Finally, we thank Venerable Dr. Mettanando Bhikkhu for reviewing the manuscript.
References
1. Aalbersberg, I.: A Document Retrieval Model Based on Term Frequency Ranks. Proceedings of the 7th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1994) 163-172
2. Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Rapp, B.A., Wheeler, D.L.: GenBank. Nucleic Acids Research 28(1) (2000) 15-18
3. Brown, P.F., de Souza, P.V., Della Pietra, V.J., Mercer, R.L.: Class-Based N-Gram Models of Natural Language. Computational Linguistics 18(4) (1992) 467-479
4. Chuzhanova, N.A., Jones, A.J., Margetts, S.: Feature Selection for Genetic Sequence Classification. Bioinformatics Journal 14(2) (1998) 139-143
5. Joachims, T.: A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. Proceedings of the 14th International Conference on Machine Learning (1997) 143-151
6. Keinduangjun, J., Piamsa-nga, P., Poovorawan, Y.: Models for Discovering Signatures in DNA Sequences. Proceedings of the 3rd IASTED International Conference on Biomedical Engineering, Innsbruck, Austria (2005) 548-553
7. Krauthammer, M., Rzhetsky, A., Morozov, P., Friedman, C.: Using BLAST for Identifying Gene and Protein Names in Journal Articles. Gene 259(1-2) (2000) 245-252
8. Mladenic, D., Grobelnik, M.: Feature Selection for Classification Based on Text Hierarchy. In: Working Notes of Learning from Text and the Web, Conference on Automated Learning and Discovery, Carnegie Mellon University, Pittsburgh (1998)
9. Mladenic, D., Grobelnik, M.: Feature Selection for Unbalanced Class Distribution and Naïve Bayes. Proceedings of the 16th International Conference on Machine Learning (1999) 258-267
10. Pearson, W.R.: Using the FASTA Program to Search Protein and DNA Sequence Databases. Methods in Molecular Biology 25 (1994) 365-389
11. Sebastiani, F.: Machine Learning in Automated Text Categorization. ACM Computing Surveys 34(1) (2002) 1-47
12. Spitters, M.: Comparing Feature Sets for Learning Text Categorization. Proceedings of RIAO (2000)
13. Wang, J.T.L., Rozen, S., Shapiro, B.A., Shasha, D., Wang, Z., Yin, M.: New Techniques for DNA Sequence Classification. Journal of Computational Biology 6(2) (1999) 209-218
14. Xu, Y., Mural, R., Einstein, J., Shah, M., Uberbacher, E.: Grail: A Multiagent Neural Network System for Gene Identification. Proceedings of IEEE 84(10) (1996) 1544-1552
Conquering the Curse of Dimensionality in Gene Expression Cancer Diagnosis: Tough Problem, Simple Models

Minca Mramor (1), Gregor Leban (1), Janez Demšar (1), and Blaž Zupan (1,2)

(1) Faculty of Computer and Information Science, University of Ljubljana, Tržaška 25, Ljubljana, Slovenia
(2) Department of Molecular and Human Genetics, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX 77030, U.S.A.
Abstract. In this paper we study the properties of cancer gene expression data sets from the perspective of classification and tumor diagnosis. Our findings and case studies are based on several recently published data sets. We find that these data sets typically include a subset of about 100 highly discriminating features whose predictive power can be further enhanced by exploring their interactions. This finding speaks against the often-used univariate feature selection methods, and may explain the superior performance of support vector machines reported in recent related work. We argue that a much simpler technique that directly finds visualizations with a clear separation of diagnostic classes may be used instead. Furthermore, it may perform better in the inference of an understandable classifier that includes only a few relevant features.
1 Introduction
Carcinogenesis is a multi-step process in which genetic alterations drive the progressive transformation of normal human cells into malignant derivates. Gene expression microarrays can be used to identify specific genes that are differentially expressed across different tumor types. Classification of clinically heterogeneous cancers using gene expression profiles is an emerging application of microarray technology. Several recent studies of different cancer types, including acute leukemias [1], lymphomas [2], and brain tumors [3], have already demonstrated the utility of gene expression profiles for cancer classification, and reported superior classification performance when compared to standard morphological criteria. More accurate classifications based on molecular phenotype can improve and individualize treatment, influence the development of targeted therapeutics, and help in the identification of biomarkers for diagnosis and prognosis. From the viewpoint of inference of classification models, gene expression data sets are rather peculiar. They most often include thousands of attributes (genes) and a small number of examples (patients). Another problem is a substantial
component of noise, resulting from numerous sources of variation affecting the expression measurements. In conquering the curse of dimensionality, the prevailing modelling approaches include gene filtering in the data preprocessing phase and gene subset selection, often coupled with a modelling technique. For instance, in the work reported by Khan et al. [4] on classifying four types of childhood tumors, the authors first ignored genes with low expression values throughout the data set, then trained 3750 feed-forward neural networks on different subsets of genes as determined by principal component analysis, analyzed the resulting networks for the most informative genes, and in this way obtained a subset of 96 genes whose expression clearly separated the different cancer types when using multi-dimensional scaling. Other approaches, often similar in the complexity of the data analysis procedure, include k-nearest neighbors with weighted voting of informative genes [1] and support vector machines (SVM) [5, 6, 7]. In most cases, the resulting prediction models involve complex computation over a set of gene expressions (e.g., neural networks, SVM, or principal components models), are hard to interpret, and cannot be communicated to the domain experts in a way that would easily reveal the role genes play in separating different cancer types. While different approaches have been used for selecting marker genes [5, 1, 4, 8] (for a review see [9]), the prevalent approach is based on univariate studies which examine the value of a gene in the absence of context, that is, disregarding the expressions of the other observed genes. Such an approach, most often used in practice, is for instance the signal-to-noise statistic [1]. Since genes interact, one would expect that more information can be gained by observing a set of genes as a group, rather than by summing their individual effects. This is confirmed experimentally through the success of non-linear modelling methods that can account for gene interactions [5]. In this respect, the use of univariate scoring for gene subset selection is questionable. In this paper, we provide a simple alternative to the rather complex gene expression diagnostic modelling approaches mentioned above. Namely, we show that a subset of two to five genes can most often provide sufficient information to clearly separate the diagnostic classes when their expression data is visualized in either scatterplot (two genes) or radviz (three or more genes) planar geometric graphs. The paper first introduces the gene scoring, visualization, and visual projection search methods we apply in our studies. Our experimental study, together with the data sets used, is described next. The paper finishes with a discussion of the results and concluding remarks.
2 Methods
2.1 Gene Ranking
In this paper, we use two in essence very different methods for gene ranking. Signal-to-noise (S2N) [1] is a univariate statistic for scoring attributes that is derived from the standard parametric t-test statistic and is computed as (µ_0 − µ_1) / (σ_0 + σ_1), where µ and σ represent the mean and standard deviation of the gene's expression,
respectively, for each class. To score genes in multi-class problems, we take the data for each pair of class values, compute the statistic, and then average it across all possible class value pairs. ReliefF [10, 11] is a feature scoring function that is, in principle, sensitive to feature interactions: it is able to detect features that may not provide much information on their own, but can be very useful when used together with some other features. ReliefF scores features according to how well their values distinguish among instances that are similar to each other. Since the similarity is computed based on all features in the data set, these features define the context for each feature's score, thus providing grounds for revealing the interactions.
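A sketch of the S2N computation for one gene, including the averaging over class pairs (NumPy; names are ours):

import numpy as np
from itertools import combinations

def s2n(expr, labels):
    # Average the two-class score (mu_a - mu_b) / (sigma_a + sigma_b) of one
    # gene's expression vector over all pairs of class values.
    pair_scores = []
    for a, b in combinations(np.unique(labels), 2):
        ga, gb = expr[labels == a], expr[labels == b]
        pair_scores.append((ga.mean() - gb.mean()) / (ga.std() + gb.std()))
    return np.mean(pair_scores)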
2.2 Two-Dimensional Geometric Visualization Methods
In this paper we propose to visualize gene expression data by plotting the examples from the data sets in two-dimensional graphs. Depending on how many genes we use in the plot, we draw either a scatterplot (two genes) or a radviz plot (three or more genes). By selecting a suitable set of genes for the plots (see the next section), we aim for either of the two visualization methods to provide a clear visual separation of the diagnostic classes. For a scatterplot, an example of such a graph is given in Figure 1.a. Figure 1.b shows a radviz plot with five genes, represented as anchors that are equally spaced around the perimeter of a unit circle. The examples are visualized as points inside the unit circle, where their position depends on the gene expression values: the higher the value for a gene, the more the corresponding anchor attracts the point. Finding the attraction equilibrium for a visualized set of genes determines the placement of each example in the data set [12]. Examples with approximately equal expressions of genes that lie on opposite sides of the circle will lie close to the center of the circle. On the other hand, if the expression of a single gene in the visualized set prevails, the point will lie close to the corresponding anchor. A radviz projection is defined by the gene subset being visualized and by the placement of the gene anchors. While enabling the visualization of several genes in a single graph, radviz has some deficiencies. For instance, placing two highly correlated genes that are good at discriminating between classes on opposite sides of the circle will make them useless in the visualization, since their joint effect will be cancelled out; yet they might generate a projection with well separated classes if their anchors are placed adjacently. The "correct" placement of feature anchors was, for instance, crucial for the nice separation of classes in Figure 1.b, where the two anchors (genes) at the top of the circle (SET and CD19) attract data points from the ALL class and the genes APLP2 and LTC4S attract points with the AML class value.
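The attraction equilibrium described above has a simple closed form: each point is the weighted average of the anchor positions, with weights proportional to the example's gene expressions. A sketch, assuming the expressions have already been rescaled to [0, 1]:

import numpy as np

def radviz_positions(X):
    # Positions of the examples in a radviz plot; X is an examples-by-genes
    # matrix, and the k gene anchors are spaced evenly on the unit circle.
    n, k = X.shape
    angles = 2.0 * np.pi * np.arange(k) / k
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])   # k x 2
    weights = X / X.sum(axis=1, keepdims=True)   # normalised attractions
    return weights @ anchors                     # n x 2 points in the circle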
2.3 Visualization Scoring and Ranking and Projection Search
In Figure 1 we have shown two data visualizations that utilize only a small number of attributes (genes) but provide a good separation of diagnostic classes. Cancer microarray data includes thousands of genes, and it is therefore not trivial to find a useful projection, that is, a subset of genes to visualize. In fact, we have
observed that, for a typical cancer data set, there are only a few among millions of possible projections that exhibit a clear separation of diagnostic classes, making a manual search for a good visualization impossible. To automate the search for a good projection, we first need to define how to evaluate projections. The algorithm we use, VizRank [13], scores a particular projection by training a k-nearest neighbor classifier on two attributes, the x and y positions of the points in the projection. The classification accuracy of the classifier is then assessed using 10-fold cross validation and provides the projection score. When the classes in a projection are well separated, the classification accuracy of the k-NN classifier will be high and the projection will be highly ranked. In projections where some points from different classes overlap, the accuracy of the classifier, and with it the value of the projection, will be accordingly lower.
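The projection scoring can be sketched in a few lines (using scikit-learn for the k-NN classifier and the cross-validation; the neighbour count and the use of plain accuracy rather than the average probability of correct classification are our simplifications of VizRank [13]):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def projection_score(xy, labels, neighbours=10):
    # Score one projection: 10-fold cross-validated performance of a k-NN
    # classifier trained on the two screen coordinates xy of the examples.
    knn = KNeighborsClassifier(n_neighbors=neighbours)
    return cross_val_score(knn, xy, labels, cv=10).mean()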
[Figure 1 here: (a) scatterplot of the leukemia examples over genes APLP2 (horizontal axis) and TCF (vertical axis); (b) radviz plot with five gene anchors (CD19, SET, LTC4S, PARG, APLP2); classes ALL and AML.]
Fig. 1. Two projections for the leukemia data set (see Section 3.3)
Although VizRank's visualization scoring can be very efficient (more than 2000 projections can be evaluated per minute on a 2.4GHz computer), evaluating all projections in data sets with several thousands of attributes is not feasible. Instead, VizRank uses an efficient heuristic that first scores the attributes using ReliefF [11], ranks the subsets of attributes to be considered by the sum of their ReliefF scores, and evaluates the projections starting with the most likely candidates according to this heuristic. Our experiments show that by using this heuristic only a few percent of the possible projections have to be assessed in order to find the most interesting ones.
3 Experimental Study
3.1 Data Sets
For the experimental analysis reported in this paper we use five publicly available data sets with information on gene expression profiles in different human cancer types (Table 1). Three data sets, leukemia [1], diffuse large B-cell lymphoma (DLBCL) [2] and prostate tumor [14], have two categories. The leukemia data includes 48 acute lymphoblastic leukemia (ALL) samples and 25 acute myeloid leukemia (AML) samples, each with 7074 gene expression values. The DLBCL data set includes 7070 gene expression profiles for 77 patients, 58 with DLBCL and 19 with follicular lymphoma (FL). The prostate tumor data set includes 12533 genes measured for 52 prostate tumor and 50 normal tissue samples. The data for these three data sets and for the mixed lineage leukemia (MLL) data set were produced with Affymetrix gene chips and are available at http://wwwgenome.wi.mit.edu/cancer/. We additionally analyzed two multi-category data sets. The MLL [15] data set includes 12533 gene expression values for 72 samples obtained from the peripheral blood or bone marrow of affected individuals. The ALL samples with a chromosomal translocation involving the mixed lineage gene were diagnosed as MLL, so three different leukemia classes were obtained (AML, ALL and MLL). The SRBCT data set [4] consists of four types of tumors in childhood, including Ewing's sarcoma (EWS), rhabdomyosarcoma (RMS), neuroblastoma (NB) and Burkitt's lymphoma (BL). It includes 2308 genes and 83 samples derived from tumor biopsies and cell lines. The data for the SRBCT data set were obtained from cDNA microarrays. The data set is available at http://research.nhgri.nih.gov/microarray/Supplement/.
3.2 Gene Ranking by ReliefF and Signal-to-Noise Statistics
We started our experiments with a comparative study of the ReliefF and S2N scores and the associated gene rankings. For all five data sets, the histograms of ReliefF and S2N scores were qualitatively similar, being skewed to the right, with a group of about 50 to 100 most discriminating genes in the right tail.

Table 1. Cancer-related gene expression data sets used in our study. Columns report the number of examples, diagnostic classes and genes included in a data set, and the proportion of examples in the majority diagnostic class. The last two columns show the average probability of correct classification (P) for the top-ranked scatterplot and radviz projection

Data set   Samples   Classes   Genes   Major class   P scatterplot   P radviz
Leukemia   73        2         7074    52.8%         98.04%          99.55%
MLL        72        3         12533   38.9%         94.82%          99.75%
SRBCT      83        4         2308    34.9%         87.69%          99.74%
Prostate   102       2         12533   51.0%         91.76%          98.27%
DLBCL      77        2         7070    75.3%         96.82%          99.71%
[Figure 2 here: (a) histogram of ReliefF scores for actual attribute values; the part with the best ranked attributes is magnified in the upper right corner. (b) histogram of ReliefF scores for permuted data.]
Fig. 2. Histograms of ReliefF on actual and permuted values of attributes (Section 3.2)
A permutation test was used to verify whether these highly discriminatory genes were assigned high scores by chance: we calculated ReliefF scores after randomly permuting the expression values of each of the attributes. In the interest of brevity, we here show only the histogram of ReliefF scores on the MLL data set and its corresponding histogram on randomly permuted data (Figure 2). Note that the part with the highest scored genes (the magnification in Fig. 2.a) lies far outside the normal-shaped distribution computed on the permuted data (Fig. 2.b). The association between the gene ranks obtained by the univariate S2N and the multivariate ReliefF gene ranking methods was measured by the nonparametric Spearman rank correlation coefficient. The Spearman coefficient varied considerably depending on the data set: the correlation between ReliefF and S2N was highest (0.89) in the MLL data set but as low as 0.24 in the DLBCL data set, indicating that these two scoring functions would typically yield very different rankings and providing grounds for the hypothesis that the data include substantial interactions between genes.
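A minimal sketch of the two analyses just described follows; score_fn is a placeholder abstracting a gene scorer such as ReliefF or S2N, and the Spearman formula assumes rankings without ties.

```python
import random

def permutation_scores(data, score_fn):
    """Re-score each gene after randomly permuting its expression values.
    data: {gene: list of values}; score_fn maps a value list to a score."""
    return {g: score_fn(random.sample(vals, len(vals)))
            for g, vals in data.items()}

def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two rankings ({gene: rank}, no ties)."""
    n = len(rank_a)
    d2 = sum((rank_a[g] - rank_b[g]) ** 2 for g in rank_a)
    return 1 - 6.0 * d2 / (n * (n * n - 1))
```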
3.3
Results for the Cancer Gene Expression Data Sets
On all the data sets, VizRank found either scatterplot or radviz visualizations with a clear separation of the diagnostic classes. When run for an hour on a standard PC, the number of such projections was in the range of ten to twenty, but most importantly, most of them were found in the first few seconds. The last two columns in Table 1 show VizRank's scores, the average probability of correct classification, for the top-ranked scatterplot and radviz projections. The best scatterplot projections were scored from 87.69% to 98.04%, with the lowest score assigned to the multiclass SRBCT data set.
Fig. 3. Projections for the SRBCT data set (Section 3.3): (a) scatterplot; (b) radviz. The legend distinguishes class=EWS, class=BL, class=NB and class=RMS; the genes shown include WAS, SGCA, FCGRT, CAV1, IGF2 and AF1Q.
We found that as the number of classes exceeds two, the scatterplot becomes less appropriate, while radviz with a number of genes not exceeding five provides excellent separations. For the radviz visualizations the scores shown (from 98.27% to 99.75%) hold for the top-ranked projections using only five genes; increasing the number of genes may yield a better separation of instances from different classes and thus a higher score. For reasons of space, we here show the visualizations and provide a corresponding discussion only for one two-class and one multi-class problem. For the leukemia data set the best scatterplot obtained a score of 98% and included the genes APLP2 and TCF3 (Figure 1.a). Notice the clear separation of instances from the ALL and AML classes, with only a few outliers. An example of a radviz visualization with good separation of examples from different classes and a VizRank score of 98% is shown in Figure 1.b. The five genes used in this projection are CFTR, CD19, LTC4S, IL8 and RNS2. In Table 2 we show the genes appearing in the best scatterplots together with their ReliefF and S2N ranks. Out of the 20 genes represented, two were found to be cancer genes and eleven are cancer related according to the Atlas of Genetics and Cytogenetics in Oncology and Haematology (http://www.infobiogen.fr/services/chromcancer/index.html). In the table those genes are marked as C and CR, respectively. Figure 3 shows the best rated scatterplot and radviz visualizations with a clear separation of instances from different classes for the multi-category SRBCT data set. While the scatterplot does not provide a good separation between all diagnostic classes, the separation in radviz is clear. In Table 2 we also show the twenty genes appearing in the 41 radviz visualizations with a VizRank score of 100%, their ReliefF and S2N ranks, and whether they are related to carcinogenesis according to the same Atlas.
Table 2. Twenty best genes from the scatterplot visualizations for the leukemia data set (a) and from the radviz visualizations for the SRBCT data set (b). Columns 3 and 4 show the ReliefF (Rlf) and S2N ranks, respectively. In the cancer column: C = cancer gene, CR = cancer related gene, N = not related to cancer

(a) Leukemia
Gene    | Name     | Rlf  | S2N  | Cancer
L09209  | APLP2    | 3    | 3    | N
M31523  | TCF3     | 49   | 0    | C
U77948  | KAI1     | 414  | 26   | CR
X85116  | EPB72    | 42   | 62   | CR
M91592  | ZNF76    | 578  | 112  | N
U22376  | MYB      | 50   | 9    | CR
M62982  | ALOX12   | 143  | 872  | CR
Y12556  | PRKAB1   | 123  | 3082 | CR
U82759  | HOXA9    | 9    | 25   | C
U27460  | UGP2     | 588  | 59   | N
L08895  | MEF2C    | 172  | 72   | N
U46499  | MGST1    | 0    | 2    | CR
X03663  | CSF1R    | 856  | 1189 | CR
U26312  | CBX3     | 1152 | 108  | CR
X04741  | UCHL1    | 792  | 4140 | CR
M31211  | MYL      | 57   | 4    | CR
D16469  | ATP6AP1  | 249  | 341  | CR
U68063  | SFRS10   | 1070 | 134  | N
U51240  | KIAA0085 | 104  | 208  | N
U09087  | TMPO     | 100  | 42   | N

(b) SRBCT
Gene    | Name     | Rlf | S2N | Cancer
770394  | FCGRT    | 0   | 1   | CR
236282  | WAS      | 26  | 3   | CR
796258  | SGCA     | 23  | 104 | N
207274  | IGF2     | 92  | 39  | CR
812105  | AF1Q     | 5   | 0   | N
183337  | HLA-DMA  | 29  | 14  | N
784224  | FGFR4    | 2   | 44  | CR
866702  | PTPN13   | 19  | 54  | CR
786084  | CBX1     | 10  | 79  | CR
814260  | FVT1     | 35  | 135 | CR
325182  | CDH2     | 66  | 63  | CR
244618  | EST      | 106 | 25  |
377461  | CAV1     | 4   | 23  | CR
296448  | IGF2     | 58  | 49  | CR
629896  | MAP1B    | 117 | 13  | CR
624360  | PSMB8    | 356 | 18  | N
745019  | EHD1     | 12  | 5   | N
1435862 | CD99     | 3   | 12  | CR
383188  | RCV1     | 13  | 2   | CR
767183  | HCLS1    | 21  | 22  | N
One of the genes from this list is an expressed sequence tag (EST). Out of the remaining 19 gene products, 13 are cancer related, which is 68%.
4
Discussion
The results in Table 2 show that the signal-to-noise and ReliefF measures rank genes very differently. This was expected, as most of our best-scored visualizations show at least some degree of interaction. For instance, in the radviz from Figure 3 none of the genes can provide a clear separation of the classes alone, yet when combined all of the cancer categories are perfectly separated. We would, however, have expected a better match between the ReliefF ranking and the genes included in the best set of visualizations. We hypothesize that the problem lies in the number of attributes in the data set that ReliefF considers as context. When ReliefF evaluates an attribute it selects some reference examples and searches for the most similar examples from the same class
and from the other classes. A context with too many genes can mask the effect of genes interacting with the estimated gene, thus preventing ReliefF from appropriately accounting for interactions. We compared our selection of "marker" genes in the leukemia data set to the genes selected by other methods. The best 20 genes from the scatterplot visualizations differ from the genes ranked best by Golub et al. [1], except for HOXA9 and EPB72. The primary reason is the use of univariate feature selection in their study. Also, in our study, we joined the test and training sets used by Golub et al. and applied the gene ranking and visualization methods to the combined set. In contrast to the leukemia data set, our selection of "marker" genes for the SRBCT data set is very similar to the genes ranked best by other methods. We found a very high consensus between our selection and the selection based on artificial neural networks [4]: 19 genes from the best scatterplot visualizations and 16 out of 20 from the best radviz visualizations were also selected by the ANN method. Interestingly, there is also a very high consensus (11 and 12 genes out of the 20 included in the best ranked scatterplot and radviz visualizations, respectively) between our method and the method of Fu [7] based on support vector machines.
5
Conclusion
The most striking result of our work is how easy it is to find a simple two-dimensional scatterplot or radviz visualization that clearly and unambiguously separates cancer diagnostic classes based on expression measurements for a few selected genes. This holds for all five data sets studied, and the same observation also applies to about fifty other publicly available data sets we have studied but do not report on in this paper. VizRank, the method we used to find good visualizations, often identifies the best ones within seconds of runtime. This is a significant achievement, especially when compared to the hours of runtime reported in a recent study that uses support vector machines combined with a set of other machine learning and feature selection approaches [5], and considering the clear presentation of results offered by these visualizations. A surprise was the relatively poor performance of ReliefF, which was expected to outshine the univariate gene scoring but instead performed similarly. We have yet to fully understand why this is so, as the finding does not match those from a systematic study of ReliefF on data sets that contain far fewer attributes [16]. The gene expression data visualizations reported here provide evidence that cancer diagnostic classes can be clearly separated using the expression data from only a few genes. VizRank also provides a way for robust selection of genes without the need for a particular scoring function. Our future work aims at using top-rated visualizations for probabilistic classification, thus also providing grounds for comparison with other, much more complex but nowadays prevailing computational methods in the area of gene expression cancer diagnosis.
References
1. Golub, T.R., Slonim, D.K., Tamayo, P., et al.: Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286 (1999) 531–537
2. Shipp, M.A., Ross, K.N., Tamayo, P., et al.: Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine 8 (2002) 68–74
3. Nutt, C.L., Mani, D.R., Betensky, R.A., et al.: Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Res 63 (2003) 1602–1607
4. Khan, J., Wei, J.S., Ringnér, M., et al.: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 7 (2001) 673–679
5. Statnikov, A., Aliferis, C.F., Tsamardinos, I., Hardin, D., Levy, S.: A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics (2004) 33–46
6. Su, A.I., Welsh, J.B., Sapinoso, L.M., et al.: Molecular classification of human carcinomas by use of gene expression signatures. Cancer Res 61 (2001) 7388–7393
7. Fu, L.M., Fu-Liu, C.S.: Multi-class cancer subtype classification based on gene expression signatures with reliability analysis. FEBS Letters 561 (2004) 186–190
8. Gamberger, D., Lavrac, N., Zelezny, F., Tolar, J.: Induction of comprehensible models for gene expression datasets by subgroup discovery methodology. Journal of Biomedical Informatics 37 (2004) 269–284
9. Wang, Y., Tetko, I.V., Hall, M.A., et al.: Gene selection from microarray data for cancer classification - a machine learning approach. Computational Biology and Chemistry 29 (2005) 37–46
10. Kira, K., Rendell, L.: A practical approach to feature selection. In: Proceedings of the Ninth International Conference on Machine Learning (1992) 249–256
11. Kononenko, I., Simec, E.: Induction of decision trees using ReliefF. In: Mathematical and Statistical Methods in Artificial Intelligence. Springer Verlag (1995)
12. Brunsdon, C., Fotheringham, A.S., Charlton, M.: An investigation of methods for visualising highly multivariate datasets. Case Studies of Visualization in the Social Sciences (1998) 55–80
13. Leban, G., Bratko, I., Petrovic, U., Curk, T., Zupan, B.: VizRank: finding informative data projections in functional genomics by machine learning. Bioinformatics 21 (2005) 413–414
14. Singh, D., Febbo, P.G., Ross, K., et al.: Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1 (2002) 203–209
15. Armstrong, S.A., Staunton, J.E., Silverman, L.B., Pieters, R., et al.: MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics 30 (2001) 41–47
16. Robnik-Šikonja, M., Kononenko, I.: Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning 53 (2003) 23–69
An Algorithm to Learn Causal Relations Between Genes from Steady State Data: Simulation and Its Application to Melanoma Dataset

Xin Zhang1, Chitta Baral1, and Seungchan Kim1,2

1 Department of Computer Science and Engineering, Arizona State University, Tempe, AZ 85287, USA
2 Translational Genomics Research Institute, 445 N. Fifth Street, Phoenix, AZ 85004, USA
Abstract. In recent years, a few researchers have challenged past dogma and suggested methods (such as the IC algorithm) for inferring causal relationships among variables using steady state observations. In this paper, we present a modified IC (mIC) algorithm that uses entropy to test conditional independence and combines steady state data with partial prior knowledge of the topological ordering of the gene regulatory network, in order to jointly learn the causal relationships among genes. We evaluate our mIC algorithm using simulated data. The results show that the precision and recall rates are significantly improved compared with the IC algorithm. Finally, we apply the mIC algorithm to microarray data for melanoma. The algorithm identified important causal relations associated with WNT5A, a gene playing an important role in melanoma, which are verified by the literature.
1
Introduction
The recent development of high-throughput genomic technologies like cDNA microarrays and oligonucleotide chips [14] empowers researchers in new ways to study how genes interact with each other. This has led researchers to use mathematical modeling and in-silico simulation studies to analyze the interaction structure unambiguously and to predict the network's dynamic behavior in a systematic way [4, 19]. In previous studies on the cellular response to genotoxic damage [10] and on melanoma data [2, 11, 17], the coefficient of determination (CoD) is used to infer gene network structure. CoD provides a normalized measure of the degree to which target variables can be better predicted using the observations in a feature set than they can be in the absence of those observations. While CoD provides useful information about network connectivity, the relationships identified via CoD do not necessarily imply causal relations. The Bayesian network model, which represents statistical dependencies, has also been proposed to discover interactions between genes [6, 22, 24]. Based on the Bayesian model, there are other network inference methods that were evaluated by applying them to biological simulations with known network topology [18]. The Bayesian network model considers the maximum likelihood of the observed data given a structure; it is a model associated with statistical probability that does not infer the real causal relationships. Other than the Bayesian network model, Gardner et al. [5] proposed a linear dynamic network model to infer a gene network from steady-state measurements. In addition to the above
models, several other probabilistic models have been proposed to learn gene networks with multiple data types [7, 15]. In fact, none of the earlier research on learning gene networks considers the causal relationships between genes. A few researchers [12, 13, 21] have suggested methods (for example, the IC algorithm) to learn the causal relationships between variables from steady state data, but not in the biological domain. The assumption of causal theory is that the distribution of the dataset is faithful1. However, under the Boolean gene network model, a proper Boolean function (which we define formally later) does not imply a faithful distribution. Therefore, learning gene causal connections with the IC algorithm does not yield good results when the distribution of the dataset is not faithful. In this paper, we present a new algorithm, the modified IC (mIC) algorithm, for learning causal relations between genes using additional knowledge of topological ordering2. We implement the algorithms using entropy to test conditional independence of the genes, and evaluate the causal learning approaches using simulation with the notions of precision and recall. We show that the precision and recall rates of the estimated gene network are significantly improved when using our mIC algorithm rather than the original IC algorithm. Finally, we apply the mIC algorithm to a gene expression profile for melanoma data with partial ordering information, to learn the gene regulatory network. The results show that important causal relationships associated with the WNT5A gene are identified by the mIC algorithm, and those causal connections have been confirmed in the literature.
2
Learning Gene Causal Relationship
In this section we give a brief background on the difference between simple predictive relationships and causal relationships, and the basic intuition behind learning causal relationships from steady-state data. We also introduce our simulation methodology.
2.1
Learning Causal Relationship with Steady State Data
To understand the difference between simple and causal relationships between variables, consider the propositions rain and falling barometer from an example in [12]. When one observes that they are either both true or both false, one concludes that they are related. One would then write rain ↔ falling barometer. But neither does rain cause falling barometer, nor vice versa. Thus if one wanted rain to be true, one could not achieve it by somehow forcing falling barometer to be true. This would have been possible if falling barometer caused rain. We say that the relationship between rain and falling barometer is one of correlation, not causation.

1 A probability distribution P is a faithful/stable distribution if there exists a directed acyclic graph (DAG) D such that the conditional independence relationships in P are also exhibited in D, and vice versa.
2 A topological ordering is an ordering among the vertices of a DAG such that all edges go from vertices labeled with a smaller number to vertices labeled with a larger number. Knowledge about the topological ordering between genes can be obtained if partial information about the pathways in which the genes (or their products) are involved is known, and also from existing knowledge about homologous genes in other organisms.
In the context of genes and proteins, if one would like to turn on a gene, which cannot be achieved directly, through other genes, one would need to know the causal connections between the genes. Thus knowing the causal relationships is very important. The question then is how to obtain (learn or infer) causal relationships between genes. In wet labs this can be done by knocking down possible subsets of genes of a given set and studying the impact on the other genes in the set. This is of course not easy when the number of genes in the set is more than a handful. An alternative approach is to use time series gene expression data. Unfortunately, such data can only be obtained for cells of particular organisms such as yeast. For human tissues, high-throughput gene expression data is only available as steady-state observations. Thus the question is how to infer causal relationships between genes from steady-state data. For a long time it was thought that one can only infer correlations and other statistical measures such as conditional independence from steady-state data, and that there is no way to infer causal relationships from such data. In recent years some researchers have challenged this view and have suggested methods, although not specifically for gene expression data, to infer causal information. The idea is generalized by Pearl in [12] and Spirtes et al. in [21], and an Inductive Causation (IC) algorithm is presented where the causal relations between variables are learned or inferred by first analyzing independence and dependence between variables and then constructing minimal and stable causal influence graphs that satisfy the independence and dependence information. The idea behind the inference of causality from steady-state data is based on the principle of finding the simplest explanation of observed phenomena [12]. The causal relationships between a set of genes can be expressed using a causal model which consists of a causal structure (a directed acyclic graph, or DAG), and parameters that define the value of one node in terms of the values of its parents (in the DAG). The causal theory has an assumption on the distribution called stability or faithfulness3 [12]. The assumption is that all the independencies in the distribution P are stable, that is, P is entailed by a causal structure of a causal model regardless of the parameters. In a microarray dataset, however, the distribution might not be faithful. Hence the performance of the IC algorithm is not good (w.r.t. precision and recall) for inferring causal relationships in this case. We propose a modified IC (mIC) algorithm that uses entropy to test conditional independence and combines the steady state data with prior knowledge of gene topological ordering to jointly learn the causal relationships between genes.
2.2
Modelling and Simulation of a Causal Boolean Network
In order to evaluate the performance of the causal algorithms, we perform sets of simulations. We apply the Boolean network model, originally introduced by Kauffman [8, 9], for modeling gene regulatory networks. Although a Boolean network cannot model quantitative concepts, it provides useful insights into network dynamics [1]. There are two main objectives in the modeling and simulation of a data-driven Boolean network for genetic
3 A DAG G and a distribution P are faithful to each other if they exhibit the same set of independencies. A distribution P is said to be faithful if it is faithful to some DAG.
regulatory systems. First, we need to infer the model structure and parameters (rules) from observations such as gene expression profiles. Second, we can explore the dynamic behaviors of the system driven by the inferred rules through simulation. In the simulation, we construct a Boolean network model as a directed acyclic graph (DAG) and obtain the steady-state observations. The model contains n nodes with binary values, so the state space has a total of 2^n states. Theoretically there are 2^(2^k) possible functions for a node of a Boolean network, where k is the number of predictors. Among these 2^(2^k) functions, many do not actually reflect the influence of all the predictors. For example, assume that a gene g_i has two causal parents g_1 and g_2, and a Boolean function f determines the state of g_i at the next time step with g_i = f(g_1, g_2) = (g_1 ∧ g_2) ∨ (g_1 ∧ ¬g_2). The function f is one of the 2^(2^2) = 16 possible functions, but can be simplified to g_i = f(g_1, g_2) = g_1. In this case, function f does not reflect the causal influence of one of its causal parents, g_2. Therefore, we define the concepts of influence and proper Boolean function, and only use such functions in our simulation.

Definition 1. (Influence): Let z = f(x_1, ..., x_n) be a Boolean function. We say x_i has an influence on z in the function f if there exist two assignment vectors for x_1, ..., x_n that differ only in the assignment to x_i, such that the values of f on those two assignments differ.

Definition 2. (Proper function): We say z = f(x_1, ..., x_n) is a proper function if for i = 1, ..., n, x_i has an influence on z in the function f.

We performed sets of experiments to show that under a non-uniform distribution a proper function is a faithful function,4 which entails the original causal structure (a sketch of the influence test follows the simulation steps below).

4 There exist some special cases in which, under certain non-uniform distributions, a proper function might not be a faithful function. But those cases are rare and, with the random generator used in the simulation, we may ignore them.

The simulation process in this study can be summarized as follows:
– Step 1: Generate M Boolean networks with up to three input causal parents for each node, in topological ordering.
– Step 2: For each Boolean network connection, generate random proper Boolean functions for each node.
– Step 3: Assign random probabilities to the root genes (genes with no causal parents).
– Step 4: Given one configuration (fixed connections and functions), run the deterministic Boolean network starting from all possible initial states and obtain the probability distribution over all possible states.
– Step 5: Collect two hundred data points sampled from the probability distribution.
– Step 6: Repeat Steps 3 to 5 for all M networks and save the configuration file and the data file.
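The referenced sketch of the influence test behind Definitions 1 and 2 follows; it is an illustration written for this text, not the authors' simulator.

```python
from itertools import product

def has_influence(f, n, i):
    """True if flipping input i can change the value of the n-ary function f."""
    for bits in product((0, 1), repeat=n):
        flipped = list(bits)
        flipped[i] ^= 1
        if f(*bits) != f(*flipped):
            return True
    return False

def is_proper(f, n):
    """Definition 2: every input must have an influence on the output."""
    return all(has_influence(f, n, i) for i in range(n))

# The example from the text: f(g1, g2) = (g1 AND g2) OR (g1 AND NOT g2)
# simplifies to g1, so g2 has no influence and f is not proper.
f = lambda g1, g2: int((g1 and g2) or (g1 and not g2))
assert not is_proper(f, 2)
```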
2.3
Entropy and Mutual Information
Given the probability distribution of a dataset, one needs to compute the conditional independence among genes to find the causal information. Shannon [16] developed the concept of entropy to measure the uncertainty of discrete random variables. In this
paper we calculate the entropy H and the mutual information I to obtain the uncertainty coefficient U, which we use to test conditional independence between genes. The uncertainty coefficient U ranges from 0 to 1 and is defined as follows:

U(X, Y) = I(X, Y) / H(X),        (1)

where H(X) = -Σ_x p(x) log p(x); H(X, Y) = -Σ_x Σ_y p(x, y) log p(x, y); and I(X, Y) = H(X) + H(Y) - H(X, Y) [25].
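A small sketch of Eq. (1) for two discrete variables given as paired observation lists; this is an illustration, not the authors' implementation.

```python
from collections import Counter
from math import log

def entropy(xs):
    """H(X) = -sum p(x) log p(x), with probabilities estimated by counting."""
    n = len(xs)
    return -sum((c / n) * log(c / n) for c in Counter(xs).values())

def uncertainty_coefficient(xs, ys):
    """U(X, Y) = I(X, Y) / H(X), with I(X, Y) = H(X) + H(Y) - H(X, Y)."""
    h_x, h_y = entropy(xs), entropy(ys)
    h_xy = entropy(list(zip(xs, ys)))
    return (h_x + h_y - h_xy) / h_x if h_x > 0 else 0.0
```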
3
Algorithms and Criterion for Inferring a Gene Causal Network
3.1
Modified IC (mIC) Algorithm
The IC algorithm [12] examines pairwise conditional independencies between variables to determine the v-structures5 first, and then applies rules to determine the rest of the network structure. The mIC algorithm is based on the IC algorithm [12], but it incorporates the topological ordering information in the learning step to infer the gene causal relationships from steady state data. It takes as input a probability distribution P generated by a DAG, together with some gene topological ordering information, and outputs a partially directed DAG. The mIC algorithm is described as follows (a condensed sketch follows the list):
– Step 1: For each pair of genes g_i and g_j in the dataset, test pairwise conditional independence. If they are dependent, search for a set S_ij = {g_k : g_i and g_j are independent given g_k, with i < k < j or j < k < i}. Construct an undirected graph G such that g_i and g_j are connected with an edge if and only if they are pairwise dependent and no S_ij can be found.
– Step 2: For each pair of nonadjacent genes g_i and g_j with a common neighbor g_k, if g_k ∉ S_ij and k > i, k > j, add arrowheads pointing at g_k, i.e., g_i → g_k ← g_j.
– Step 3: Orient the remaining undirected edges without creating new cycles or v-structures.

5 A v-structure is of the form a → x ← b, i.e., two converging arrows such that their tails a and b are not connected by an arrow.
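The referenced condensed sketch of Steps 1 and 2 is given below; indep is a hypothetical predicate for the conditional independence test (e.g. thresholding the uncertainty coefficient of Section 2.3), genes are assumed to be indexed 0..n-1 in topological order, and the edge-orientation rules of Step 3 are omitted.

```python
def mic_skeleton(n, indep):
    """indep(i, j, k) tests conditional independence of genes i and j
    given gene k; k=None requests the unconditional test."""
    edges, sep = set(), {}
    # Step 1: build the undirected skeleton and record separating sets S_ij.
    for i in range(n):
        for j in range(i + 1, n):
            if indep(i, j, None):
                continue
            s_ij = {k for k in range(i + 1, j) if indep(i, j, k)}
            if s_ij:
                sep[(i, j)] = s_ij
            else:
                edges.add((i, j))           # undirected edge g_i -- g_j
    # Step 2: orient v-structures g_i -> g_k <- g_j for a later common
    # neighbor k that is not in the separating set S_ij.
    arrows = set()
    for (i, j), s in sep.items():
        for k in range(j + 1, n):           # k > i and k > j
            if (i, k) in edges and (j, k) in edges and k not in s:
                arrows.add((i, k))
                arrows.add((j, k))
    return edges, arrows
```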
3.2
Comparing Initial and Obtained Networks - New Definitions for Precision and Recall
For evaluating the learning results, we define new notions of precision and recall. In comparing the initial and obtained networks, one immediate challenge we faced is in defining recall and precision for the case where the inferred graph may have both directed and undirected edges. (Note that the original graph has only directed edges.) Intuitively, an undirected edge A – B means that we cannot distinguish the directionality between A and B with the given dataset. To deal with this we define the following categories: FN (false negatives), TP (true positives), PTP (partial true positives), PFN (partial false negatives), TN (true negatives), FP (false positives), PTN (partial true negatives), and PFP (partial false positives), as follows:
FN = {X → Y : X → Y is in the original graph and neither X → Y nor X – Y is in the obtained graph}
TP = {X → Y : X → Y is in the original graph and also in the obtained graph}
TN = {X → Y : X → Y is not in the original graph and neither X → Y nor X – Y is in the obtained graph}
FP = {X → Y : X → Y is not in the original graph and X → Y is in the obtained graph}
PFN = PTP = {X → Y : X → Y is in the original graph and X – Y is in the obtained graph}
PTN = PFP = {X → Y : X → Y is not in the original graph and X – Y is in the obtained graph}
Now we can define the values AFP (aggregate number of false positives), ATN (aggregate number of true negatives), AFN (aggregate number of false negatives) and ATP (aggregate number of true positives) in terms of the above categories:

AFN = |FN| + |PFN|/2;  ATP = |TP| + |PTP|/2;  AFP = |FP| + |PFP|/2;  ATN = |TN| + |PTN|/2,

where |X| is the cardinality of the set X. Using the above we can now define Recall and Precision as follows:

Recall = ATP / (AFN + ATP);  Precision = ATP / (ATP + AFP).
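Assuming, as the definitions above suggest, that each partially matched (undirected) edge contributes half a count, the aggregate measures can be computed as in the following sketch.

```python
def precision_recall(tp, fn, fp, ptp, pfn, pfp):
    """Aggregate counts: each partial (undirected) match counts as one half."""
    atp = tp + ptp / 2.0
    afn = fn + pfn / 2.0
    afp = fp + pfp / 2.0
    recall = atp / (afn + atp) if (afn + atp) else 0.0
    precision = atp / (atp + afp) if (atp + afp) else 0.0
    return precision, recall
```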
3.3
Precision and Recall with Observational Equivalence
The output of the IC algorithm is a pattern, a partially directed DAG, which represents a set of DAGs with equivalent structures. Every edge in the original network is directed, while the edges in the obtained graph may be directed or undirected. There might be cases where a directed edge in the original graph has a corresponding undirected edge in the obtained graph. Therefore, from the point of view of observational equivalence (OE), we should not penalize such edges. Here we define new notions of precision and recall that take observational equivalence into account. We transform both the original graph and the obtained graph into their observational equivalence classes, called the original class and the obtained class, using the definition of observational equivalence [12]. We then define the categories as follows:

FN = {(X, Y) : X → Y or X – Y is in the original class and neither X → Y nor X – Y is in the obtained class}
TP = {(X, Y) : X → Y is in the original class and also in the obtained class, or X – Y is in the original class and also in the obtained class}
TN = {(X, Y) : neither X → Y nor X – Y is in the original class and neither X → Y nor X – Y is in the obtained class}
FP = {(X, Y) : neither X → Y nor X – Y is in the original class and X → Y is in the obtained class}
PFN = PTP = {(X, Y) : X → Y is in the original class and X – Y is in the obtained class, or X – Y is in the original class and X → Y is in the obtained class}
PTN = PFP = {(X, Y) : neither X → Y nor X – Y is in the original class and X – Y is in the obtained class}
The concepts of AFN, ATP, ATN, AFP, precision and recall are the same as those defined in the previous section.
3.4
Comparing the Networks Based on Their Transitive Closure
There are many ways of comparing the initial and obtained graphs. We discussed above how to compare two networks directly, with and without observational equivalence. Transitive closure (TC) is another way of comparing graphs. Suppose the initial network is A → B → C and we obtain the network with the only edge
A → C. When comparing the obtained network with the initial network, we should not treat A → C just as a false positive; in fact this obtained network is better than the network with no edges. To be able to make this conclusion we consider the TC of the initial network and a similar notion in the obtained network. In the obtained network our definition of TC is based on defining two relations: cc(x, y) and pcc(x, y). Intuitively, cc(x, y), denoting "x causally contributes to y", holds if there is a directed or an undirected edge from x to y; and pcc(x, y), denoting "x possibly causally contributes to y", holds if there is a path from x to y consisting of properly directed edges and undirected edges, so that

pcc(x, y) := cc(x, y) ∨ (∃z : pcc(x, z) ∧ pcc(z, y)).
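The pcc relation is thus the transitive closure of cc; a small Warshall-style sketch (ours, not the authors' code) follows.

```python
def pcc_closure(nodes, cc):
    """cc: set of (x, y) pairs for which cc(x, y) holds (a directed or
    undirected edge from x to y). Returns all pairs with pcc(x, y)."""
    pcc = set(cc)
    for z in nodes:                # Warshall-style transitive closure
        for x in nodes:
            for y in nodes:
                if (x, z) in pcc and (z, y) in pcc:
                    pcc.add((x, y))
    return pcc
```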
3.5
Steps of Learning Gene Causal Relationships
The steps for learning gene causal relationships are as follows:
– Step 1: Obtain the probability distribution, the data sampling, and the topological order of the genes.
– Step 2: Apply algorithms such as IC or mIC to find causal relations.
– Step 3: Compare the original and obtained networks based on the two notions of precision and recall.
– Step 4: Repeat Steps 1–3 for every random network.
4
Experiments, Results and Discussion
We performed two sets of experiments for learning gene causal relationships using the IC algorithm and the mIC algorithm. Each experiment contains 100 different randomly generated gene networks (DAGs), each of which contains 10 genes with a topological ordering, connected by proper Boolean functions. The distribution of each network is generated based on the probabilities of the root genes, and Monte Carlo sampling is used to generate 200 samples in a dataset for each network based on the probability distribution. We use the uncertainty coefficient (U) to test conditional independence in Step 1 of the algorithm. We choose a threshold of U = 0.3 for the pairwise and U = 0.2 for the triplewise conditional independence test. The threshold cut-off values are based on heuristics that we elaborate in [25].
4.1
Learning with IC Algorithm
The first experiment uses the IC algorithm to learn gene causal relations from steady-state data without topological ordering information. The method is applied to derive an obtained graph for every network, and the obtained graphs are then compared with their corresponding initial ones. The results, with 95% statistical confidence marked as error bars, are shown in Figure 1. Figure 1(a) shows the precision of the simulation with IC using two notions: with and without OE and TC. The results show that the precision rate for inferring causal relations in the simulation is 0.3 without OE or TC, 0.45 with OE, around 0.4 with TC, and around 0.5 with OE and TC. Figure 1(b) shows that the recall rate is below 0.3 without OE, and around 0.4 with OE. From the figure we can see that both precision and recall are significantly improved by using the notion of observational equivalence. However, the recall rate is still only around 0.4.
Fig. 1. Precision and Recall for learning by the IC algorithm: (a) Precision; (b) Recall (each with and without OE, and with and without TC).
From the above simulation results we can see that the IC algorithm is not well suited for learning gene regulatory networks using only steady-state data.
4.2
Learning with Topological Ordering (mIC)
Since using the IC algorithm to learn a gene causal network from a single type of dataset (steady-state data) did not give good results, our hypothesis is that a better way is to use additional knowledge such as gene topological ordering. In the second simulation we jointly learned the gene regulatory network using the mIC algorithm, combining the steady-state observations with the background knowledge of gene topological order. We then compare the results with the ones learned by the IC algorithm, as shown in Figure 2.
Fig. 2. Precision and Recall for learning by mIC with ordering: (a) Precision; (b) Recall (IC and mIC, each with and without OE, and with and without TC).
Figure 2(a) shows that, with 95% statistical confidence marked as error bars, the precision rate is significantly improved with the mIC algorithm, reaching around 0.6 with TC and OE. The results in Figure 2(b) are promising, since the recall rates are significantly improved from less than 0.3 with the IC algorithm to greater than 0.45 with the mIC algorithm, an improvement of more than 50%, both with and without considering TC or OE. When observational equivalence is considered, the recall rate of the mIC algorithm improves to above 0.5, with or without TC.
5
Applying mIC Algorithm on Melanoma Dataset
We finally applied the mIC algorithm to a gene expression profile used in the study of melanoma [2]. 31 malignant melanoma samples were quantized to a ternary format such that the expression level of each gene is assigned -1 (down-regulated), 0 (unchanged) or 1 (up-regulated). The 10 genes involved in this study were chosen from 587 genes from the melanoma dataset that have been studied to cross-predict each other in a multivariate setting [11]: pirin, WNT5A, MART-1, S100, RET-1, MMP-3, PHO-C, synuclein, HADHB and STC2. In a previous expression profiling study, WNT5A was identified as a gene of interest involved in melanoma [2], and the expression level of WNT5A is closely related to the metastatic status of melanoma [23]. It was shown that the abundance of messenger RNA for WNT5A can significantly distinguish cells with high metastatic competence from those with low metastatic competence [2]. Later, it was also shown experimentally that increasing the level of WNT5A protein can directly change the cell's metastatic competence [23]. It has also been suggested that controlling the influence of WNT5A in the regulation can reduce the chance of melanoma metastasizing [3]. In this study of the 10-gene network, we have partial biological prior knowledge that MMP-3 is expected to be at the end of the pathway. We applied the mIC algorithm, using entropy to test conditional independence among these 10 genes, with the above prior knowledge to infer the causal regulatory network. The learning results are shown in Figure 3.
Fig. 3. Learning the melanoma dataset with the prior knowledge that MMP-3 is at the end of the gene regulatory network (nodes: S100, WNT5A, MART-1, pirin, PHO-C, synuclein, RET-1, STC2, MMP-3, HADHB).
Figure 3 shows that pirin causally influences WNT5A. This result is consistent with the literature [3]: in order to maintain the level of WNT5A, we need to control WNT5A either directly or through pirin. The result also shows the causal connection between WNT5A and MART-1, such that WNT5A directly causes MART-1; it has been verified in the literature [20] that WNT5A may actually directly influence the regulation and suppression of MART-1 expression. In Figure 3 there are some causal connections that have not yet been verified experimentally. However, they are unlikely to have been obtained by random chance (see the supplementary materials at http://www.public.asu.edu/~xzhang24/AIME05). The mIC algorithm provides a systematic way to predict the causal connections among genes using steady-state data with some prior
biological knowledge. It could serve as guidance for biologists verifying these causal connections in future experiments.
6
Conclusion
In this paper we presented a modified IC algorithm, using entropy, that can learn from steady-state data together with gene topological ordering information. We performed simulations based on Boolean networks to evaluate the performance of the causal algorithms. In the process we developed ways to compare initial networks with obtained networks. From our simulation-based evaluation we conclude that (i) the IC algorithm does not work well for learning gene regulatory networks from steady-state data alone, (ii) a better way to learn gene causal relationships from steady-state data is to use additional knowledge such as gene topological ordering, and (iii) the precision and recall rates of the mIC algorithm are significantly improved compared with the IC algorithm, with 95% statistical confidence. For randomly generated networks, the mIC algorithm works well for jointly learning the causal regulatory network by combining steady-state data and gene topological ordering knowledge, with a precision rate greater than 60% and a recall rate greater than 50%. We then applied the algorithm to real biological microarray data, the melanoma dataset. The results showed that some of the important causal relationships associated with the WNT5A gene were identified by the mIC algorithm, and those causal connections have been verified in the literature.
Acknowledgement
This work was supported by NSF grant number 0412000. The authors acknowledge the valuable comments of the anonymous reviewers of this paper.
References
1. Akutsu, T., et al.: Identification of Genetic Networks from A Small Number of Gene Expression Patterns under the Boolean Network Models. PSB, 1999.
2. Bittner, M., et al.: Molecular Classification of Cutaneous Malignant Melanoma by Gene Expression Profiling. Nature, 406:536-540, 2000.
3. Datta, A., Bittner, M. and Dougherty, E.: External Control in Markovian Genetic Regulatory Networks. Machine Learning, Vol. 52, 169-191, 2003.
4. De Jong, H.: Modeling and Simulation of Genetic Regulatory Systems: A Literature Review. Journal of Computational Biology, 2002. 9(1): p. 67-103.
5. Gardner, T.S., di Bernardo, D., et al.: Inferring Genetic Networks and Identifying Compound Mode of Action via Expression Profiling. Science, 2003.
6. Friedman, N., Linial, M., Nachman, I. and Pe'er, D.: Using Bayesian Networks to Analyze Expression Data. RECOMB 2000.
7. Hartemink, A., et al.: Combining Location and Expression Data for Principled Discovery of Genetic Regulatory Network Models. PSB, 2002.
8. Kauffman, S.A.: Requirements for Evolvability in Complex Systems: Orderly Dynamics and Frozen Components. Physica D, 1990. 42: p. 135-152.
9. Kauffman, S.A.: The Origins of Order: Self-Organization and Selection in Evolution. Oxford University Press, 1993.
10. Kim, S., et al.: Multivariate Measurement of Gene-expression Relationships. Genomics, vol. 67, pp. 201-209, 2000.
11. Kim, S., et al.: Can Markov Chain Models Mimic Biological Regulation? Journal of Biological Systems, vol. 10, no. 4 (2002) 337-358.
12. Pearl, J.: Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, U.K.; New York, 2000. xvi, 384 p.
13. Scheines, R., Glymour, C. and Meek, C.: TETRAD II: Tools for Discovery. Lawrence Erlbaum Associates, Hillsdale, NJ, 1994.
14. Schulze, A. and Downward, J.: Navigating Gene Expression Using Microarrays - A Technology Review. Nature Cell Biology, 2002. 3: p. 190-195.
15. Segal, E., et al.: From Promoter Sequence to Expression: A Probabilistic Framework. RECOMB 2002.
16. Shannon, C.: A Mathematical Theory of Communication. The Bell System Technical Journal, 27, 1948.
17. Shmulevich, I., et al.: Probabilistic Boolean Networks: A Rule-based Uncertainty Model for Gene Regulatory Networks. Bioinformatics, 18(2), 2002.
18. Smith, V., Jarvis, E. and Hartemink, A.: Influence of Network Topology and Data Collection on Network Inference. PSB 2003.
19. Smolen, P., et al.: Modeling Transcriptional Control in Gene Networks - Methods, Recent Results, and Future Directions. Bull Math Biol, 62(2), 2000.
20. Sosman, J., Weeraratna, A., Sondak, V.: When Will Melanoma Vaccines Be Proven Effective? Journal of Clinical Oncology, vol. 22, no. 3, 2004.
21. Spirtes, P., Glymour, C. and Scheines, R.: Causation, Prediction, and Search. Springer-Verlag, New York, 1993; 2nd edition, MIT Press, 2000.
22. Spirtes, P., et al.: Constructing Bayesian Network Models of Gene Expression Networks from Microarray Data. Proceedings of the Atlantic Symposium on Computational Biology, Genome Information Systems and Technology, 2000.
23. Weeraratna, A.T., Jiang, Y., et al.: Wnt5a Signalling Directly Affects Cell Motility and Invasion of Metastatic Melanoma. Cancer Cell, 1, 279-288, 2002.
24. Yoo, C. and Cooper, G.F.: Discovery of Gene-Regulation Pathways Using Local Causal Search. Proc AMIA Symp, 2002: p. 914-918.
25. Supplement Materials: http://www.public.asu.edu/~xzhang24/AIME05
Relation Mining over a Corpus of Scientific Literature

Fabio Rinaldi1, Gerold Schneider1, Kaarel Kaljurand1, Michael Hess1, Christos Andronis2, Andreas Persidis2, and Ourania Konstanti2

1 Institute of Computational Linguistics, IFI, University of Zurich, Switzerland
{rinaldi, gschneid, kalju, hess}@ifi.unizh.ch
2 Biovista, Athens, Greece
{candronis, andreasp, okonst}@biovista.com
Abstract. The amount of new discoveries (as published in the scientific literature) in the area of Molecular Biology is currently growing at an exponential rate. This growth makes it very difficult to filter the most relevant results, and the extraction of the core information, for inclusion in one of the knowledge resources being maintained by the research community, becomes very expensive. Therefore, there is a growing interest in text processing approaches that can deliver selected information from scientific publications, which can limit the amount of human intervention normally needed to gather those results. This paper presents and evaluates an approach aimed at automating the process of extracting semantic relations (e.g. interactions between genes and proteins) from scientific literature in the domain of Molecular Biology. The approach, using a novel dependency-based parser, is based on a complete syntactic analysis of the corpus.1
1
Introduction
The amount of research results in the area of molecular biology is growing at such a pace that it is extremely difficult for individual researchers to keep track of them. As such results appear mainly in the form of scientific articles, it is necessary to process them in an efficient manner in order to be able to extract the relevant results. Although many databases aim at consolidating the newly gained knowledge in a format that is easily accessible and searchable (e.g. UMLS, Swiss-Prot, OMIM, Gene Ontology, GenBank, LocusLink), the creation of such resources is a very labour-intensive process. Relevant articles have to be selected and accurately read by a human expert looking for the core information.2

1 Part of the material contained in this paper has been previously presented at the Workshop on Data Mining and Text Mining for Bioinformatics, Pisa, September 2004.
2 This process is referred to as 'curation' of the article.
The various genome sequencing efforts have resulted in the creation of large databases containing gene sequences. However, such information is of little use without knowledge of the function of each gene and its role in biological pathways. Understanding the relationships between genes and pathways is central to biology research and drug design, as they form an array of intricate and interconnected molecular interaction networks which is the basis of normal development and the sustenance of health. In the context of the OntoGene project3 we aim at developing and refining methods for the discovery of interactions between biological entities (genes, proteins, pathways, etc.) from the scientific literature, based on a complete syntactic analysis of the articles, using a novel high-precision parsing approach. We consider that advanced parsing techniques combining statistics and human knowledge of linguistics have matured enough to be successfully applied in real settings. OntoGene is intended as a framework for testing this hypothesis in the area of Biomedical Text Mining, where these techniques could have a significant impact. In Section 2, we present the "DepGENIA" corpus upon which our methodology is based. Section 3 describes the Relation Mining approach that we have adopted. Section 4 describes the evaluation of our results and briefly discusses current and future work. We conclude with a survey of related work in Section 5.
2
The Corpus
GENIA [1]4 is a corpus of 2000 MEDLINE abstracts which have been annotated for various biological entities, according to the GENIA Ontology.5 We use version G3.02 of the GENIA corpus, which includes 18546 sentences (an average of 9.27 sentences per article) and 490941 words (an average of 26.47 words per sentence). The advantage of working over GENIA is that it provides pre-annotated terminological units (genes, proteins, etc.), thus removing the need for Terminology Recognition / Entity Detection. This allows attention to be focused on other challenges. In a first step, we convert the XML annotations of the GENIA corpus into a richer annotation schema [2]. There are two main reasons for performing this step. First, in the new annotation schema all relevant entities are given a unique identifier. As identifiers are preserved during all steps of processing, the existence of a unique identifier for each sentence and each token in the corpus later simplifies the task of presenting the results to the user. The second reason is that the new annotation scheme allows for a neater distinction of different 'layers' of annotations (structural, textual and conceptual), which again simplifies later steps of processing.

3 http://www.ontogene.org/
4 http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/
5 http://www-tsujii.is.s.u-tokyo.ac.jp/~genia/topics/Corpus/genia-ontology.html
We then apply to the resulting modified version of GENIA a pipeline of tools defined as follows:
1. replace terms with their heads
2. lemmatization of all tokens (with morpha)6
3. noun group and verb group chunking (LT CHUNK)7
4. detection of heads in the group (with two simple rules: take the last noun from the noun group; take the last verb from the verb group)
5. dependency parsing (Pro3Gres)
The pipeline (itself declaratively specified in XML) has been implemented as an Apache Ant build file8, which supports easy integration or replacement of specific components in the sequence. The end result of the process is a set of dependency relations, encoded as (sentence-id, type, head, dependent) tuples, which can be delivered either in CSV or XML (a toy reader is sketched below). This is a format well suited for storage in a relational database, for further processing with a spreadsheet tool, or for analysis with Data Mining algorithms. We call this modified resource "DepGENIA".9
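The referenced toy reader for such tuples might look as follows; the CSV column order shown is our assumption based on the tuple description above.

```python
import csv

def load_dependencies(path):
    """Yield (sentence-id, type, head, dependent) tuples from the CSV
    delivery; the column order is an assumption for illustration."""
    with open(path, newline='') as f:
        for sent_id, rel, head, dep in csv.reader(f):
            yield sent_id, rel, head, dep

# e.g. collect all subjects of a given (lemmatized) verb
def subjects_of(path, verb):
    return {dep for sid, rel, head, dep in load_dependencies(path)
            if rel == 'subj' and head == verb}
```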
3
Relation Mining
As a first result, we want to show that the availability of domain terminology simplifies and improves the task of parsing the corpus. To this end we create a corpus which does not contain the original GENIA markup for domain terminology (we later refer to this corpus as the 'NOTERM' corpus). Second, we want to verify whether the parsing of the corpus can benefit from the existence of semantic tags. The idea is to allow the parser to decide on an ambiguous attachment based on the semantic type of the arguments. For instance, the decision of attaching an argument of type 'protein' as the subject of the verb 'bind' could be made on the basis of the type, rather than purely on the lexical item itself. The third result that we describe in this paper concerns the detection of specific relations by means of specific lexical classes and a small set of rules that describe specific syntactic patterns. This can be seen as partly similar to [3], which however makes use of surface POS-based patterns, while our patterns apply to the result of syntactic parsing. As an example consider the following GENIA sentence: NGFI-B/nur77 binds to the response element by monomer or heterodimer with retinoid X receptor (RXR). Based on interaction with a domain expert, we have identified a set of relations that are of particular interest in this domain. Some examples of relevant relations are: activate, bind, interact, regulate, encode and signal [4].

6 http://www.informatics.susx.ac.uk/research/nlp/carroll/morph.html
7 Available at http://www.ltg.ed.ac.uk/software/chunk/
8 http://ant.apache.org/
9 For convenience, we have provided a web interface that allows simplified browsing of the results; see http://www.ontogene.org/
Fig. 1. Analysis of the sentence “Anti-Ro(SSA) autoantibodies are associated with T cell receptor beta genes in systemic lupus erythematosus patients.”, where each term is represented by a term head
The use of a lemmatizer allows us to capture with a single pattern all morphological variants of a given verb (e.g. bind → bind, binds, binding, bound). For each of these relations, we have inspected some of the analyses that we obtained from parsing the corpus (see an example in Figure 1). The deep syntactic analysis builds upon the chunks, using a broad-coverage probabilistic Dependency Parser [5] to identify sentence-level syntactic relations between the heads of the chunks. The output is a hierarchical structure of syntactic relations (functional dependency structures), represented as the directed arrows in Fig. 1. The parser [5, 6] uses a hand-written grammar combined with a statistical language model that calculates lexicalized attachment probabilities, similar to [7]. Parsing is seen as a decision process: the probability of a total parse is the product of the probabilities of the individual decisions at each ambiguous point in the derivation. Two supervised models (based on Maximum Likelihood Estimation, MLE) are used. The first is based on lexical probabilities of the heads of phrases, calculating the probability of finding specific syntactic relations (such as subject, sentential object, etc.). The second probability model is a Probabilistic Context Free Grammar (PCFG) for the production of verb phrases. Although Context Free Grammars (CFG) are not a component of dependency grammar, verb phrase PCFG rules can model verb subcategorization frames, which are an important component of a dependency grammar. The parser expresses distinctions that are especially important for a predicate-argument based shallow semantic representation, as far as they are expressed in the Penn Treebank training data, such as PP-attachment, most long distance dependencies, relative clause anaphora, participles, gerunds, and the argument/adjunct distinction for NPs.

Table 1. The most important dependency types used by the parser

Relation            | Label   | Example
verb–subject        | subj    | he sleeps
verb–first object   | obj     | sees it
verb–second object  | obj2    | gave (her) kisses
verb–adjunct        | adj     | ate yesterday
verb–subord. clause | sentobj | saw (they) came
verb–prep. phrase   | pobj    | slept in bed
noun–prep. phrase   | modpp   | draft of paper
noun–participle     | modpart | report written
verb–complementizer | compl   | to eat apples
noun–preposition    | prep    | to the house
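A sketch of how predicate–subject–object candidates could be assembled from the subj and obj dependencies of Table 1, restricted to the relation verbs listed above; this is an illustration, not the OntoGene implementation, and it assumes lemmas have already been normalized by the pipeline.

```python
RELATION_VERBS = {'activate', 'bind', 'interact', 'regulate',
                  'encode', 'signal'}

def extract_triples(deps):
    """deps: iterable of (sentence-id, type, head, dependent) tuples.
    Pairs each relation verb's subject with its object per sentence."""
    subj, obj = {}, {}
    for sid, rel, head, dep in deps:
        if rel == 'subj':
            subj[(sid, head)] = dep
        elif rel == 'obj':
            obj[(sid, head)] = dep
    for (sid, verb), s in subj.items():
        if verb in RELATION_VERBS and (sid, verb) in obj:
            yield verb, s, obj[(sid, verb)]
```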
In some cases, distinctions between functional relations that are not expressed in the Penn Treebank are made. Commas, for example, are disambiguated between apposition and conjunction, and the Penn tag IN is disambiguated between preposition and subordinating conjunction. Other distinctions that are less relevant or not clearly expressed in the Treebank are left underspecified, such as the distinction between PP arguments and adjuncts, or a number of types of subordinate clauses. The parser is robust in that it returns the most promising set of partial structures when it fails to find a complete parse for a sentence. Its parsing speed is about 300,000 words per hour.
4
Evaluation
Two different types of evaluation have been performed: first a linguistic evaluation of the parser, then an evaluation of the biological significance of the extracted relations. In order to evaluate the various experiments mentioned in the previous section, we randomly selected 100 test sentences from the GENIA corpus, which we manually annotated for the syntactic relations that the parser can detect. We first ran the parser over the 100 test sentences as extracted from the NOTERM corpus, containing the chunks as generated by LTCHUNK but no information on terminology. We then performed the analysis over the same 100 sentences, this time extracted from the "DepGENIA" corpus. A comparison of the results is shown in Table 2.10

Table 2. Comparison of results of parsing under different conditions

Relation            | NOTERM | DepGENIA | semantic
subj (precision)    | 0.825  | 0.900    | 0.888
subj (recall)       | 0.744  | 0.862    | 0.846
obj (precision)     | 0.701  | 0.941    | 0.941
obj (recall)        | 0.772  | 0.949    | 0.949
nounpp (precision)  | 0.675  | 0.833    | 0.808
verbpp (precision)  | 0.671  | 0.817    | 0.770
sentobj (precision) | 0.630  | 0.711    | 0.692
sentobj (recall)    | 0.604  | 0.750    | 0.729

As a second experiment, we integrated PP-attachment modules [8, 9] using the GENIA corpus, because the original PP-training corpus (the Penn Treebank) is from a different domain. Against sparse data we back off to semantic GENIA classes. Our results do not show any improvement.11 In order to evaluate the specific task of Relation Extraction, we focused on triples of the form (predicate, subject, object). The analysis of the whole GENIA corpus resulted in 10072 such triples (records). For the evaluation of biological relevance we selected only the triples containing the following predicates: activate, bind and block. This resulted in 487 records.
10 The parser is constantly being improved; results obtained after the publication of this paper will be made available on the OntoGene web site.
11 This might be attributed to insufficient data or to the relative simplicity of the GENIA Ontology.
Table 3. Some examples of expert evaluation

relation | subj                      | subj type  | subj eval | obj                                                         | obj type               | obj eval
activate | Interleukin-2 (IL-2)      | amino acid | Y         | Stat5 in fresh PBL, and Stat3 and Stat5 in preactivated PBL | amino acid             | A+
activate | IL-5                      | amino acid | Y         | the Jak 2 -STAT 1 signaling pathway                         | other name             | Y
bind     | Spi-B                     | amino acid | Y         | DNA sequences                                               | nucleic acid           | A-
bind     | The higher affinity sites | other name | Pr        | CVZ with 20-                                                | other organic compound | N
The extraction algorithm maximally expands the arguments of the predicate, following all their dependencies. Each argument is then assigned a type (a concept of the GENIA Ontology) based on its head. The type assignment depends on the manual annotation performed by the GENIA annotators, so we have taken it as reliable and have not evaluated it further. We then removed all records where a type had not been assigned to either subject or object: this left 169 fully qualified records.¹² This remaining set was inspected by a domain expert. (A code sketch of this extraction step is given at the end of this section.)

¹² This step is meant to remove records where one of the arguments cannot be clearly assigned a type. This is generally caused by pronouns, which explains why in the error evaluation (see table 4) the number of pronouns appears so low.

In order to simplify the process of evaluation, we have created simple visualization tools (based on XML, CSS and CGI scripts) that can display the results in a browser. For instance, for the former type of evaluation, our visualization tool adds a special attribute to the sentences that have been detected by the methodology described previously. All the articles that contain relevant sentences are then automatically collected and displayed in a browser. The extracted relations can also be stored in a DB format for further processing with a spreadsheet tool or for analysis with Data Mining algorithms.

We asked the domain experts to evaluate each argument separately and mark it according to the following codes:

Y  – the argument is correct and informative
N  – the argument is completely wrong
Pr – the argument is correct, but is anaphoric and would need to be resolved to be significant (e.g. "This protein")
A+ – the argument is "too large" (which implies that a prepositional phrase has been erroneously attached to it)
A- – the argument is "too small" (which implies that an attachment has been omitted)

In table 3 we show as an example the evaluation of the following sentences:

– Interleukin-2 (IL-2) rapidly activated Stat5 in fresh PBL, and Stat3 and Stat5 in preactivated PBL.
– Thus, we demonstrated that IL-5 activated the Jak 2 -STAT 1 signaling pathway in eosinophils.
– Spi-B binds DNA sequences containing a core 5-GGAA-3 and activates transcription through this motif.
– The higher affinity sites bind CVZ with 20- to 50-fold greater affinity, consistent with CVZ's enhanced biological effects.

The evaluation resulted in the values shown in table 4.

Table 4. Distribution of errors

          Y     N    Pr   A+   A-
Subject   146   11   4    6    2
Object    99    1    4    59   6

The table clearly shows that the biggest source of error is overexpansion of the object; in addition, there is a small but not insignificant problem in the detection of the subject.¹³ Despite the errors, the results can be considered satisfactory, as they show 86.4% and 58.6% correct results in the detection of subjects and objects, respectively. If all loose cases are considered as positive (excluding only the 'N' cases), these results jump to 93.5% and 99.4%, respectively.

¹³ A close inspection of these cases points to problems with conjunctions in subject position, plus a specific problem with the construction "does not".

As well as improving the parser, we are currently adding facilities for the detection of the polarity and the modality of the relation. Another task being tackled is the treatment of nominalizations (e.g. "activation")¹⁴ and other morphological transformations of the relations of interest (e.g. "activators", "the activated protein", "co-activation"). Further, some spelling variants should be considered (e.g. "analyze" vs. "analyse" or "down-regulate" vs. "downregulate").

¹⁴ A simple inspection shows that "activation" makes up almost 50% of the occurrences of the stem "activat*".

In future we would like to apply Machine Learning techniques to the task of learning rules that implement transformations from syntactic structures (dependency relations) into domain-relevant semantic relations. We also intend to partner with experts in the task of Bio Entity Identification, or to use one of the available tools (some examples are mentioned in section 5), in order to move from simple experiments over the GENIA corpus to the real-world task of analyzing non-annotated MEDLINE documents. Another advanced application that we are working on is a Question Answering system over scientific literature in the domain of Genomics [10].
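The code sketch of the extraction step announced above follows. It is our illustration of the described behaviour (maximal expansion of each argument along its dependencies, then type assignment from the head); the dependency-graph format and the type lookup are assumptions, not the actual implementation:

```python
# Sketch: expand a predicate's arguments by following all their
# dependencies, then type each argument via the GENIA type of its head.

def expand(head, deps):
    """Collect the head and every token reachable through its dependents."""
    span, stack = set(), [head]
    while stack:
        tok = stack.pop()
        if tok not in span:
            span.add(tok)
            stack.extend(deps.get(tok, ()))
    return span

def extract_record(pred, subj_head, obj_head, deps, genia_types):
    """Build a (predicate, subject, object) record with GENIA types."""
    return {
        "pred": pred,
        "subj": expand(subj_head, deps),
        "subj_type": genia_types.get(subj_head),  # type taken from the head
        "obj": expand(obj_head, deps),
        "obj_type": genia_types.get(obj_head),
    }

# "Spi-B binds DNA sequences containing a core ..." (simplified)
deps = {"sequences": ["DNA", "containing"], "containing": ["core"]}
types = {"Spi-B": "amino acid", "sequences": "nucleic acid"}
rec = extract_record("bind", "Spi-B", "sequences", deps, types)
print(rec["pred"], sorted(rec["obj"]), rec["obj_type"])
# -> bind ['DNA', 'containing', 'core', 'sequences'] nucleic acid
```

Records in which either argument head lacks a GENIA type (the lookup returns None) would then be filtered out, corresponding to the reduction from 487 to 169 fully qualified records described above.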
5 Related Work
At present, very few NLP approaches in the Biomedical domain include full parsing. In the following we summarize a number of research projects that include syntactic parsing (to various degrees) for the Biomedical domain.
[11] presents experiments on parsing MEDLINE abstracts with Combinatory Categorial Grammar (CCG). Compared to state-of-the-art parsing speed, the system is too slow for practical application (13 minutes for 200 sentences). A small evaluation on 492 sentences containing the anchor verbs yields 80% precision and only 48% recall.

[12] describes a medical IE system that uses a Dependency Grammar. Only German versions of the parser are described. An evaluation with promising results is reported, but only on three low-level relations: auxiliaries, genitives and prepositional phrases.

[13] presents the full parsing approach as entirely novel to the Biomedical domain: "A full parsing approach has not been used in practical application" [ibid.]. The authors belong to the research group that has made the GENIA corpus available; they are currently building the GENIA treebank (which is not claimed to be error-free, but close to the output of their parser). They use a widely established formal grammar, HPSG, and they have shown expertise in robust parsing; [14] is probably the first HPSG parsing approach that scales up to the entire Treebank. [15] use the approach to find anchor verbs in medical corpora.

[4] describes a system (GENIES) which extracts and structures information about cellular pathways from the biological literature. The system relies on a term tagger using rules and external knowledge. The terms are combined in relations using a syntactic grammar and semantic constraints. It attempts to obtain a full parse to achieve high precision, but often backs off to partial parsing to improve recall. It groups the 125 anchor verbs into 14 semantic classes, and it even includes some nominalisations. Only a "pilot evaluation" on a single journal article is reported. The reported precision is 96% and recall 63%.

The PASTA system [16] uses a template-based Information Extraction approach, focusing on the roles of specific amino acid residues in protein molecules. Similar to our approach is the use of syntactic analysis resulting in a predicate-argument representation. On the basis of such a representation they also build a domain model which allows inferences based on multiple sentences. PASTA is perhaps the only parsing-based BioNLP system that has been given an extensive and thorough evaluation. Using the MUC-7 scoring system on the hard task of template recognition, they report 65% precision and 68% recall.

[17] processes MEDLINE articles (only titles and abstracts), focusing on relation identification. An advantage of their system is the discourse-level anaphora resolution module, which can resolve many cases of pronominal anaphora and anaphora of the sortal type (e.g. "the protein"), including multiple antecedents (e.g. "both enzymes"). Their evaluation is based on the inhibit relation. They do not use full parsing, but a finite-state cascade approach in which, for example, many PPs remain unattached. Their shallow parsing is nevertheless closer to a full parse than most other systems, because they include a subordinate clause level, sentential coordination and a flexible relation identification module.

[18, 19] report on their system MedScan, which involves full parsing. [18] contains a true broad-coverage evaluation of their syntax module, which was tested on 4.6 million sentences from PubMed.
Only 1.56 million of these sentences yield a parse, i.e. 34% coverage. Their system is impressive, but the syntactic analysis is not robust. [20] report 91% precision and 21% recall when extracting human protein interactions from MEDLINE using MedScan. In [19], they report that their recall is between 30% and 50%. A main reason for this relatively low recall is that "the coverage of MedScan grammar is about 51%, which means that information is extracted from only about half of the sentences" [ibid.].

[21] present a formal evaluation of parsing Biomedical texts with the Link Grammar Parser [22], a non-statistical, rule-based broad-coverage parser that does full parsing but delivers highly proprietary structures. [23] have already shown that the performance of Link Grammar is considerably below the state of the art. [21] report an overall dependency recall of 73.1%.
6 Conclusions
In this paper we have presented an approach to supporting the extraction of core information from scientific literature in the Biomedical domain. We have first described "DepGENIA", an enhanced version of the GENIA corpus, which has been automatically enriched with syntactic dependencies. The quality of these dependencies has then been evaluated over a randomly selected set of test sentences. We have also described a possible application of our approach to the extraction of semantic relations, suggested by a domain expert. Detailed results of the evaluation and lines of further work have been presented.
References

1. Kim, J., Ohta, T., Tateisi, Y., Tsujii, J.: GENIA Corpus – a Semantically Annotated Corpus for Bio-Textmining. Bioinformatics 19 (2003) 180–182
2. Rinaldi, F., Dowdall, J., Hess, M., Ellman, J., Zarri, G.P., Persidis, A., Bernard, L., Karanikas, H.: Multilayer Annotations in PARMENIDES. In: The K-CAP2003 Workshop on "Knowledge Markup and Semantic Annotation" (2003)
3. Ono, T., Hishigaki, H., Tanigami, A., Takagi, T.: Automated Extraction of Information on Protein-Protein Interactions from the Biological Literature. Bioinformatics 17 (2001) 155–161
4. Friedman, C., Kra, P., Yu, H., Krauthammer, M., Rzhetsky, A.: GENIES: a Natural-Language Processing System for the Extraction of Molecular Pathways from Journal Articles. Bioinformatics 17 (2001) 74–82
5. Schneider, G.: Extracting and Using Trace-Free Functional Dependencies from the Penn Treebank to Reduce Parsing Complexity. In: Proceedings of the Second Workshop on Treebanks and Linguistic Theories (TLT 2003), Växjö, Sweden (2003)
6. Schneider, G., Rinaldi, F., Dowdall, J.: Fast, Deep-Linguistic Statistical Minimalist Dependency Parsing. In: COLING-2004 Workshop on Recent Advances in Dependency Grammars, Geneva, Switzerland (2004)
7. Collins, M.: Head-Driven Statistical Models for Natural Language Parsing. PhD thesis, University of Pennsylvania, Philadelphia, USA (1999)
8. Hindle, D., Rooth, M.: Structural Ambiguity and Lexical Relations. Computational Linguistics 19 (1993) 103–120
9. Volk, M.: Combining Unsupervised and Supervised Methods for PP-Attachment Disambiguation. In: Proceedings of COLING 2002, Taipei (2002)
10. Rinaldi, F., Dowdall, J., Schneider, G., Persidis, A.: Answering Questions in the Genomics Domain. In: The ACL 2004 Workshop on Question Answering in Restricted Domains, Barcelona (2004)
11. Park, J.C., Kim, H.S., Kim, J.J.: Bidirectional Incremental Parsing for Automatic Pathway Identification with Combinatory Categorial Grammar. In: Proceedings of the Pacific Symposium on Biocomputing (PSB), Big Island, Hawaii, USA (2001)
12. Hahn, U., Romacker, M., Schulz, S.: Creating Knowledge Repositories from Biomedical Reports: The medSynDiKATe Text Mining System. In Altman, R.B., Dunker, A.K., Hunter, L., Lauderdale, K., Klein, T.E., eds.: Proceedings of the Pacific Symposium on Biocomputing, Kauai, Hawaii, USA (2002) 338–349
13. Yakushiji, A., Tateisi, Y., Miyao, Y.: Event Extraction from Biomedical Papers Using a Full Parser. In: Proceedings of the Pacific Symposium on Biocomputing, River Edge, N.J., World Scientific Publishing (2001) 408–419
14. Miyao, Y., Ninomiya, T., Tsujii, J.: Corpus-oriented Grammar Development for Acquiring a Head-driven Phrase Structure Grammar from the Penn Treebank. In: Proceedings of IJCNLP-04 (2004)
15. Yakushiji, A., Tateisi, Y., Miyao, Y., Tsujii, J.: Finding Anchor Verbs for Biomedical IE Using Predicate-Argument Structures. In: Companion Volume to the Proceedings of the 42nd Annual Meeting of the ACL, Barcelona, Spain (2004) 157–160
16. Gaizauskas, R., Demetriou, G., Artymiuk, P.J., Willett, P.: Protein Structures and Information Extraction from Biological Texts: The PASTA System. Bioinformatics 19 (2003) 135–143
17. Pustejovsky, J., Castaño, J., Zhang, J., Cochran, B., Kotecki, M.: Robust Relational Parsing over Biomedical Literature: Extracting Inhibit Relations. In: Pacific Symposium on Biocomputing (2002)
18. Novichkova, S., Egorov, S., Daraselia, N.: MedScan, a Natural Language Processing Engine for MEDLINE Abstracts. Bioinformatics 19 (2003) 1699–1706
19. Daraselia, N., Egorov, S., Yazhuk, A., Novichkova, S., Yuryev, A., Mazo, I.: Extracting Protein Function Information from MEDLINE Using a Full-Sentence Parser. In: Proceedings of the Second European Workshop on Data Mining and Text Mining in Bioinformatics, Pisa, Italy (2004)
20. Daraselia, N., Egorov, S., Yazhuk, A., Novichkova, S., Yuryev, A., Mazo, I.: Extracting Human Protein Interactions from MEDLINE Using a Full-Sentence Parser. Bioinformatics 19 (2003) 1–8
21. Pyysalo, S., Ginter, F., Pahikkala, T., Boberg, J., Järvinen, J., Salakoski, T., Koivula, J.: Analysis of Link Grammar on Biomedical Dependency Corpus Targeted at Protein-Protein Interactions. In: Proceedings of the COLING 2004 Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP), Geneva, Switzerland (2004)
22. Sleator, D., Temperley, D.: Parsing English with a Link Grammar. Technical Report CMU-CS-91-196, Carnegie Mellon University, School of Computer Science (1991)
23. Mollá, D., Hutchinson, B.: Intrinsic versus Extrinsic Evaluations of Parsing Systems. In: Proceedings of the EACL03 Workshop on Evaluation Initiatives in Natural Language Processing, Budapest (2003) 43–50
Author Index
Abu-Hanna, Ameen 53
Åhlfeldt, Hans 434
Akkaya, Cem 181
Albanèse, Jacques 111
Aleksovski, Zharko 241
Alexopoulou, Dimitra 256
Allart, Laurent 13
Andronis, Christos 535
Antonini, François 111
Atzmueller, Martin 453
Avesani, Paolo 141
Badaloni, Silvana 33
Balcevich, Robert 276
Bamgbade, Adenike 463
Baneyx, Audrey 231
Baral, Chitta 524
Baud, Robert 246
Baumeister, Joachim 453
Baumgartner, Richard 463, 468
Bellazzi, Riccardo 23
Berka, Petr 79
Bisson, Guy 385
Blaszczynski, Jerzy 429
Bohanec, Marko 414
Bonnardel, Nathalie 111
Bonten, Marc 48
Bosman, Robert-Jan 53
Boswell, Brian 375
Bottrighi, Alessio 58, 136, 151
Bouaud, Jacques 131
Bourbeillon, Julie 226
Boyer, Célia 251
Bradbrook, Kirsty 171
Bullimore, Mark 473
Castellani, Umberto 315
Cavallini, Anna 89
Cestnik, Bojan 414
Chambrin, Marie-Christine 13
Charlet, Jean 231
Chaudet, Hervé 111
Choi, Soo-Mi 353
Claveau, Vincent 236
Clay, Chris 289
Combi, Carlo 315
Cordier, Marie-Odile 484
Cox, Siân 494
Cristaldi, Gabriele 333
Dailey, Matthew 343
Dankel II, Douglas D. 94
Darmoni, Stéfan J. 251
Dasmahapatra, Srinandan 221
de Jonge, Evert 67
de Mol, Bas 67
Debeljak, Marko 414
Demšar, Janez 514
Distefano, Maria Luisa 333
Dojat, Michel 424
Dolenko, Brion 463
Dupplaw, David 221
Falda, Marco 33
Farion, Ken 429
Faro, Alberto 310
Fox, John 156, 171
Fromont, Élisa 484
Fuchsberger, Christian 101
Gaál, Balázs 419
Garbay, Catherine 226, 424
Gaudinat, Arnaud 251
Geissbühler, Antoine 246
Gill, Hans 434
Giordano, Daniela 310, 333
Giroud, Françoise 226
Glasspool, David 171
Griffiths, Richard 171
Guyet, Thomas 424
Haddawy, Peter 343
Heller, Barbara 266
Hemsing, Achim 453
Herre, Heinrich 266
Hess, Michael 535
Holmes, John H. 444
Hommersom, Arjen 161
Hu, Bo 221
Hunter, Jim 101
Ironi, Liliana 323
Jaulent, Marie-Christine 231
Jermol, Mitja 414
Jorna, René J. 400
Kabanza, Froduald 385
Kaewruen, Ploen 343
Kaiser, Katharina 181
Kaljurand, Kaarel 535
Karkaletsis, Vangelis 256
Kavšek, Branko 414
Keinduangjun, Jitimon 504
Kim, Jeong-Sik 353
Kim, Myoung-Hee 353
Kim, Seungchan 524
Kim, Yong-Guk 353
Klein, Michel 241
Kochin, Dmitry 276, 395
Kononenko, Igor 363
Konstanti, Ourania 535
Kopač, Tadeja 414
Kozmann, György 419
Kristmundsdóttir, María Ósk
Kukar, Matjaž 363
Kumar, Anand 213
Larizza, Cristiana 23
Laš, Vladimír 79
Lavrač, Nada 414
Lazarescu, Mihai 289
Leban, Gregor 514
Leonardi, Rosalia 333
Lewis, Paul 221
Lindsay, David G. 494
Lucas, Peter 48, 161
Lukowicz, Paul 7
Magni, Paolo 23
Maiorana, Francesco 333
Marcos, Mar 121, 146, 191
Marsolo, Keith 473
Martin, Claude 111
Marty, Johann 246
Măruşter, Laura 400
Mary, Vincent 251
Marzola, Pasquina 315
McCue, Paul 101
Michalowski, Wojtek 429
Micieli, Giuseppe 89
Miksch, Silvia 126, 146, 181
Milčinski, Metka 363
Molino, Gianpaolo 58, 151
Montani, Stefania 136, 151
Moreno, Antonio 409
Moser, Monika 126
Moskovitch, Robert 141
Mramor, Minca 514
Murino, Vittorio 315
Nannings, Barry 53
Névéol, Aurélie 251
Neville, Ron 375
Nikulin, Alexander 463
Ou, Monica 289
Panzarasa, Silvia 89
Papadimitriou, Elsa 256
Parthasarathy, Srinivasan 473
Patkar, Vivek 156
Peek, Niels 67
Peleg, Mor 156
Pellegrin, Liliane 111
Pennisi, Manuela 310
Pernice, Corrado 89
Persidis, Andreas 535
Piamsa-nga, Punpiti 504
Polo-Conde, Cristina 121, 146, 191
Poovorawan, Yong 504
Porreca, Riccardo 23
Pranckeviciene, Erinija 463
Puppe, Frank 453
Pur, Aleksander 414
Quaglini, Silvana 89
Quiniou, René 484
Ramati, Michael 43
Raudys, Šarūnas 468
Razavi, Amir R. 434
Reed, Chris 375
Richter, Ernst-Jürgen 453
Rinaldi, Fabio 535
Rogozan, Alexandrina 251
Rose, Tony 156
Rosenbrand, Kitty 121, 146
Rubin, Steven 429
Ruch, Patrick 246
Sacchi, Lucia 23
Sager, Jennifer A. 444
Šajn, Luka 363
Sánchez, David 409
Sarakhette, Natapope 343
Sbarbati, Andrea 315
Scarciofalo, Giacomo 310
Schmidt, Rainer 300
Schneider, Gerold 535
Schurink, Karin 48
Serban, Radu 121, 191, 201
Séroussi, Brigitte 131
Seyfang, Andreas 146
Shadbolt, Nigel 221
Shahar, Yuval 43, 166
Shahsavar, Nosrat 434
Sharshar, Samir 13
Simony-Lafontaine, Joëlle 226
Sliesoraitiene, Victoria 395
Sliesoraityte, Ieva 276
Smith, Barry 213
Snodgrass, Richard 58
Somorjai, Ray 463, 468
Sona, Diego 141
Spampinato, Concetto 310
Spyropoulos, Constantine D. 256
Steele, Rory 156
Stefanelli, Mario 89
Tbahriti, Imad 246
ten Teije, Annette 121, 191, 201
Tentoni, Stefania 323
Terenziani, Paolo 58, 136, 151
Thomson, Richard 156
Tomečková, Marie 79
Torchio, Mauro 58, 151
Tramontana, Francesco 310
Twa, Michael 473
Urbančič, Tanja 414
Ustinovichius, Leonas 276, 395
Valarakos, Alexandros G. 256
van Bommel, Patrick 161
van Croonenborg, Joyce 121
van der Weide, Theo 161
van Gendt, Marjolein 201
van Harmelen, Frank 3, 191, 201
Vassányi, István 419
Verduijn, Marion 67
Veuthey, Anne-Lise 246
Vieillot, Jean-Jacques 131
Visscher, Stefan 48
Voorbraak, Frans 67
Vorobieva, Olga 300
Weiss, Dawid 429
West, Geoff A.W. 289
Wilk, Szymon 429
Winstanley, Graham 171
Wittenberg, Jolanda 121, 146
Young, Ohad 166
Zampieri, Marco 315
Zhang, Xin 524
Zupan, Blaž 514
Zweigenbaum, Pierre 236