Compositional Data Analysis in the Geosciences: From Theory to Practice
The Geological Society of London Books Editorial Committee B. PANKHURST (UK) (CHIEF EDITOR)
Society Books Editors J. GREGORY (UK) J. GRIFFITHS (UK) J. HOWE (UK) P. LEAT (UK) N. ROBINS (UK) J. TURNER (UK)
Society Books Advisors M. BROWN (USA) E. BUFFETAUT (France) R. GIERÉ (Germany) J. GLUYAS (UK) D. STEAD (Canada) R. STEPHENSON (Netherlands)
Geological Society books refereeing procedures The Society makes every effort to ensure that the scientific and production quality of its books matches that of its journals. Since 1997, all book proposals have been refereed by specialist reviewers as well as by the Society's Books Editorial Committee. If the referees identify weaknesses in the proposal, these must be addressed before the proposal is accepted. Once the book is accepted, the Society Book Editors ensure that the volume editors follow strict guidelines on refereeing and quality control. We insist that individual papers can only be accepted after satisfactory review by two independent referees. The questions on the review forms are similar to those for Journal of the Geological Society. The referees' forms and comments must be available to the Society's Book Editors on request. Although many of the books result from meetings, the editors are expected to commission papers that were not presented at the meeting to ensure that the book provides a balanced coverage of the subject. Being accepted for presentation at the meeting does not guarantee inclusion in the book. More information about submitting a proposal and producing a book for the Society can be found on its web site: www.geolsoc.org.uk.
It is recommended that reference to all or part of this book should be made in one of the following ways: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) 2006. Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264. BUCCIANTI, A., TASSI, F. & VASELLI, O. 2006. Compositional changes in a fumarolic field, Vulcano Island: a statistical case study. In: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 67-77.
GEOLOGICAL SOCIETY SPECIAL PUBLICATION NO. 264
Compositional Data Analysis in the Geosciences: From Theory to Practice EDITED BY
A. BUCCIANTI Università degli Studi di Firenze, Italy
G. MATEU-FIGUERAS Universitat de Girona, Spain and V. PAWLOWSKY-GLAHN Universitat de Girona, Spain
2006 Published by The Geological Society London
THE GEOLOGICAL SOCIETY The Geological Society of London (GSL) was founded in 1807. It is the oldest national geological society in the world and the largest in Europe. It was incorporated under Royal Charter in 1825 and is Registered Charity 210161. The Society is the UK national learned and professional society for geology with a worldwide Fellowship (FGS) of over 9000. The Society has the power to confer Chartered status on suitably qualified Fellows, and about 2000 of the Fellowship carry the title (CGeol). Chartered Geologists may also obtain the equivalent European title, European Geologist (EurGeol). One fifth of the Society's fellowship resides outside the UK. To find out more about the Society, log on to www.geolsoc.org.uk. The Geological Society Publishing House (Bath, UK) produces the Society's international journals and books, and acts as European distributor for selected publications of the American Association of Petroleum Geologists (AAPG), the Indonesian Petroleum Association (IPA), the Geological Society of America (GSA), the Society for Sedimentary Geology (SEPM) and the Geologists' Association (GA). Joint marketing agreements ensure that GSL Fellows may purchase these societies' publications at a discount. The Society's online bookshop (accessible from www.geolsoc.org.uk) offers secure book purchasing with your credit or debit card. To find out about joining the Society and benefiting from substantial discounts on publications of GSL and other societies worldwide, consult www.geolsoc.org.uk, or contact the Fellowship Department at: The Geological Society, Burlington House, Piccadilly, London W1J 0BG: Tel. +44 (0)20 7434 9944; Fax +44 (0)20 7439 8975; E-mail: enquiries@geolsoc.org.uk. For information about the Society's meetings, consult Events on www.geolsoc.org.uk. To find out more about the Society's Corporate Affiliates Scheme, write to
[email protected] Published by The Geological Society from: The Geological Society Publishing House, Unit 7, Brassmill Enterprise Centre, Brassmill Lane, Bath BA1 3JN, UK (Orders: Tel. +44 (0)1225 445046, Fax +44 (0)1225 442836) Online bookshop: www.geolsoc.org.uk/bookshop
The publishers make no representation, express or implied, with regard to the accuracy of the information contained in this book and cannot accept any legal responsibility for any errors or omissions that may be made. © The Geological Society of London 2006. All rights reserved. No reproduction, copy or transmission of this publication may be made without written permission. No paragraph of this publication may be reproduced, copied or transmitted save with the provisions of the Copyright Licensing Agency, 90 Tottenham Court Road, London W1P 9HE. Users registered with the Copyright Clearance Center, 27 Congress Street, Salem, MA 01970, USA: the item-fee code for this publication is 0305-8719/06/$15.00.
British Library Cataloguing in Publication Data A catalogue record for this book is available from the British Library. ISBN-10: 1-86239-205-6 ISBN-13: 978-1-86239-205-2 Typeset by Techset Composition, Salisbury, UK Printed by The Cromwell Press, Wiltshire, UK
Distributors North America For trade and institutional orders: The Geological Society, c/o AIDC, 82 Winter Sport Lane, Williston, VT 05495, USA Orders: Tel +1 800-972-9892 Fax +1 802-864-7626 Email gsl.orders@aidcvt.com
For individual and corporate orders: AAPG Bookstore, PO Box 979, Tulsa, OK 74101-0979, USA Orders: Tel +1 918-584-2555 Fax +1 918-560-2652 Email
[email protected] Website http://bookstore.aapg.org India Affiliated East-West Press Private Ltd, Marketing Division, G-1/16 Ansari Road, Darya Ganj, New Delhi 110 002, India Orders: Tel. +91 11 2327-9113/2326-4180 Fax +91 11 2326-0538 E-mail
[email protected]
Contents
Preface vii
PAWLOWSKY-GLAHN, V. & EGOZCUE, J. J. Compositional data and their analysis: an introduction 1
Applications to the solution of real geological problems
Ó.KOVÁCS, L., KOVÁCS, G. P., MARTÍN-FERNÁNDEZ, J. A. & BARCELÓ-VIDAL, C. Major-oxide compositional discrimination in Cenozoic volcanites of Hungary 11
THOMAS, C. W. & AITCHISON, J. Log-ratios and geochemical discrimination of Scottish Dalradian limestones: a case study 25
GORELIKOVA, N., TOLOSANA-DELGADO, R., PAWLOWSKY-GLAHN, V., KHANCHUK, A. & GONEVCHUK, V. Discriminating geodynamical regimes of tin ore formation using trace element composition of cassiterite: the Sikhote-Alin case (Far Eastern Russia) 43
REYMENT, R. A. On stability of compositional canonical variate vector components 59
BUCCIANTI, A., TASSI, F. & VASELLI, O. Compositional changes in a fumarolic field, Vulcano Island, Italy: a statistical case study 67
WELTJE, G. J. Ternary sandstone composition and provenance: an evaluation of the 'Dickinson model' 79
Software and related issues
THIÓ-HENESTROSA, S. & MARTÍN-FERNÁNDEZ, J. A. Detailed guide to CoDaPack: a freeware compositional software 101
VAN DER BOOGAART, K. G. & TOLOSANA-DELGADO, R. Compositional data analysis with 'R' and the package 'compositions' 119
BREN, M. & BATAGELJ, V. Visualization of three- and four-part (sub)compositions with R 129
General theory and methods
EGOZCUE, J. J. & PAWLOWSKY-GLAHN, V. Simplicial geometry for compositional data 145
DAUNIS-I-ESTADELLA, J., BARCELÓ-VIDAL, C. & BUCCIANTI, A. Exploratory compositional data analysis 161
BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. Frequency distributions and natural laws in geochemistry 175
MARTÍN-FERNÁNDEZ, J. A. & THIÓ-HENESTROSA, S. Rounded zeros: some practical aspects for compositional data 191
BARRABÉS, E. & MATEU-FIGUERAS, G. Is the simplex open or closed? (some topological concepts) 203
Index 207
Preface
Compositions are positive vectors whose components represent the relative contributions of different parts to a whole; therefore their sum is a constant, usually 1 or 100. Compositions are a familiar and important kind of data for geologists because they appear in many geological datasets (chemical analyses, geochemical compositions of rocks, sand-silt-clay sediments, etc.). Since Karl Pearson wrote his famous paper on spurious correlation back in 1897, much has been said and written about the statistical analysis of compositional data, mainly by geologists such as Felix Chayes. His famous work concerned the G-2 granite sample, which is used for comparison and standardization of geochemical analytical techniques between different laboratories. As with most igneous (and metamorphic) rocks that have achieved a stable mineralogy and minimized chemical free energy, the number of minerals is limited by the phase rule to <6. In G-2 the minerals are, in order of decreasing abundance, plagioclase feldspar (43%), microcline feldspar (27%), quartz (21%), biotite (6%) and some minor accessory minerals. Following Chayes, the sympathetic variation between the percentage volumes of these minerals is predictable for quartz and plagioclase feldspar, showing a strong apparent inverse correlation since an increase in the proportion of one necessarily displaces the content of the other to some degree. Minor components may show an apparently positive correlation with a major one, e.g. biotite with plagioclase. Moreover, even modestly abundant components may show an apparent positive correlation, such as microcline with quartz. As Chayes pointed out, such variables pose intractable problems for conventional statistical methodologies, generating, for example, bias in the sign of the correlation values.
Consequently, the spurious positive correlations are just as predictable and misleading (and meaningless from a statistical point of view) as the negative ones, expected for a small number of major components. The most important step towards a solution of the issue was made in the early 1980s, when John Aitchison proposed the use of log-ratios. His seminal work has brought a completely new perspective to the statistical analysis of data in general, not only compositional in nature, but also strictly positive data, or a combination thereof.
This new approach is based on the idea that the sample space has a natural geometry, a geometry that is coherent with the intuitive concept of difference associated with the particular type of data. If we think that 5% is half of 10%, while 45% is 0.9 of 50%, why should we use methods which are based on the idea that the difference between 5% and 10% is the same as between 45% and 50%? Statistics is expected to make sense in our perception of the natural scale of data, and this is possible for compositional data using the log-ratio approach. There is a long history to the search for a proper approach to the statistical analysis of compositional data, with many publications warning of the potential misuse of standard statistical methods. Many publications have also illustrated the potential of the log-ratio approach, but, despite this, there are still many research groups who are not aware of the existence of a solution to the problem identified by Karl Pearson. We intend this Special Publication to be of interest to geologists using statistical methods with compositional data, mainly geochemists and petrologists. It includes an intuitive justification of the proposed methodology, presents case studies in different fields and describes freely available software. There is also a section for the mathematically skilled, for those who need to see the proof of the mathematical consistency of the methods used. This last aspect is necessary, since many advances have been made in the last 20 years, and there is no book available up to now which provides a synthesis of the progress made. In summary, the main aim of this book is to disseminate the state of the art in this field, emphasizing practical applications to the geological sciences. To introduce the reader to the subject coherently, the book comprises three general parts, following an introductory chapter.
The introductory chapter illustrates with simple examples the potential usefulness of the method and, at the same time, introduces the basic concepts common to the subsequent case studies. Part I 'Applications to the solution of real geological problems' presents the study of some real geological problems. It forms the core of the book, as it is devoted to illustrating the application of the new methodology in the investigation of real geological problems in several fields of research. In particular, it includes case studies concerning the chemistry of different geological matrices, such as minerals, rocks and sediments, as well as fluids, collected, respectively, in different geodynamical circumstances and environments. An example concerning research in palaeontology, in which reciprocal abundances of different species are usually managed, is also included.
Part II 'Software and related issues' provides the necessary tools for the better understanding and application of the methods proposed, i.e. computer programs with illustrative examples on how to use them. It includes contributions about available software to deal with compositional data, one of them developed in Visual Basic associated with Excel® and the other developed with the R package.
Part III 'General theory and methods' presents some mathematical contributions aimed at giving a useful summary and justifying the appropriateness of the methodology used. Using simple geological examples it shows how reasonable results and interpretations can be obtained with this methodology. Other issues addressed in the theoretical section are the definition of parametric models, and the treatment of missing values. It also includes a contribution on the special topology of the simplex, to help in the many discussions about whether the simplex is open or closed.
This book was initially based on contributions presented to sessions G13.01 and G03.08 at the International Geological Congress, which took place in Florence in August 2004. Session G13.01 dealt with compositional analysis, while G03.08 was devoted to the importance of statistical analysis in the solution of environmental problems. Later, it was opened up to other contributions with the aim of bringing more case studies, more software and a better insight into the mathematics behind the whole approach. The editors would like to express here their thanks for the effort made by all the authors.
A. Buccianti G. Mateu-Figueras V. Pawlowsky-Glahn
Compositional data and their analysis: an introduction
V. PAWLOWSKY-GLAHN¹ & J. J. EGOZCUE²
¹Departament Informàtica i Matemàtica Aplicada, Universitat de Girona, Campus Montilivi, P4, E-17071 Girona, Spain (e-mail: [email protected])
²Departament Matemàtica Aplicada III, Universitat Politècnica de Catalunya, Jordi Girona Salgado 1-3, C2, E-08034 Barcelona, Spain
Abstract: Compositional data are those which contain only relative information. They are parts of some whole. In most cases they are recorded as closed data, i.e. data summing to a constant, such as 100% - whole-rock geochemical data being classic examples. Compositional data have important and particular properties that preclude the application of standard statistical techniques on such data in raw form. Standard techniques are designed to be used with data that are free to range from −∞ to +∞. Compositional data are always positive and range only from 0 to 100, or any other constant, when given in closed form. If one component increases, others must, perforce, decrease, whether or not there is a genetic link between these components. This means that the results of standard statistical analysis of the relationships between raw components or parts in a compositional dataset are clouded by spurious effects. Although such analyses may give apparently interpretable results, they are, at best, approximations and need to be treated with considerable circumspection. The methods outlined in this volume are based on the premise that it is the relative variation of components which is of interest, rather than absolute variation. Log-ratios of components provide the natural means of studying compositional data. In this contribution the basic terms and operations are introduced using simple numerical examples to illustrate their computation and to familiarize the reader with their use.
This and the other papers in this volume are concerned with the statistical analysis of compositional data - that is, multivariate data in which the components represent some part of a whole. They are usually recorded in closed form, summing to a constant (e.g. one if measured in parts per unit, or 100 if measured in percentages). Such data are widespread in the geosciences and other disciplines, underpinning classification, discrimination and modelling. They include classical geochemical data, data corresponding to categories of sedimentary particle-size distributions, and any data turned into percentages (an operation termed closure) to facilitate comparison; for example, the proportions of fossil species in two or more assemblages. Compositional data have particular and important numerical properties that have major consequences for any statistical analysis. These have been elucidated and discussed by a number of authors since Karl Pearson first highlighted problems in the analysis of compositions in 1897 (Sarmanov & Vistelius 1959; Krumbein 1962; Chayes 1971; Butler 1979; Aitchison 1986; Davis 1986; Rock 1988; Rollinson 1995; Aitchison & Egozcue 2005). These fundamental properties are reviewed briefly below, in order to set the scene for the papers that follow.
Essential properties and their consequences
The properties peculiar to compositional data arise from the fact that they represent parts of some whole; therefore, they convey only relative information. They are always positive and usually constrained to a constant sum. Values for components or parts in compositional data are not free to range from −∞ to +∞ (as unconstrained variables are). This fact conditions the relationships that variables have to one another - they are not free to vary independently in the way that unconstrained variables can - and is manifest in their variance-covariance structure (Aitchison 1986, chapter 3). The constant sum constraint forces at least one covariance (and thus at least one correlation coefficient between elements) to be negative. If one correlation has to be negative, then none of the correlation coefficients between elements are free to range between −1 and +1. Thus, spurious correlations are induced by the fact that the data sum to a constant (or are closed) and there is a bias towards negative correlation. Consider the trivial case of a two-part composition summing to a constant: the correlation between the two elements in this composition must be −1.
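The effect of the constant sum constraint on covariances can be checked numerically. The following Python sketch is illustrative only: the simulated amounts, the seed and the helper names `close`, `cov` and `pearson` are assumptions introduced here and are not part of the original text. It closes independent positive amounts to 100 and reports the two-part correlation, together with the identity var(x1) + cov(x1, x2) + cov(x1, x3) = 0 that holds for any closed three-part dataset and forces at least one covariance to be negative.

```python
import math
import random


def close(parts, total=100.0):
    """Rescale positive parts so that they sum to `total` (closure)."""
    s = sum(parts)
    return [total * p / s for p in parts]


def cov(xs, ys):
    """Sample covariance of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient."""
    return cov(xs, ys) / math.sqrt(cov(xs, xs) * cov(ys, ys))


rng = random.Random(0)  # arbitrary fixed seed; the data are simulated

# Two-part case: independent raw amounts, closed to 100.
two = [close([rng.uniform(1, 10), rng.uniform(1, 10)]) for _ in range(200)]
r_two = pearson([c[0] for c in two], [c[1] for c in two])

# Three-part case: the raw amounts are again independent before closure.
three = [close([rng.uniform(1, 10) for _ in range(3)]) for _ in range(200)]
x1, x2, x3 = ([c[i] for c in three] for i in range(3))
var1, cov12, cov13 = cov(x1, x1), cov(x1, x2), cov(x1, x3)

print(f"two-part correlation: {r_two:.6f}")
print(f"var(x1) + cov(x1,x2) + cov(x1,x3) = {var1 + cov12 + cov13:.2e}")
print(f"cov(x1,x2) = {cov12:.3f}, cov(x1,x3) = {cov13:.3f}")
```

Because the closed parts sum to a constant, the covariance of any part with the sum of all parts is zero; the variance of that part is positive, so its covariances with the remaining parts must sum to a negative number.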
From: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 1-10. 0305-8719/06/$15.00
© The Geological Society of London 2006.
Further problems arise with regard to subcompositions, commonly extracted from full compositions and renormalized to 100% for further analysis. It turns out that the covariance relationships between elements in a subcomposition are not the same as they are between the same elements in the full composition, and there is no relationship between the two covariance structures (Aitchison 1986, chapter 2). This is known as subcompositional incoherence. Detailed discussions on the properties of compositional data and their consequences can be found in Aitchison (1984, 1986), Chayes (1971), Rock (1988) and Rollinson (1992), amongst others. However, the essential consequence of these properties is that standard statistical techniques devised for unconstrained random variables cannot be used to analyse compositional data in crude or raw form (e.g. as simple percentages). This includes all those statistical methods based on the covariance or correlation matrix of vectors of observations, such as factor analysis, discriminant analysis or principal component analysis. Warnings against such analyses are legion, but still go largely unheeded. This is understandable because, until John Aitchison's work of the 1980s, there was no clear way of overcoming the problems and because, in many cases, the application of standard methods gives apparently reasonable and geologically interpretable results. The fact remains, however, that many of the results of statistical analysis of compositional data undertaken using conventional methods are invalid because the methods are, at best, inappropriate to the data. Rock (1988, p. 203) described some of the problems that arise in treating compositional data with conventional statistical techniques:
(a) trends and clusters on petrological ternary and principal components diagrams can have little or no geological significance;
(b) skewness and leptokurtosis, properties of the shape of a distribution defined by a probability density function (e.g. the normal distribution), can be induced by closure;
(c) dendrograms produced by cluster analysis can be severely biased;
(d) results from discriminant analysis are likely to be illusory;
(e) any correlation coefficient will be affected to an unknown degree by spurious effects induced by the constant sum constraint;
(f) the results of tests of significance will be intrinsically flawed since they arise from techniques applied to data for which they were never designed to be used.
Despite the efforts primarily of Chayes (1960, 1971, 1983), a key problem with the analysis of compositional data by conventional methods is the nagging uncertainty that accompanies the results: it is not possible to distinguish between the spurious effects caused by the constant sum constraint and the effects that would be attributable to natural processes; elucidating the latter is, of course, the sole purpose of the analysis. In the authors' experience, some researchers have questioned the need to worry about the effects of the constant sum constraint, given that apparently reasonable and interpretable results arise from the application of conventional techniques. This might be a reasonable view were there no suitable methods for the analysis of compositional data. However, such methods have existed since the 1980s, essentially due to the work of statistician John Aitchison (e.g. Aitchison 1986). It is the purpose of this volume to explain these methods and their underlying basis in theory, and to provide tools for working geoscientists to use with their compositional datasets. In his discourse on the effects of the properties of closed data in geology, Rock (1988, p. 206) closed his discussion thus: 'The (log-ratio) method appears to be both robust and powerful, but its enormous potential in geology has only just begun to be realised'. It is the authors' hope that this volume will help this potential to be realized much more widely.
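Subcompositional incoherence, as described above, can be seen in a very small numerical experiment. The sketch below is a hedged illustration: the three four-part compositions are hypothetical values chosen for this example, and the helper names `close` and `pearson` are not taken from the original text. It compares the correlation between the first two parts in the full composition with the correlation between the same two parts after extracting and re-closing a three-part subcomposition, and it also compares the log-ratio of those two parts in both settings.

```python
import math


def close(parts, total=100.0):
    """Rescale positive parts so that they sum to `total` (closure)."""
    s = sum(parts)
    return [total * p / s for p in parts]


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical four-part compositions (each row sums to 100).
full = [
    [10.0, 20.0, 10.0, 60.0],
    [20.0, 30.0, 30.0, 20.0],
    [30.0, 40.0, 20.0, 10.0],
]

# Subcomposition of the first three parts, renormalized to 100.
sub = [close(row[:3]) for row in full]

# Correlation between parts 1 and 2, before and after forming the subcomposition.
r_full = pearson([r[0] for r in full], [r[1] for r in full])
r_sub = pearson([r[0] for r in sub], [r[1] for r in sub])

# The log-ratio of parts 1 and 2 is unaffected by the renormalization.
lr_full = [math.log(r[0] / r[1]) for r in full]
lr_sub = [math.log(r[0] / r[1]) for r in sub]

print(f"correlation in full composition: {r_full:.3f}")
print(f"correlation in subcomposition:   {r_sub:.3f}")
print("log-ratios identical:",
      all(abs(a - b) < 1e-12 for a, b in zip(lr_full, lr_sub)))
```

In the full composition the second part equals the first plus 10 in every row, so the raw correlation is +1 by construction; after re-closure the same two parts are only weakly correlated. The ratio of the two parts, and hence its logarithm, is the same in both cases, which is the property the log-ratio approach builds on.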
A new approach: relative variations and log-ratios
John Aitchison laid the foundations of a new approach to the statistical analysis of compositional data through work undertaken in the early 1980s (Aitchison 1981, 1982, 1983, 1984), culminating in the monograph that brought this work together (Aitchison 1986). He recognized that although new analytical methods had been developed for other forms of constrained data, such as directional data, whose vectors are confined by a unit length constraint, there were none for those data confined by the constant sum. He developed an approach based on log-ratios, recognizing that it is the relative magnitudes and variations of components, rather than their absolute values, that provide the key to analysing compositional data. This indicates that the closure constraint is not the key aspect, but that it is the scale that is important. In fact, the negative bias in the covariance structure can be explained as a consequence of the Euclidean foundation of classical statistics, where the scale is absolute, not relative. His idea has provided a completely new perspective, not only of compositional data, but of statistical analysis in general, relevant also to the analysis of other types of constrained data, such as strictly positive data. The remainder of this paper outlines the basic features of compositional data analysis that will be elaborated upon by the other contributions to this volume.
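The distinction between an absolute and a relative scale can be made concrete with the percentages used in the Preface (5% versus 10%, and 45% versus 50%). The short Python sketch below is only an illustration; the variable names are assumptions of this example. It contrasts the ordinary difference between two values with the logarithm of their ratio.

```python
import math

# Pairs of percentages for a single part, taken from the Preface's example.
pairs = [(5.0, 10.0), (45.0, 50.0)]

# Absolute scale: ordinary differences.
abs_diffs = [b - a for a, b in pairs]

# Relative scale: logarithm of the ratio of the two values.
log_ratios = [math.log(b / a) for a, b in pairs]

for (a, b), d, lr in zip(pairs, abs_diffs, log_ratios):
    print(f"{a:4.0f}% -> {b:4.0f}%: difference = {d:.1f}, ln(ratio) = {lr:.3f}")
```

On the absolute scale both changes are the same five percentage points, whereas on the relative scale the change from 5% to 10% (a doubling) is several times larger than the change from 45% to 50%.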
Compositional data: the sample space
In any statistical analysis of data it is important to recognize the sample space within which one's data lie. It is taken for granted that most data are represented by variables free to vary from −∞ to +∞ within Euclidean space and that classical statistical techniques provide a rich set of tools to deal with these sorts of data. However, compositional data occupy a restricted space where variables can vary only from 0 to 100, or any other given constant. Such a restricted space is known formally as a simplex. In the following discussion, only full compositional vectors are considered, including a residual part if necessary, and a compositional (row) vector of D parts, $\mathbf{x} = [x_1, x_2, \ldots, x_D]$, will always be a vector of strictly positive components summing to a constant. Zero components are not considered here, as they require a particular approach, as discussed by Martín-Fernández & Thió-Henestrosa (2006) in the third section of this volume. The D-part simplex, $S^D$, is a subset of D-dimensional real space. For D = 2, it can be represented as a line segment; for D = 3, as a triangle (the ubiquitous ternary diagram) and for D = 4, as a tetrahedron. Clearly, no graphical representations of simplices are possible beyond D = 4. Therefore, discussion is restricted to numerical examples with D ≤ 4 parts. However, the definitions, operations and interpretations are valid for any number of parts. At this juncture, it is important to emphasize that data in which the components do not sum exactly to a constant are not free from the constant sum constraint. The fact that there is commonly not an exact constant sum merely reflects measurement error and/or unanalysed components - particularly a feature of geochemical data. Recall that the problems of spurious correlation affect pairs of measurements: correlation coefficients, computed using the traditional Pearson product-moment correlation coefficient, will not change if a residual part is added to the vector of measurements.
For illustration consider three major oxides, SiO2 = 80, TiO2 = 5 and A1203 = 12, with units in weight percent. The residual with respect to 100% would be R = 100 - (80 + 5 + 12) = 3 and the four-part composition to be analysed would be [SiO2, TiO2, A1203, R]. If the residual part is not of interest, the subcomposition C[SiO2, TiO2, A1203] can be obtained from the four-part composition dividing by the sum of the three parts: I80.100 5.100 12:100] C[80,5,12] = [ c~'~ ' -9-7- ' 97 J = [82.5, 5.1, 12.4].
The subcomposition has three parts. Recalling the comments above concerning subcompositional incoherence, note that log-ratio methods lead to consistent results, whether one works with the full four-part composition or with the three-part subcomposition. The variance-covariance relationships between raw components in the four-part composition (or amongst the three-part subcomposition without the residual part and without closure) would be different and bear no relationship to those for raw components for the three-part subcomposition after closure.

A final comment on units is important. Besides the above-mentioned units, there are other types of units that do not close to a constant, such as molar or molal composition. They are a particular type of compositional data that can be handled with the same log-ratio methods (Buccianti & Pawlowsky-Glahn 2005). The strategy is simple: convert the data to weight percent, compute the results, then convert them back into the desired units.

Log-ratio representations

Any multivariate statistical analysis on compositional data could be performed directly using the Euclidean vector space structure defined in the next section. Although mathematically quite straightforward, it is not easy to follow: the operations are unusual and even scientists trained in using them have difficulties. It is therefore easier to use an alternative based on an appropriate log-ratio representation, and several are available. Aitchison (1982) introduced the additive-log-ratio (alr) and centred-log-ratio (clr) transformations and Egozcue et al. (2003) the isometric-log-ratio (ilr) transformation. Using these transformations, a composition is represented as a real vector. Nowadays they are termed coordinates, or sometimes coefficients, because this is a more natural terminology when working with the vector space structure of the simplex. Also, it emphasizes the mathematical implications of the vector space approach, which has essential characteristics different from the classical log-ratio approach based on transformations between a constrained sample space and the unconstrained real space.

Numerically, coordinates or coefficients are easy to compute. For the three-part composition $\mathbf{x} = [80, 15, 5] \in S^3$, the three representations would be written as a vector with two components for the alr and ilr coordinates, and with three components for the clr coefficients, as follows:

$$
\begin{aligned}
\mathrm{alr}(\mathbf{x}) &= \mathrm{alr}[80, 15, 5] = \left[\ln\frac{80}{5},\ \ln\frac{15}{5}\right] = [\ln(16),\ \ln(3)] = [2.77,\ 1.10];\\
\mathrm{clr}(\mathbf{x}) &= \mathrm{clr}[80, 15, 5] = \left[\ln\frac{80}{(80\cdot 15\cdot 5)^{1/3}},\ \ln\frac{15}{(80\cdot 15\cdot 5)^{1/3}},\ \ln\frac{5}{(80\cdot 15\cdot 5)^{1/3}}\right]\\
&= \left[\ln\frac{80}{(6000)^{1/3}},\ \ln\frac{15}{(6000)^{1/3}},\ \ln\frac{5}{(6000)^{1/3}}\right] = \left[\ln\frac{80}{18.17},\ \ln\frac{15}{18.17},\ \ln\frac{5}{18.17}\right]\\
&= [\ln(4.40),\ \ln(0.83),\ \ln(0.28)] = [1.48,\ -0.19,\ -1.29];\\
\mathrm{ilr}(\mathbf{x}) &= \mathrm{ilr}[80, 15, 5] = \left[\frac{1}{\sqrt{2}}\ln\frac{80}{15},\ \frac{1}{\sqrt{6}}\ln\frac{80\cdot 15}{5^{2}}\right] = \left[\frac{1}{\sqrt{2}}\ln(5.33),\ \frac{1}{\sqrt{6}}\ln(48)\right]\\
&= \left[\frac{1.67}{\sqrt{2}},\ \frac{3.87}{\sqrt{6}}\right] = [1.18,\ 1.58].
\end{aligned}
\tag{2}
$$
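The three representations of the worked example can be checked numerically. The following is a minimal sketch, not a general implementation: the function names are illustrative, `ilr3` is restricted to three parts and uses the same pair of log-contrasts as the example above, and `ilr3_inverse` is the corresponding back-transformation, assumed here to close the result to 100.

```python
import math

def alr(x):
    # additive log-ratio: every part divided by the last one
    return [math.log(p / x[-1]) for p in x[:-1]]

def clr(x):
    # centred log-ratio: every part divided by the geometric mean
    g = math.exp(sum(math.log(p) for p in x) / len(x))
    return [math.log(p / g) for p in x]

def ilr3(x):
    # isometric log-ratio for three parts, same basis as the worked example
    x1, x2, x3 = x
    return [math.log(x1 / x2) / math.sqrt(2),
            math.log(x1 * x2 / x3 ** 2) / math.sqrt(6)]

def ilr3_inverse(z, total=100.0):
    # back-transformation: rebuild clr coefficients, exponentiate, close
    z1, z2 = z
    c = [z1 / math.sqrt(2) + z2 / math.sqrt(6),
         -z1 / math.sqrt(2) + z2 / math.sqrt(6),
         -2.0 * z2 / math.sqrt(6)]
    parts = [math.exp(v) for v in c]
    s = sum(parts)
    return [total * p / s for p in parts]

x = [80, 15, 5]
print([round(v, 2) for v in alr(x)])
print([round(v, 2) for v in clr(x)])
print([round(v, 2) for v in ilr3(x)])
print([round(v, 2) for v in ilr3_inverse(ilr3(x))])
```

The clr coefficients sum to zero, and applying `ilr3_inverse` to the ilr coordinates returns the original composition, which is a convenient sanity check when coding these transformations by hand.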
The corresponding general equations for any number of parts can be found either in the references cited in the previous paragraph, or in other contributions to this book, particularly in Egozcue & Pawlowsky-Glahn (2006) in the third section. Note that equations are written, here and in subsequent sections, using natural logarithms. Equivalent equations using logarithms to other bases could be developed, but the natural logarithm is the standard approach in statistics. The three representations have different properties.

(a) The alr coordinates are simple: D − 1 of the components are divided by the remaining component and logarithms taken (Aitchison 1986, p. 135). In the above example the last component is used as the denominator, but it could be any of the components. The resulting log-ratios are real variables that can be analysed using standard statistical techniques. However, it is important to note that these coordinates cannot be mapped onto orthogonal axes because the axes are actually at 60°, something that might occasionally cause problems (Egozcue & Pawlowsky-Glahn 2005). A typical example is that it is not possible to use the usual inner product and distances on alr-transformed observations to determine either the angle between two vectors or their Aitchison distance (see below); modified methods of calculating the inner products and distances are required.

(b) The clr coefficients are obtained by dividing the components by the geometric mean of the parts and then taking logarithms (Aitchison 1986, p. 79). They are not as simple to interpret as the alr coordinates, but they are still real variables and have certain advantages: the expression is symmetric in the parts and these coordinates reduce the computation of Aitchison distances to ordinary distances. They are useful in the computation of biplots (Aitchison & Greenacre 2002), a multivariate exploratory data analysis tool described by Daunis-i-Estadella et al. in the third section of this book. Unfortunately, clr coefficients have one major drawback in that they necessarily sum to zero. This means that clr-transformed observations lie on a plane in D-dimensional real space. Working in this geometric setting requires a certain care because the covariance and correlation matrices are singular (their determinant is necessarily zero).

(c) Expressions for the calculation of ilr coordinates are more complex and there are different rules on how to generate them (Egozcue et al. 2003). In addition, they are not always easy to interpret geologically, although some new strategies based on balances, briefly described below and in the third section of this book, are very promising. Their advantage lies in the fact that they are coordinates in an orthogonal system and, thus, any classical multivariate statistical technique can be used straightaway to study them.

Balances are a particular form of ilr coordinates (Egozcue & Pawlowsky-Glahn 2005). They frequently have a simple interpretation as they represent the relative variation of two groups of parts. For example, consider the composition of a sediment described in terms of the proportions of coarse sand ($S_c$), fine sand ($S_f$), silt ($S$), and clay ($C$). Assume interest lies in the relative proportion of coarse and fine sand, ($S_c$, $S_f$), to silt and clay, ($S$, $C$), the relative proportion of coarse sand, $S_c$, to fine sand, $S_f$, and the relative proportion of silt, $S$, to clay, $C$. The analysis would require three balances:

$$
\begin{aligned}
b_1 &= b(S_c, S_f;\ S, C) = \frac{1}{2}\ln\frac{S_c\cdot S_f}{S\cdot C};\\
b_2 &= b(S_c;\ S_f) = \frac{1}{\sqrt{2}}\ln\frac{S_c}{S_f};\\
b_3 &= b(S;\ C) = \frac{1}{\sqrt{2}}\ln\frac{S}{C}.
\end{aligned}
\tag{3}
$$
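The three balances of equation (3) translate directly into code. The sketch below is illustrative only: the function names are not from any library, and the two compositions are the (Sc, Sf, S, C) samples used for illustration in this section. Because balances are orthonormal coordinates, the Aitchison distance is assumed to reduce to the ordinary Euclidean distance between balance vectors.

```python
import math

def balances(sc, sf, s, c):
    # b1: sand (coarse, fine) against silt and clay
    b1 = 0.5 * math.log((sc * sf) / (s * c))
    # b2: coarse sand against fine sand
    b2 = math.log(sc / sf) / math.sqrt(2)
    # b3: silt against clay
    b3 = math.log(s / c) / math.sqrt(2)
    return (b1, b2, b3)

def aitchison_distance(u, v):
    # Euclidean distance between the balance coordinates of two samples
    return math.dist(balances(*u), balances(*v))

x1 = (20, 30, 40, 10)
x2 = (25, 35, 30, 10)
print([round(b, 4) for b in balances(*x1)])
print([round(b, 4) for b in balances(*x2)])
print(round(aitchison_distance(x1, x2), 4))
```

Multiplying either sample by a constant (for example, switching from percentages to proportions) leaves the balances unchanged, since only ratios of parts enter the formulas.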
For illustration consider two samples, x1 = (20, 30, 40, 10) and x2 = (25, 35, 30, 10), where the parts represent their (Sc, Sf, S, C) composition. The three balance coordinates of each sample are

$$
\begin{aligned}
b_1(\mathbf{x}_1) &= \frac{1}{2}\ln\frac{20\cdot 30}{40\cdot 10} = 0.2027; &
b_1(\mathbf{x}_2) &= \frac{1}{2}\ln\frac{25\cdot 35}{30\cdot 10} = 0.5352;\\
b_2(\mathbf{x}_1) &= \frac{1}{\sqrt{2}}\ln\frac{20}{30} = -0.2867; &
b_2(\mathbf{x}_2) &= \frac{1}{\sqrt{2}}\ln\frac{25}{35} = -0.2379;\\
b_3(\mathbf{x}_1) &= \frac{1}{\sqrt{2}}\ln\frac{40}{10} = 0.9803; &
b_3(\mathbf{x}_2) &= \frac{1}{\sqrt{2}}\ln\frac{30}{10} = 0.7768.
\end{aligned}
\tag{4}
$$

The coefficients, 1/2 and 1/√2, in the expression of the balances are normalizing constants, necessary to guarantee that balance coordinates have good mathematical properties and all classical multivariate methods can be applied directly. For instance, distances between samples can be computed simply with these coordinates using the usual Euclidean distance. In fact, the squared compositional or Aitchison distance between samples x1 and x2 is (Aitchison 1986, p. 193)

$$
d_a^2(\mathbf{x}_1, \mathbf{x}_2) = (0.2027 - 0.5352)^2 + (-0.2867 - (-0.2379))^2 + (0.9803 - 0.7768)^2 = 0.1543,
\tag{5}
$$

and taking the square root $d_a(\mathbf{x}_1, \mathbf{x}_2) = 0.3928$ is obtained. The advantage of balances for interpretation is that they describe the relative behaviour between groups of parts, e.g. b(Sc, Sf; S, C) compares (Sc, Sf) and (S, C), and within groups of parts, e.g. b(Sc; Sf) and b(S; C). From the mathematical point of view, they define the coordinates of the samples within an orthogonal system of axes, i.e. they are usual random variables in real space. Thus, any multivariate statistical analysis can be undertaken using those coordinates.

Basic operations

Two operations are standard in compositional data analysis: perturbation, denoted by ⊕, the operation of change, and powering, here denoted by ⊙, although sometimes the symbol ⊗ is also used (Aitchison 1986). As summarized in another contribution in the third section of this book, these operations underpin the complete algebraic-geometric structure of the simplex (Barceló-Vidal et al. 2001; Billheimer et al. 2001; Pawlowsky-Glahn & Egozcue 2001).

The following example shows how perturbation operates. Consider these two compositional three-part (row) vectors, x = [80, 15, 5] and y = [50, 30, 20]. To perturb x by y, first calculate the component-wise product and then close the result to 100 to produce z:

$$
\begin{aligned}
\mathbf{x} \oplus \mathbf{y} &= C[80\cdot 50,\ 15\cdot 30,\ 5\cdot 20] = C[4000,\ 450,\ 100]\\
&= \left[\frac{100\cdot 4000}{4550},\ \frac{100\cdot 450}{4550},\ \frac{100\cdot 100}{4550}\right] = [87.91,\ 9.89,\ 2.20] = \mathbf{z}.
\end{aligned}
\tag{6}
$$

From the above equations it is clear that to undo the perturbation of x by y, the result z has to be divided component-wise by y:

$$
\mathbf{z} \ominus \mathbf{y} = C\left[\frac{87.91}{50},\ \frac{9.89}{30},\ \frac{2.20}{20}\right] = C[80,\ 15,\ 5] = \mathbf{x}.
\tag{7}
$$

⊖y is called the inverse element of y. To understand practically what perturbation is, assume the parts of x represent proportions of sand, silt and clay in a sediment, and the parts of y represent the proportions of each that are left after some erosional process. The proportions that one would observe in the remaining sediment after erosion are given by the result x ⊕ y = [87.91, 9.89, 2.20]. Note that the same composition could also be observed if the process is not erosional, but depositional. To see that it is so, express the amounts present after deposition as percentages of the amounts that were already there, e.g. y = [250, 150, 100]. This can be interpreted in the following way: no additional clay has been brought in (i.e. perturbing by 1), the amount of silt added is one half of the silt that was already on site (i.e. 1.5 times as much) and the amount of sand added is 1.5 times what was there (i.e. 2.5 times as much). The observed proportions would be again x ⊕ y = [87.91, 9.89, 2.20]. The corollary is that compositional data will not indicate if the process is additive or subtractive; one requires additional independent information to determine the nature of the process. However, perturbation can be used to quantify the change resulting from processes acting on compositions. Finally, closure is implicit when compositions are perturbed. Perturbation can be performed with any number of parts, and it has some important mathematical properties, which shall not be developed here further, except for the idea of repeated perturbation, because of its important consequences for the modelling of processes.
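Perturbation, its inverse, and the erosion/deposition ambiguity described above can be reproduced in a few lines. This is a minimal sketch with illustrative function names, assuming closure to 100 throughout.

```python
def closure(parts, total=100.0):
    # rescale positive parts so that they sum to `total`
    s = sum(parts)
    return [total * p / s for p in parts]

def perturb(x, y):
    # component-wise product followed by closure
    return closure([a * b for a, b in zip(x, y)])

def inverse(y):
    # inverse element of y under perturbation
    return closure([1.0 / b for b in y])

x = [80, 15, 5]
y_erosion = [50, 30, 20]       # proportions left after erosion
y_deposit = [250, 150, 100]    # amounts after deposition, relative to before

z = perturb(x, y_erosion)
print([round(v, 2) for v in z])
print([round(v, 2) for v in perturb(x, y_deposit)])
print([round(v, 2) for v in perturb(z, inverse(y_erosion))])
```

The first two printed compositions coincide because the erosional and depositional vectors differ only by a constant factor, which closure removes; the third line recovers x by perturbing z with the inverse of y.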
Perturbing composition x by itself leads to x ⊕ x = [80, 15, 5] ⊕ [80, 15, 5] = C[80², 15², 5²], and perturbing again by x leads to x ⊕ x ⊕ x = C[80³, 15³, 5³]. This procedure can be repeated as many times as necessary, leading to the concept of perturbing one element t times by itself,

$$
t \odot \mathbf{x} = t \odot [80,\ 15,\ 5] = C[80^{t},\ 15^{t},\ 5^{t}].
\tag{8}
$$

It is clear why this operation is called powering. This idea can be generalized to any real number for t. The mathematical properties of powering, together with the properties of perturbation, provide for a vector space structure within the simplex $S^D$. In a vector space, it is possible to draw geometric elements, such as straight lines. Straight lines in $S^D$ (strictly, compositional straight lines) are obtained using perturbation, powering, a starting point $\mathbf{x}_0$ and a direction $\mathbf{v}$, using an equation of the following form,

$$
\mathbf{x}(t) = \mathbf{x}_0 \oplus (t \odot \mathbf{v}).
\tag{9}
$$

Compositions in $S^3$, i.e. three-part compositions, can be represented in the familiar ternary diagram. Take, for example, $\mathbf{x}_0$ = [54, 26, 20] and $\mathbf{v}$ = [36, 13, 51], where $\mathbf{x}_0$ is the starting point. For t = 1, 2, 3 this results in:

$$
\begin{aligned}
\mathbf{x}(1) &= [54, 26, 20] \oplus (1 \odot [36, 13, 51]) = C[54\cdot 36^{1},\ 26\cdot 13^{1},\ 20\cdot 51^{1}] = [56.9,\ 6.8,\ 36.3];\\
\mathbf{x}(2) &= [54, 26, 20] \oplus (2 \odot [36, 13, 51]) = C[54\cdot 36^{2},\ 26\cdot 13^{2},\ 20\cdot 51^{2}] = [47.6,\ 1.4,\ 51.0];\\
\mathbf{x}(3) &= [54, 26, 20] \oplus (3 \odot [36, 13, 51]) = C[54\cdot 36^{3},\ 26\cdot 13^{3},\ 20\cdot 51^{3}] = [35.6,\ 0.3,\ 64.1].
\end{aligned}
\tag{10}
$$

Fig. 1. Compositional straight line starting at x0 = [54, 26, 20] with direction v = [36, 13, 51] and four points on it.

Figure 1 shows the four points and the corresponding compositional straight line. Passing a smooth line through x0, x(1), x(2) and x(3) indicates a trend, possibly implying a process (e.g. Fe-enrichment trends in tholeiitic basalts), which can be extended in both directions using larger positive values or negative values for t. The importance of such straight lines is that they are the basic linear models that appear when performing any statistical analysis with compositional data. They can be interpreted as compositional linear processes describing natural phenomena.

Other crucial elements in geometry are distances, moduli and angles. The mathematical operation that underpins their definition is the inner product. For compositional data, the Aitchison inner product is simply the usual inner product applied either to all possible pairwise log-ratios, or to the ilr coordinates, or to the clr coefficients. For x = x0 = [54, 26, 20] and y = x(3) = [35.6, 0.3, 64.1] it is computed as

$$
\langle \mathbf{x}, \mathbf{y} \rangle_a = \frac{1}{3}\left[\ln\frac{54}{26}\cdot\ln\frac{35.6}{0.3} + \ln\frac{54}{20}\cdot\ln\frac{35.6}{64.1} + \ln\frac{26}{20}\cdot\ln\frac{0.3}{64.1}\right] = 0.5.
\tag{11}
$$

From the Aitchison inner product, a norm (the length of a vector or modulus) and a distance (i.e. a measure of difference between vectors) are derived. The squared Aitchison norm is computed as the Aitchison inner product of a vector with itself:

$$
\|\mathbf{x}\|_a^2 = \langle \mathbf{x}, \mathbf{x} \rangle_a = 0.707^2; \qquad \|\mathbf{y}\|_a^2 = \langle \mathbf{y}, \mathbf{y} \rangle_a = 4.301^2,
\tag{12}
$$

and the Aitchison distance between two elements as the norm of the difference between them,

$$
d_a^2(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} \ominus \mathbf{y}\|_a^2 = \frac{1}{3}\left[\left(\ln\frac{54/35.6}{26/0.3}\right)^2 + \left(\ln\frac{54/35.6}{20/64.1}\right)^2 + \left(\ln\frac{26/0.3}{20/64.1}\right)^2\right] = 4.22^2.
\tag{13}
$$

As y = x(3) has been obtained perturbing x = x0 three times by v, $d_a(\mathbf{x}, \mathbf{y})$ is three times $\|\mathbf{v}\|_a$. This result quantifies the difference between x and y, something that is sometimes difficult to perceive in a ternary diagram, as can be seen in Figure 1.
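Powering, compositional lines and the Aitchison inner product can be sketched as follows. The function names are illustrative. The inner product is computed from clr coefficients, which is one of the equivalent forms mentioned above; the starting point `x0_demo` and direction `v_demo` used for the line property are hypothetical values chosen only for the demonstration. Because y is quoted to one decimal place, intermediate norms computed from the rounded values may differ slightly from the figures in the text.

```python
import math

def closure(parts, total=100.0):
    s = sum(parts)
    return [total * p / s for p in parts]

def perturb(x, y):
    return closure([a * b for a, b in zip(x, y)])

def power(t, x):
    # powering: raise every part to the power t, then close
    return closure([p ** t for p in x])

def clr(x):
    logs = [math.log(p) for p in x]
    mean = sum(logs) / len(logs)
    return [v - mean for v in logs]

def inner(x, y):
    # Aitchison inner product as the ordinary inner product of clr coefficients
    return sum(a * b for a, b in zip(clr(x), clr(y)))

def norm(x):
    return math.sqrt(inner(x, x))

def distance(x, y):
    return math.dist(clr(x), clr(y))

# powering by 2 is the same as perturbing a composition by itself
x = [80, 15, 5]
print(power(2, x), perturb(x, x))

# inner product and angle for the two compositions of equation (11)
x0 = [54, 26, 20]
y = [35.6, 0.3, 64.1]
ip = inner(x0, y)
angle = math.degrees(math.acos(ip / (norm(x0) * norm(y))))
print(round(ip, 2), round(angle, 1))

# on a compositional line, the distance from the start is |t| times the norm of v
x0_demo, v_demo, t = [30, 30, 40], [10, 20, 70], 3   # hypothetical values
point = perturb(x0_demo, power(t, v_demo))
print(distance(x0_demo, point), t * norm(v_demo))
```

The last line prints two equal numbers, which is the property used in the text to relate the distance between x and x(3) to the norm of the direction v.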
The norm, $\|\mathbf{x}\|_a$, may be viewed as the distance of x from the origin of a linear space. In the example, $\|\mathbf{x}\|_a$ is the Aitchison distance of the observation x from the barycentre of the ternary diagram [1/3, 1/3, 1/3]. For real vectors, the cosine of the angle between them is calculated from their inner product over the product of their norms. This also holds for compositional vectors, using instead the Aitchison norm and inner product. If the cosine between two compositions, x and y, is null (their inner product is null), then the compositional lines whose directions are defined by x and y are orthogonal. For example, the cosine of the angle between x = x0 and y = x(3), measured from the barycentre of the ternary diagram, is:

$$
\cos\alpha = \frac{\langle \mathbf{x}, \mathbf{y} \rangle_a}{\|\mathbf{x}\|_a\,\|\mathbf{y}\|_a} = \frac{0.5}{0.707\cdot 4.301} = 0.164,
\tag{14}
$$

and the angle is α = 80.5°. The geometric structure in the simplex, defined by the above operations, leads to a Euclidean space structure, a structure which allows the definition of a full geometry, completely equivalent to the geometry in the corresponding unrestricted real space (for D parts the corresponding real space has (D − 1) dimensions). This geometry, defined for the simplex, is termed here Aitchison geometry.

Data centring

When displaying three-part compositional data in a ternary diagram, one matter that frequently causes problems is the concentration of data points in a corner (one component dominant) or along a border (two components dominant). As a consequence, it is often difficult to observe any structure (or the lack of it) in the data. Data plotting in such a manner can be visualized more readily by the mechanism of centring, which is simply a special case of perturbation (von Eynatten et al. 2002). Centring consists of perturbing all the data by the same element, usually the inverse of the centre of a dataset. The centre is defined as the closed geometric mean of the sample, represented by a data matrix X = [X1, X2, X3],

$$
\mathrm{cen}(\mathbf{X}) = C[g(X_1),\ g(X_2),\ g(X_3)].
\tag{15}
$$

For illustration, consider the following simple data matrix, which contains five cases (rows) with three parts each (columns):

$$
\mathbf{X} = [X_1, X_2, X_3] =
\begin{bmatrix}
0.45 & 46.41 & 53.14\\
4.73 & 11.15 & 84.12\\
8.02 & 6.75 & 85.23\\
12.39 & 4.19 & 83.42\\
37.99 & 0.74 & 61.28
\end{bmatrix}.
\tag{16}
$$

The geometric mean of each part is:

$$
\begin{aligned}
g(X_1) &= (0.45\cdot 4.73\cdot 8.02\cdot 12.39\cdot 37.99)^{1/5} = 6.05;\\
g(X_2) &= (46.41\cdot 11.15\cdot 6.75\cdot 4.19\cdot 0.74)^{1/5} = 6.40;\\
g(X_3) &= (53.14\cdot 84.12\cdot 85.23\cdot 83.42\cdot 61.28)^{1/5} = 72.09.
\end{aligned}
\tag{17}
$$

The resulting three geometric means are then rescaled by closing to 100, giving an estimate of the centre of the three compositions:

$$
\mathrm{cen}(\mathbf{X}) = C[6.05,\ 6.40,\ 72.09] = [7.15,\ 7.58,\ 85.27].
\tag{18}
$$

The inverse of the estimated centre is:

$$
\ominus\,\mathrm{cen}(\mathbf{X}) = C\left[\frac{1}{7.15},\ \frac{1}{7.58},\ \frac{1}{85.27}\right] = [49.30,\ 46.56,\ 4.14].
\tag{19}
$$

Now each observation can be perturbed by the inverse of the centre:

$$
\begin{bmatrix}
0.45 & 46.41 & 53.14\\
4.73 & 11.15 & 84.12\\
8.02 & 6.75 & 85.23\\
12.39 & 4.19 & 83.42\\
37.99 & 0.74 & 61.28
\end{bmatrix}
\oplus
[49.30,\ 46.56,\ 4.14]
= C
\begin{bmatrix}
0.45\cdot 49.30 & 46.41\cdot 46.56 & 53.14\cdot 4.14\\
4.73\cdot 49.30 & 11.15\cdot 46.56 & 84.12\cdot 4.14\\
8.02\cdot 49.30 & 6.75\cdot 46.56 & 85.23\cdot 4.14\\
12.39\cdot 49.30 & 4.19\cdot 46.56 & 83.42\cdot 4.14\\
37.99\cdot 49.30 & 0.74\cdot 46.56 & 61.28\cdot 4.14
\end{bmatrix}
=
\begin{bmatrix}
0.93 & 89.92 & 9.15\\
\vdots & \vdots & \vdots\\
86.68 & 1.59 & 11.73
\end{bmatrix}.
\tag{20}
$$
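The centring steps above (closed geometric mean, its inverse, perturbation of every row) can be sketched as follows. The function names are illustrative and closure to 100 is assumed; results are computed at full floating-point precision, so they may differ in the last digit from figures obtained with the rounded intermediate values shown in the text.

```python
import math

X = [
    [0.45, 46.41, 53.14],
    [4.73, 11.15, 84.12],
    [8.02, 6.75, 85.23],
    [12.39, 4.19, 83.42],
    [37.99, 0.74, 61.28],
]

def closure(parts, total=100.0):
    s = sum(parts)
    return [total * p / s for p in parts]

def centre(data):
    # closed vector of column-wise geometric means
    n = len(data)
    geo_means = [math.exp(sum(math.log(v) for v in col) / n) for col in zip(*data)]
    return closure(geo_means)

def perturb(x, y):
    return closure([a * b for a, b in zip(x, y)])

def inverse(y):
    return closure([1.0 / b for b in y])

cen = centre(X)
centred = [perturb(row, inverse(cen)) for row in X]

print([round(v, 2) for v in cen])
print([round(v, 2) for v in inverse(cen)])
for row in centred:
    print([round(v, 2) for v in row])
print([round(v, 2) for v in centre(centred)])
```

The final line prints the centre of the centred data, which falls on the barycentre of the ternary diagram; this is a useful check that the centring has been applied correctly.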
Figure 2 shows a ternary diagram with initial data (empty circles) and their centre (filled diamond). After centring, the centre of the compositions has been shifted to the barycentre (filled triangle, denoted n) and centred data are visualized better (filled circles). In order to compare the centre of the data with the crude average of the data, the average has also been shown (empty diamond for original data and empty triangle for centred data). It is self-evident that the average does not represent the location of the data properly. The line through the centred data is a compositional straight line, as discussed in the previous section.

Fig. 2. Ternary diagram. Sample, empty circles; centre, filled diamond; average, empty diamond; centred sample, filled circles; barycentre, filled triangle; centred average, empty triangle.

Statistical analysis of coordinates and coefficients of random compositions

Log-ratio coordinates and coefficients of random compositions are real random variables, as they are free to range from −∞ to +∞, and thus it is possible to undertake multivariate statistical analysis using them. For alr coordinates and clr coefficients, available methods can be found mainly in Aitchison (1986), whereas for ilr coordinates standard methods can be used straightaway. However, the problem is how to express results in terms of compositions and not in terms of log-ratios. To achieve that, standard basic algebra dictates making a linear combination of the vectors of a basis with the coefficients, which in terms of Aitchison geometry means a power-perturbation equation. As an example, consider again the five samples of the previous section. First, compute the ilr coordinates of the sample,

$$
\mathrm{ilr}(\mathbf{X}) =
\begin{bmatrix}
\frac{1}{\sqrt{2}}\ln\frac{0.45}{46.41} & \frac{1}{\sqrt{6}}\ln\frac{0.45\cdot 46.41}{53.14^{2}}\\
\frac{1}{\sqrt{2}}\ln\frac{4.73}{11.15} & \frac{1}{\sqrt{6}}\ln\frac{4.73\cdot 11.15}{84.12^{2}}\\
\frac{1}{\sqrt{2}}\ln\frac{8.02}{6.75} & \frac{1}{\sqrt{6}}\ln\frac{8.02\cdot 6.75}{85.23^{2}}\\
\frac{1}{\sqrt{2}}\ln\frac{12.39}{4.19} & \frac{1}{\sqrt{6}}\ln\frac{12.39\cdot 4.19}{83.42^{2}}\\
\frac{1}{\sqrt{2}}\ln\frac{37.99}{0.74} & \frac{1}{\sqrt{6}}\ln\frac{37.99\cdot 0.74}{61.28^{2}}
\end{bmatrix}
=
\begin{bmatrix}
-3.273 & -2.000\\
-0.606 & -2.000\\
0.121 & -2.000\\
0.768 & -2.000\\
2.788 & -2.000
\end{bmatrix}.
\tag{21}
$$

In the same way the coordinates of the centre and its inverse are obtained,

$$
\mathrm{ilr}(\mathrm{cen}[\mathbf{X}]) = [-0.040,\ -2.000], \qquad \mathrm{ilr}(\ominus\,\mathrm{cen}[\mathbf{X}]) = [0.040,\ 2.000].
\tag{22}
$$

Centring is now equivalent to a translation in the coordinate plane by the inverse of the centre and the ilr coordinates of the centred sample are

$$
\begin{bmatrix}
-3.273 + 0.040 & -2.000 + 2.000\\
-0.606 + 0.040 & -2.000 + 2.000\\
0.121 + 0.040 & -2.000 + 2.000\\
0.768 + 0.040 & -2.000 + 2.000\\
2.788 + 0.040 & -2.000 + 2.000
\end{bmatrix}
=
\begin{bmatrix}
-3.233 & 0.000\\
-0.566 & 0.000\\
0.161 & 0.000\\
0.808 & 0.000\\
2.828 & 0.000
\end{bmatrix}.
\tag{23}
$$

Figure 3 shows the sample, the centred sample and their respective centres in the ilr-coordinate plane. An immediate observation is that the sample (empty circles) is placed on a straight line, a fact difficult to figure out from the ternary diagram (Fig. 2) without a lot of experience.

Fig. 3. Coordinates. Sample, empty circles; centre, filled diamond; centred sample, filled circles; origin of coordinates, filled triangle.

Once the straight line in the coordinate plane has been identified one may be interested in representing it in the ternary diagram passing through the centred sample. To reconstruct compositions from their coordinates a power-perturbation equation has to be used, which, for the first individual is

$$
\begin{aligned}
&\left[\exp\left(-\frac{3.233}{\sqrt{2}},\ +\frac{3.233}{\sqrt{2}},\ 0\right)\right] \oplus [\exp(0.000,\ 0.000,\ 0.000)]\\
&\quad = [\exp(-2.286,\ +2.286,\ 0.000)] \oplus [\exp(0.000,\ 0.000,\ 0.000)]\\
&\quad = [0.102,\ 9.832,\ 1.000] \oplus [1.000,\ 1.000,\ 1.000]\\
&\quad = C[0.102\cdot 1.000,\ 9.832\cdot 1.000,\ 1.000\cdot 1.000]\\
&\quad = [0.930,\ 89.923,\ 9.146].
\end{aligned}
\tag{24}
$$

This result can be compared with the matrix of the centred sample. As can be seen, the result in the first row is the same. This computation has to be performed with a sufficient number of points to obtain a reasonable representation of the compositional line passing through the centred observations in the ternary diagram, as shown in Figure 2. It is obviously easier to work with suitable software packages, such as those described in section two of this book. Note that, if the above operations are reproduced, results will match those presented here only if performed without rounding decimals. The numbers shown above are shortened; calculations were performed using high precision.

Conclusions

Since John Aitchison introduced the log-ratio method for the statistical analysis of compositional data in the early 1980s, a lot of progress has been made in this field of research. Nowadays there are different strategies available, all of them based on log-ratios of parts, and on the acknowledgement of the algebraic-geometric structure of the simplex, the sample space of compositional data. Using coordinates is, in general, straightforward and very powerful, and offers mathematically simple models for complex problems.

The authors thank the reviewers, Ch. Thomas and R. Tolosana-Delgado, for their thorough reading and suggestions, which greatly improved the paper. This research has received financial support from the Dirección General de Investigación of the Spanish Ministry for Science and Technology through the project BFM2003-05640/MATE.

References

AITCHISON, J. 1981. A new approach to null correlations of proportions. Mathematical Geology, 13 (2), 175-189.
AITCHISON, J. 1982. The statistical analysis of compositional data (with discussion). Journal of the Royal Statistical Society, Series B (Statistical Methodology), 44 (2), 139-177.
AITCHISON, J. 1983. Principal component analysis of compositional data. Biometrika, 70 (1), 57-65.
AITCHISON, J. 1984. The statistical analysis of geochemical compositions. Mathematical Geology, 16 (6), 531-564.
AITCHISON, J. 1986. The Statistical Analysis of Compositional Data. Monographs on Statistics and Applied Probability. Chapman & Hall Ltd, London. Reprinted (2003) with additional material by The Blackburn Press, Caldwell, NJ.
AITCHISON, J. & EGOZCUE, J. J. 2005. Compositional data analysis: where are we and where should we be heading? Mathematical Geology, 37 (7), 829-850.
AITCHISON, J. & GREENACRE, M. 2002. Biplots for compositional data. Journal of the Royal Statistical Society, Series C (Applied Statistics), 51 (4), 375-392.
BARCELÓ-VIDAL, C., MARTÍN-FERNÁNDEZ, J. A. & PAWLOWSKY-GLAHN, V. 2001. Mathematical foundations of compositional data analysis. In: G. Ross (ed.) Proceedings of IAMG'01 - The sixth annual conference of the International Association for Mathematical Geology. 20, Kansas Geological Survey, Lawrence, KS, (CD-ROM).
BILLHEIMER, D., GUTTORP, P. & FAGAN, W. 2001. Statistical interpretation of species composition. Journal of the American Statistical Association, 96 (456), 1205-1214.
BUCCIANTI, A. & PAWLOWSKY-GLAHN, V. 2005. New perspectives on water chemistry and compositional data analysis. Mathematical Geology, 37 (7), 703-727.
BUTLER, J. C. 1979. The effects of closure on the moments of a distribution. Mathematical Geology, 11 (1), 75-84.
CHAYES, F. 1960. On correlation between variables of constant sum. Journal of Geophysical Research, 65 (12), 4185-4193.
CHAYES, F. 1971. Ratio Correlation. University of Chicago Press, Chicago, IL.
CHAYES, F. 1983. Detecting nonrandom associations between proportions by tests of remaining-space variables. Mathematical Geology, 15 (1), 197-206.
DAVIS, J. C. 1986. Statistics and Data Analysis in Geology. Wiley, New York.
EGOZCUE, J. J. & PAWLOWSKY-GLAHN, V. 2005. Groups of parts and their balances in compositional data analysis. Mathematical Geology, 37 (7), 795-828.
EGOZCUE, J. J. & PAWLOWSKY-GLAHN, V. 2006. Simplicial geometry for compositional data. In: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 145-159.
EGOZCUE, J. J., PAWLOWSKY-GLAHN, V., MATEU-FIGUERAS, G. & BARCELÓ-VIDAL, C. 2003. Isometric logratio transformations for compositional data analysis. Mathematical Geology, 35 (3), 279-300.
KRUMBEIN, W. C. 1962. Open and closed number systems in stratigraphic mapping. American Association of Petroleum Geologists Bulletin, 46, 2229-2245.
MARTÍN-FERNÁNDEZ, J. A. & THIÓ-HENESTROSA, S. 2006. Rounded zeros: some practical aspects for compositional data. In: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 191-201.
PAWLOWSKY-GLAHN, V. & EGOZCUE, J. J. 2001. Geometric approach to statistical analysis on the simplex. Stochastic Environmental Research and Risk Assessment (SERRA), 15 (5), 384-398.
PEARSON, K. 1897. Mathematical contributions to the theory of evolution. On a form of spurious correlation which may arise when indices are used in the measurement of organs. Proceedings of the Royal Society of London, LX, 489-502.
ROCK, N. M. S. 1988. Numerical Geology. Lecture Notes in Earth Sciences, 18. Springer-Verlag, Berlin.
ROLLINSON, H. R. 1992. Another look at the constant sum problem in geochemistry. Mineralogical Magazine, 56 (385), 469-475.
ROLLINSON, H. R. 1995. Using geochemical data: Evaluation, presentation, interpretation. Longman Geochemistry Series, Longman Group Ltd, Essex.
SARMANOV, O. V. & VISTELIUS, A. B. 1959. On the correlation of percentage values. Doklady of the Academy of Sciences of the USSR - Earth Sciences Section, 126, 22-25.
VON EYNATTEN, H., PAWLOWSKY-GLAHN, V. & EGOZCUE, J. J. 2002. Understanding perturbation on the simplex: a simple method to better visualise and interpret compositional data in ternary diagrams. Mathematical Geology, 34 (3), 249-257.
Major-oxide compositional discrimination in Cenozoic volcanites of Hungary

L. Ó. KOVÁCS¹, G. P. KOVÁCS¹, J. A. MARTÍN-FERNÁNDEZ² & C. BARCELÓ-VIDAL²

¹Hungarian Geological Survey, H-1143 Budapest, Stefánia út 14, Hungary (e-mail: Lajos.O.Kovacs@mgsz.hu; Gabor.Kovacs@mgsz.hu)
²Departament Informàtica i Matemàtica Aplicada, Universitat de Girona, Campus Montilivi, Edifici P-IV, E-17071, Girona, Spain (e-mail: josepantoni.martin@udg.es, carles.barcelo@udg.es)

Abstract: In the present study multivariate statistical methods emerging from log-ratio methodology are applied to an extensive major element dataset of Cenozoic volcanic rocks of Hungary. For an easy comparison, several conventional data plots are also given. The revealed compositional geometry shows a good agreement with geological models based on geoscientific methods which do not fulfil a definite statistical approach. Subcompositional patterns within the two major groups (alkaline basalts and calc-alkaline rocks), and their good separation, known from earlier interpretations of stratigraphical, petrographical, geochemical and other data, are statistically confirmed. The linear compositional patterns disclosed here for these two rock series provide additional details helping to describe quantitatively and, hence, elucidate the nature of petrogenetic processes that affected the Cenozoic volcanites in Hungary.
From: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 11-23. 0305-8719/06/$15.00 © The Geological Society of London 2006.

Generation and evolution of magma, although there are a lot of hypothetical points in the relevant theories, are usually related to the tectonic evolution of the region concerned. The Cenozoic tectonic and/or magmatic history of the Carpatho-Pannonian region (CPR) is discussed and modelled in numerous papers (for a list see Ó.Kovács & Kovács 2001). Areal distribution of the Cenozoic volcanites (Fig. 1), together with their stratigraphy, petrography, petrology, age and geodynamic aspects, have been examined in a large number of studies since the beginning of geological exploration in the CPR, which includes the territory of Hungary. At the same time, hundreds of major element and, later, trace element and isotope analyses have been made of bulk rock and constituent minerals. Trace element and isotope data serve as the most up-to-date geochemical tool for solving genetic questions such as that of magma origin, age, evolution and relationship to tectonic processes. Major element analyses are traditionally used for calculation of normative composition and various petrochemical indices, as well as for making bivariate and triangular diagrams which, in turn, are useful in comparing and classifying particular samples, volcanic bodies or regions.

Fig. 1. Map of distribution of the Neogene-Quaternary volcanic formations in the Carpatho-Pannonian region (modified after Pécskay et al. 1995). Key: 1, outcrops of pre-Tertiary basement; 2, Alpine-Carpathian flysch belt; 3, Neogene-Quaternary sedimentary infill of the Pannonian Basin; 4, outcropping intermediate volcanic rocks; 5, buried intermediate volcanic rocks; 6, outcropping silicic volcanic rocks; 7, buried silicic volcanic rocks; 8, areas of basaltic volcanism; 9, state border of Hungary.

Statistical elaboration of compositional data from volcanites in Hungary began with pioneering works that dealt mainly with basic rock types (for references see Ó.Kovács & Kovács 2001). A large dataset embracing most of the Neogene and Quaternary formations was compiled and statistically processed by Póka (1985, 1988). Production of statistically significant, high quality analytical data of Cenozoic basalts was reported by Embey-Isztin et al. (1993). More sophisticated multivariate methods were applied by Kovács & Ó.Kovács (1990) and Ó.Kovács & Kovács (1994) for a statistical evaluation of major element data from the young alkali basalts of Hungary. Recently, several papers (Martín-Fernández et al. 2003b, 2004b, 2005) have delivered the first attempts to use log-ratio techniques. The present paper is a natural continuation of the preliminary work by Martín-Fernández et al. (2004b).

A detailed description of the database, containing over 3000 samples of Cenozoic volcanic rocks with major element chemistry, used in the present paper is given in Ó.Kovács & Kovács (2001). The data were collected from the available publications and technical reports. The set of major components analysed and used in different reports is not always standard, which is especially true for earlier sources. The basic set of components comprises the most widely used major oxides: SiO2, TiO2, Al2O3, Fe2O3, FeO, MnO, MgO, CaO, Na2O, K2O, P2O5, H2O+, H2O− and CO2. Sometimes, mainly in earlier analyses, for the Fe content total-iron was given, therefore total-iron, expressed as Fe2O3(total) or FeO(total) (they are convertible), has been calculated in each case, and is considered a basic component when Fe2+ and Fe3+ are not determined.

The goal of this paper is to illustrate the overall effectiveness and certain particular capabilities of the log-ratio methodology, which is introduced and explained in other chapters of this book, through analysing the given major element database. The statistical analysis proceeds by elucidating the properties of the methods rather than exposing details of (otherwise extensively studied) geological problems. The data are analysed and one can see how log-ratio techniques answer some geological questions, based on the statistical properties of the data. Then the results are compared with observations made using some basic traditional methods. Finally, both geological and methodological conclusions are drawn.

In recognizing/resolving many geological problems, individual observations with anomalous
values in one or several components are very informative. However, the statistical behaviour of a sample is often distorted or obscured by them. Different applications may require different conditions to be reliable. The database has been screened using the selection scheme proposed by Le Maitre (1984) and adjusted later by Le Bas et al. (1986), which suggests that we should consider as unbiased those compositions with H2O < 2%, CO2 < 0.5% and, as an additional condition in this case, loss-on-ignition < 3%. Nine hundred and fifty-nine samples meet those criteria and they are the ones submitted to statistical analysis below. Since some of these observations have null values and it is assumed that these values are rounded zeros, a multiplicative replacement (Martín-Fernández et al. 2003a) is applied to substitute each sample which contains null values by a new composition without zeros. Martín-Fernández et al. (2004a) presented a detailed analysis of the application of this replacement procedure to this dataset. For more information see Martín-Fernández & Thió-Henestrosa (2006) in section three of this volume.
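The idea behind a multiplicative replacement of rounded zeros can be sketched as follows. This is a hedged illustration rather than the procedure actually applied to the database: the function name, the replacement value `delta` and the example composition are hypothetical, a single `delta` is assumed for every zero part, and the input is assumed to be already closed to `total`. Each zero is replaced by `delta`, and the non-zero parts are shrunk by a common factor so that the composition still sums to `total` and the ratios between non-zero parts are preserved.

```python
def multiplicative_replacement(x, delta=0.01, total=100.0):
    # x is assumed to be closed to `total`; zeros are treated as rounded zeros
    n_zeros = sum(1 for p in x if p == 0)
    # common shrink factor applied to the non-zero parts
    shrink = 1.0 - n_zeros * delta / total
    return [delta if p == 0 else p * shrink for p in x]

# hypothetical four-part composition (in %) with one rounded zero
sample = [60.0, 39.5, 0.5, 0.0]
replaced = multiplicative_replacement(sample)
print(replaced)
print(sum(replaced))
```

Because all non-zero parts are multiplied by the same factor, every log-ratio between them is unchanged, which is the property that makes this kind of replacement compatible with log-ratio analysis.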
Exploratory analysis of the data

Log-ratio methodology
As in almost all instances when there is a large number of geochemical samples, a primary question relates to the 'natural' groups of samples based on some kind of similarity. In log-ratio methodology relative differences in the values of a given variable (component) are taken into account, as opposed to many traditional quantitative methods used in geology where absolute differences are considered. Exploratory searching for groups among the samples can be done with a number of methods, e.g. obtaining and visualizing a suitable projection of the multivariate data. Visualization is usually done with two- or three-dimensional data. There exists an almost infinite number of projections which reduce dimensionality and, obviously, a search is made for those carrying as many statistical features as possible (homogeneities, inhomogeneities, extremes, trends, correlations, etc.) of the multivariate data. The simplest projections that can be explored are the three-part subcompositions. Nevertheless, as a first step, it is advisable to apply some technique of dimensionality reduction in order to have an idea about which subcomposition could be more informative. The combined plot of components and samples (Fig. 2), called a compositional biplot (Aitchison & Greenacre 2002), can provide an essential aid, thanks to a number of favourable properties. A more detailed exploratory study based on the compositional biplot in combination with subcompositional analysis is presented in Daunis-i-Estadella et al. (2006) in section three of this book. The compositional biplot in Figure 2 explains 77% of the total variability, which can be considered reasonably high. In searching for those components that would be optimal candidates to obtain three-part subcompositions with the largest possible retained variability, it is found that the rays associated with parts MgO, P2O5, TiO2, K2O and SiO2 are the longest.
On the other hand, it can be observed that the direction along which the projections of the alkaline basalts and the calc-alkaline series are well separated (Fig. 2a) is the first axis of the biplot. Note that the rays associated with parts SiO2, K2O, Na2O and Al2O3 are the closest to the first axis (Fig. 2b). Thus, three-part subcompositions including some of these parts can help in analysing the subcompositional patterns of these two groups of rocks. In general, if three vertices of rays in the biplot lie approximately on a straight line, this might indicate one-dimensional variability between the components. The pattern of the samples of the corresponding subcomposition in the ternary diagram is linear whenever two of the three vertices are close together, and curved if the three vertices lie apart. In both cases they are one-dimensional patterns with respect to the geometry of the simplex, called 'compositional lines' (Aitchison et al. 2002). Observation of the biplot provides some hints about which three-part subcompositions could be useful for interpretation. The three-part subcomposition with the largest variability is [MgO; P2O5; K2O], and it explains 49% of the total variability (Figs 3a and 3b). Note that in Figure 3b the data are centred. The centring operation, which improves the view of the data in the ternary diagram, was first introduced by Martín-Fernández et al. (1999). In the non-centred representation (Fig. 3a) the pattern might look linear. Nevertheless, after centring (Fig. 3b) one can observe that the dispersion is large and the two groups do not appear well separated. If the requirement is, simultaneously, a reasonable separation of the two groups and a subcompositional linear pattern, then one can select ternary diagrams including SiO2 and K2O, or SiO2 and Na2O, whose vertices in the biplot (Fig. 2) are close together and close to the first axis, and a third component, e.g. MgO, whose vertex lies further apart. Let the plot of subcomposition [SiO2; MgO; Na2O] (Fig. 4a) be considered first. The points seem to form two clouds, in spite of the fact that, owing to the high relative values of SiO2, all points are very close to the SiO2 vertex, which always makes it more difficult to recognize any regularity within the data. As expected, patterns in the data are visualized better after the points are centred (Fig. 4b). In Figure 5a the two point clouds seen in Figure 4b also appear convincingly, although here they are more elongated.
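The centring operation mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: centring is taken here to mean perturbing every sample by the inverse of the closed geometric mean of the dataset, the four samples are invented values (not taken from the Hungarian database), and the function names are hypothetical.

```python
import numpy as np

def closure(x):
    """Rescale compositions so that the parts of each sample sum to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum(axis=-1, keepdims=True)

def geometric_centre(x):
    """Closed vector of column-wise geometric means."""
    return closure(np.exp(np.log(closure(x)).mean(axis=0)))

def centre(x):
    """Perturb each sample by the inverse of the geometric centre."""
    return closure(closure(x) / geometric_centre(x))

# Invented three-part subcomposition [SiO2; MgO; Na2O] in weight %
sub = np.array([
    [60.0, 2.0, 4.0],
    [48.0, 9.0, 3.0],
    [72.0, 0.5, 3.5],
    [55.0, 4.0, 3.8],
])

centred = centre(sub)
centre_after = geometric_centre(centred)   # should sit at the barycentre of the ternary diagram
print(np.round(closure(sub), 3))           # points crowded near the SiO2 vertex
print(np.round(centred, 3))                # same data after centring
print(np.round(centre_after, 3))
```

After centring, the geometric centre of the data coincides with the barycentre of the ternary diagram, which is why points that crowd near one vertex spread out and become easier to inspect.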
This allows a first numerical conclusion to be drawn: as a primary structure two groups are detected, and they are well separated from an exploratory point of view, which suggests carrying out some confirmatory analysis. Geologically this is clear: most researchers accept the classification according to which all, or at least the overwhelming majority, of the given rocks represent two rock suites, a calc-alkaline and an alkali-basaltic one. This demonstrates how classification tasks may be supported by subcomposition plots. Comparing Figures 4b and 5a provides further details of the group structure: although in Figure 5a the separation of the two main groups (calc-alkaline and alkali-basaltic) is slightly less pronounced, the inner structure of the calc-alkaline suite begins to emerge. Namely, there are
Fig. 2. Compositional biplot of the clr-transformed data. Dashed lines show the axes of the biplot. (a) Samples displayed. Circles denote calc-alkaline rocks, dots alkaline basalts. (b) Samples not displayed.
Fig. 3. Subcomposition [MgO; P2O5; K2O]. Circles denote calc-alkaline rocks, dots alkaline basalts. (a) Non-centred ternary diagram; (b) centred ternary diagram. Superscript c indicates centred parts.
Fig. 4. Subcomposition [SiO2; MgO; Na2O]. Circles denote calc-alkaline rocks, dots alkaline basalts. (a) Non-centred ternary diagram; (b) centred ternary diagram. Superscript c indicates centred parts.
Fig. 5. Centred ternary diagram of subcomposition: (a) [SiO2; MgO; K2O]; (b) [SiO2; MgO; Na2O + K2O]. Circles denote calc-alkaline rocks, dots alkaline basalts.
samples with elevated SiO2 contents (Fig. 4b, upper corner), samples with slightly higher SiO2 values and close to zero MgO values (Figs 4b and 5a, along the edge opposite the MgO vertex), and samples with low MgO and elevated K2O contents (Fig. 5a, towards the K2O vertex). These sample sets may partly overlap, but they should be considered as possibly separable sub-groups. Geologically, they are easily interpretable varieties of the calc-alkaline suite: those with elevated SiO2 and K2O are products of metasomatic alterations, whereas those having slightly higher SiO2 and forming a numerous set at the MgO-poor end of the elongated calc-alkaline patch are representatives of a silicic volcanism that is sometimes treated as a separate one. Figures 4b and 5a show a similar general structure, but the slight differences suggest considering a combination of the two. Even if nothing were known about the data, putting together K and Na is a frequent practice in petrochemical analyses, based on general knowledge about the similar behaviour and rather common mutual replacement of the two elements. Hence, for each sample the sum of K2O and Na2O is calculated, and a similar ternary diagram is produced with these values (Fig. 5b). Figure 5b shows a somewhat clearer two-class structure and, hence, emphasizes the elongated character of the point clouds. Recalling also what is said about the properties of these ternary plots (see above), elongated (straight or curved) point clouds suggest the presence of compositional trends, which, incidentally, occur very often in volcano-magmatic systems. Here, too, the observable numerical properties of these two rock suites strongly suggest that their compositions may be considered as different states of two compositional paths or trends. In Martín-Fernández et al. (2005) an extended analysis of these trends is presented.
Other exploratory tools
Most models, conclusions, interpretations, etc. based on major element data in petrochemistry use percentage data provided directly by chemical analyses. Although from the point of view of log-ratio theory the statistical foundation of these treatments is not always clearly justified, the great number of geological case studies and cross-references ensures a high degree of consistency in the most often used schemes and tools of data analysis and evaluation. This section applies several widely used methods of visualization and/or interpretation of major element data. Obviously, there is a large number of similar methods or tools; the ones applied here provide results that may be interesting to compare with those gained with log-ratio techniques in the previous section. In other words, from a certain point of view the data evaluation given below is also a subcompositional analysis. Although the petrochemical problems and the analytical data are typically multivariate, univariate statistics may provide important primary hints about the group structure of the studied samples. For large numbers of samples, the frequency distribution of SiO2 is customarily used to reveal the most fundamental groups of rocks. The SiO2 histogram of the given samples (Fig. 6a) strongly suggests the presence of three rock assemblages
Fig. 6. Empirical frequency distributions in the studied rocks: (a) for SiO2 (weight %); (b) for the sum SiO2 + Al2O3 (weight %).
divided by pronounced silica gaps. These groups can be readily identified as a low-SiO2 (basaltic), an intermediate (mainly andesitic), and a high-SiO2 (rhyolitic) one, a pattern well known from many complex igneous provinces world-wide. Concerning the studied region, the basaltic group statistically corresponds to the alkali-basaltic suite, whereas the andesitic and rhyolitic groups represent the calc-alkaline series. Although the construction of any histogram is not unique, histograms of the other components, or of reasonable combinations of them, may also be informative. For example, on the histogram of SiO2 + Al2O3 given in Figure 6b, this three-peak pattern is seen even more clearly. One of the most widely used bivariate variation diagrams in volcanite petrochemistry is the TAS (total alkali-silica) plot (Le Bas et al. 1986). The main purpose of this plot is rock classification that results in assigning a rock name to any sample. While technically this is a simple procedure, a large amount of statistically supported professional experience is incorporated in the plot. Once a name is given to a sample, it begins to carry numerous probable chemical, petrographical, mineralogical, genetic, etc. properties. The TAS diagram of the Cenozoic volcanites of Hungary (Fig. 7) shows a large compositional variability. In this plot, three rock clusters can be distinguished again: a trachybasaltic-basanitic, an andesitic (including basalto-andesites and minor dacites), and a rhyolitic group, in agreement with the SiO2-based labelling (recall Fig. 6a). The TAS diagram is also commonly used to discriminate between the alkaline and subalkaline rock series. In the Hungarian volcanites both series are present: the silica-oversaturated (andesitic and rhyolitic) formations belong to the subalkaline series, whilst the silica-saturated and undersaturated (basaltic) rocks represent the alkaline series. Just as in log-ratio methodology (see previous section), multivariate petrochemical data are often reduced to three dimensions and visualized on ternary diagrams. In relation to various problems, different ternary diagrams are used with success. Compositional changes, especially petrologically interpretable compositional trend(s), within a volcanic suite are frequently investigated in the AFM (alkalis-total Fe-Mg) plot. Although it is mostly applied to discriminate between calc-alkaline and tholeiitic series within a subalkaline
Fig. 7. TAS diagram, as suggested by Le Bas et al. (1986), of the studied rock samples. Horizontal axis: SiO2, weight %. Field abbreviations: bta = basaltic trachyandesite; tb = trachybasalt. Samples satisfy: H2O < 2%, CO2 < 0.5% and loss-on-ignition < 3%.
assemblage, it can also indicate the actual stage of differentiation for individual samples, or the differentiation/evolution path for a rock suite. In Figure 8a, the studied subalkaline volcanites show an elongated pattern, which can sometimes be interpreted as a scattered trend (Rollinson 1993), but here it gives the impression that several different, partly superimposed processes are responsible for the compositional variability of these rocks. No point clusters separated by clear gap(s) can be observed, except, potentially, for the high-alkali tail. The cloud of points straddles the left part of the boundary line set between the calc-alkaline field and the tholeiitic field by Irvine & Baragar (1971), similarly to the observations of Szabó et al. (1992) related to other Cenozoic volcanic areas of the CPR. Obviously, one would like to see all the given (both the calc-alkaline and alkali-basaltic) samples at the same time in this diagram (Fig. 8b). Note that despite their significant geological and geochemical differences, the two rock series overlap considerably in this classical ternary diagram.
Log-ratio linear discriminant analysis
The good separation of the alkaline basalts and the calc-alkaline series seen in the biplot (Fig. 2a) is confirmed numerically by a linear discriminant analysis (LDA) of the two groups. LDA, applied to the clr-transformed (clr = centred log-ratio) data, leads to a classification where only 4% of the observations are classified incorrectly (misclassification rate) by the linear discriminant function (LDF). Properties of this confirmatory analysis have been discussed in Barceló-Vidal et al. (1999). One of the most important questions analysed concerns which log-ratio transformation should be applied to the data. Fortunately, based on the properties of the suggested log-ratio transformations, three fully equivalent strategies can be applied.
• Alr-transform the data and then apply LDA to the alr-transformed dataset. The properties of this transformation ensure that selection of the denominator in the alr-transformation has no influence on the results.
• First, apply the clr-transformation to the data. Second, eliminate one clr-variable (clr-component) of the produced dataset. Finally, apply LDA. The choice of the clr-variable to be eliminated is of no importance because this selection has no influence on the results (Barceló-Vidal et al. 1999).
Fig. 8. AFM plot (vertices: total Fe2O3, Na2O + K2O, MgO): (a) of the studied calc-alkaline rocks, with the boundary line between the calc-alkaline and tholeiitic fields, as defined by Irvine & Baragar (1971); (b) of the studied calc-alkaline rocks (circles) and alkaline basalts (dots).
• Apply the ilr-transformation to the dataset, then apply LDA to the transformed dataset. In the ilr-transformation step an orthonormal basis must be selected. The results of LDA are invariant to this basis selection.
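The equivalence of the three strategies can be checked numerically. The sketch below is only an illustration under stated assumptions: the two groups are synthetic log-normal compositions, the discriminant rule is a plain two-group Fisher LDA with pooled covariance and equal priors evaluated by resubstitution (the source does not specify these details), the ilr basis is one Helmert-type choice among many, and all function names are hypothetical.

```python
import numpy as np

def closure(x):
    x = np.asarray(x, dtype=float)
    return x / x.sum(axis=1, keepdims=True)

def alr(x, denom=-1):
    """Additive log-ratio with part `denom` as the common denominator."""
    logx = np.log(x)
    return np.delete(logx - logx[:, [denom]], denom, axis=1)

def clr(x):
    logx = np.log(x)
    return logx - logx.mean(axis=1, keepdims=True)

def ilr(x):
    """clr data expressed in an orthonormal (Helmert-type) basis of the clr plane."""
    d = x.shape[1]
    h = np.zeros((d, d - 1))
    for j in range(1, d):
        h[:j, j - 1] = 1.0 / j
        h[j, j - 1] = -1.0
        h[:, j - 1] *= np.sqrt(j / (j + 1.0))   # normalize the column to unit length
    return clr(x) @ h

def lda_predict(z, labels):
    """Two-group LDA (pooled covariance, equal priors); resubstitution labels."""
    z0, z1 = z[labels == 0], z[labels == 1]
    m0, m1 = z0.mean(axis=0), z1.mean(axis=0)
    pooled = ((len(z0) - 1) * np.cov(z0, rowvar=False)
              + (len(z1) - 1) * np.cov(z1, rowvar=False)) / (len(z) - 2)
    w = np.linalg.solve(pooled, m1 - m0)        # coefficients of the discriminant function
    return (z @ w > w @ (m0 + m1) / 2.0).astype(int)

# Two synthetic groups of four-part compositions (stand-ins for the two rock suites)
rng = np.random.default_rng(42)
n = 60
g0 = closure(np.exp(rng.normal([0.0, 1.0, 2.0, 0.5], 0.4, size=(n, 4))))
g1 = closure(np.exp(rng.normal([0.8, 0.6, 2.0, 1.2], 0.4, size=(n, 4))))
x = np.vstack([g0, g1])
labels = np.repeat([0, 1], n)

pred_alr_last = lda_predict(alr(x, denom=-1), labels)
pred_alr_first = lda_predict(alr(x, denom=0), labels)
pred_clr_drop = lda_predict(np.delete(clr(x), 2, axis=1), labels)
pred_ilr = lda_predict(ilr(x), labels)

error_rate = np.mean(pred_alr_last != labels)
print("identical assignments:",
      np.array_equal(pred_alr_last, pred_alr_first),
      np.array_equal(pred_alr_last, pred_clr_drop),
      np.array_equal(pred_alr_last, pred_ilr))
print(f"resubstitution misclassification rate: {error_rate:.1%}")
```

The assignments agree because the three sets of coordinates are related by invertible linear maps, and the linear discriminant scores are unchanged by such maps.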
Using the clr-transformation, log-ratio linear discriminant analysis confirms the descriptive patterns detected in the previous sections. In particular, in Figure 3b a large dispersion in the data is observed and the two groups appear poorly separated. The calculated LDF misclassification rate is 16.5%. In Figure 5a the two groups appear better differentiated and, indeed, here the LDF misclassification rate is 7.1%. When the amalgamated subcomposition [SiO2; MgO; Na2O + K2O] (Fig. 5b) is considered, in order to allow comparison of this analysis with the results from traditional methods (e.g. the TAS diagram, Fig. 7), the separation of the two groups is the best (the LDF misclassification rate is 3.9%).
This paper selects only a few methods from the continuously increasing set of techniques of log-ratio methodology. In the actual data analysis, serving as a background to the observations, different varieties of these methods, several others and numerous subcompositions have been employed. For simplicity, this paper presents only one possible path of analysis, but it is important to note that with log-ratio methods one can investigate and give an account of many different aspects of similar petrochemical datasets.
Conclusions
A statistical analysis of major element data from Cenozoic volcanites of Hungary based on Aitchison geometry revealed two tight subcompositional trends not obvious on traditional variation diagrams. These trends were numerically confirmed by log-ratio linear discriminant analysis. The existence of well-defined subcompositional patterns is consistent with the general understanding of magma evolution. In other words, compositional geometry shows once again a good agreement with geological models based on scientific methods which do not include a strict statistical approach.
So far, few geoscientific, and particularly few petrochemical, case studies can be found in the literature that make extensive use of rigorous statistical methods based on the log-ratio, or Aitchison, geometry. The latter builds on the assumption that relative differences are more significant for percentage data than absolute differences and, at the same time, avoids spurious interpretations that may well occur when one applies traditional statistical tools to percentage data. An important precedent is given by Grunsky et al. (1992), who use this log-ratio methodology to provide an initial assessment of the chemical variation of volcanic rocks of the Superior Province of Ontario (Canada) and to perform canonical variate analysis on three reference magma clans and twelve rock types. Another is given by Rollinson (1993), who considers the log-ratio approach as the basic strategy for analysing geochemical data in general.
The suggested geological interpretation is a very general understanding of certain statistical features of the data. Although it is consistent with prevailing ideas about the geological history of the region, it could certainly be refined. For this, similar geological applications suitable for comparison would be of use and, in order to gain important details of the evolution of the magmas involved, appropriately selected subsets of the samples could be analysed statistically.
To sum up, two straightforward subcompositional analyses were performed in this study, one with the relatively new log-ratio methodology, the other using mathematically less intricate but traditionally trusted tools, thought to be informative in petrochemistry. Contrasting the two sets of techniques is somewhat arbitrary because they represent different approaches and both might have their own place. It is suggested that, owing to its demonstrated capabilities, significant versatility and mathematical consistency, log-ratio analysis should be considered a new alternative in the quantitative examination of major element data.
This work has received financial support from the Dirección General de Investigación of the Spanish Ministry for Science and Technology through the project BFM2003-05640/MATE, from the National Research Fund, Hungary (Project T037581), and from the Agency for Research Fund Management and Research Exploitation, Hungary (Project E-18/2005).
References
AITCHISON, J. & GREENACRE, M. 2002. Biplots of compositional data. Applied Statistics, 51, 375-392.
AITCHISON, J., BARCELÓ-VIDAL, C., EGOZCUE, J. J. & PAWLOWSKY-GLAHN, V. 2002. A concise guide for the algebraic-geometric structure of the simplex, the sample space for compositional data analysis. In: BURGER, H. & SKALA, W. (eds) Proceedings of IAMG'02, The 8th Annual Meeting of the International Association for Mathematical Geology: Berlin, Germany, 387-392.
BARCELÓ-VIDAL, C., MARTÍN-FERNÁNDEZ, J. A. & PAWLOWSKY-GLAHN, V. 1999. Comment on 'Singularity and Nonnormality in the Classification of Compositional Data' by G. C. Bohling et al. Mathematical Geology, 31 (5), 581-586.
DAUNIS-I-ESTADELLA, J., BARCELÓ-VIDAL, C. & BUCCIANTI, A. 2006. Exploratory compositional data analysis. In: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 161-174.
EMBEY-ISZTIN, A., DOBOSI, G., JAMES, D., DOWNES, H., POULTIDIS, CH. & SCHARBERT, H. G. 1993. A compilation of new major, trace element and isotope geochemical analyses of the young alkali basalts from the Pannonian Basin. Fragmenta Mineralogica Paleontologica, 16, 5-26.
GRUNSKY, E. C., EASTON, R. M., THURSTON, P. C. & JENSEN, L. S. 1992. Characterization and statistical classification of Archean volcanic rocks of the Superior Province using major element geochemistry. Geology of Ontario, Ontario Geological Survey, 4 (2), 1397-1438.
IRVINE, T. N. & BARAGAR, W. R. A. 1971. A guide to the chemical classification of the common volcanic rocks. Canadian Journal of Earth Sciences, 8, 523-548.
KOVÁCS, P. G. & Ó.KOVÁCS, L. 1990. A dunántúli fiatal alkálibazaltok kőzetkémiai adatainak vizsgálata sokváltozós matematikai módszerekkel [Mathematical evaluation of petrochemical data of alkali basalts in Transdanubia, West Hungary]. MÁFI Évi Jel., 1988/1, 255-265 [in Hungarian].
LE BAS, M. J., LE MAITRE, R. W., STRECKEISEN, A. & ZANETTIN, B. 1986. A chemical classification of volcanic rocks based on the total alkali-silica diagram. Journal of Petrology, 27 (3), 745-750.
LE MAITRE, R. W. 1984. A proposal by the IUGS Subcommission on the Systematics of Igneous Rocks for a chemical classification of volcanic rocks based on the total alkali silica (TAS) diagram. Australian Journal of Earth Sciences, 31, 243-255.
MARTÍN-FERNÁNDEZ, J. A. & THIÓ-HENESTROSA, S. 2006. Rounded zeros: some practical aspects for compositional data. In: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 191-201.
MARTÍN-FERNÁNDEZ, J. A., BREN, M., BARCELÓ-VIDAL, C. & PAWLOWSKY-GLAHN, V. 1999. A measure of difference for compositional data based on measures of divergence. In: LIPPARD, S. J., NAESS, A. & SINDING-LARSEN, R. (eds) Proceedings of IAMG'99, The Fifth Annual Conference of the International Association for Mathematical Geology, Trondheim, Norway, 1, 211-216.
MARTÍN-FERNÁNDEZ, J. A., BARCELÓ-VIDAL, C. & PAWLOWSKY-GLAHN, V. 2003a. Dealing with zeros and missing values in compositional data sets. Mathematical Geology, 35 (3), 253-278.
MARTÍN-FERNÁNDEZ, J. A., BARCELÓ-VIDAL, C., PAWLOWSKY-GLAHN, V., Ó.KOVÁCS, L. & KOVÁCS, G. P. 2003b. Major-element trends in Cenozoic volcanites of Hungary. In: THIÓ-HENESTROSA, S. & MARTÍN-FERNÁNDEZ, J. A. (eds) Proceedings of CODAWORK'03, The First Compositional Data Analysis Workshop, October 15-17, University of Girona, Girona, Spain. CD-ROM.
MARTÍN-FERNÁNDEZ, J. A., Ó.KOVÁCS, L., KOVÁCS, G. P. & PAWLOWSKY-GLAHN, V. 2004a. The treatment of zeros in compositional data analysis: the database of Cenozoic volcanites of Hungary. 32nd International Geological Congress, Florence (I), Abstracts Volume, 1 (abs. 41-12), 213.
MARTÍN-FERNÁNDEZ, J. A., PAWLOWSKY-GLAHN, V., Ó.KOVÁCS, L. & KOVÁCS, G. P. 2004b. Subcompositional exploration in the database of Cenozoic volcanites of Hungary. 32nd International Geological Congress, Florence (I), Abstracts Volume, 1 (abs. 41-16), 214.
MARTÍN-FERNÁNDEZ, J. A., BARCELÓ-VIDAL, C., PAWLOWSKY-GLAHN, V., Ó.KOVÁCS, L. & KOVÁCS, G. P. 2005. Subcompositional patterns in Cenozoic volcanic rocks of Hungary. Mathematical Geology, 37, 729-752.
Ó.KOVÁCS, L. & KOVÁCS, G. P. 1994. Petrochemical features of young alkali basalts in Hungary. International Association for Mathematical Geology Annual Meeting, Mont Tremblant, Québec, Canada, 210-215.
Ó.KOVÁCS, L. & KOVÁCS, G. P. 2001. Petrochemical database of the Cenozoic volcanites in Hungary: structure and statistics. Acta Geologica Hungarica, 44 (4), 381-417.
PÉCSKAY, Z., LEXA, J., SZAKÁCS, A. ET AL. 1995. Space and time distribution of Neogene-Quaternary volcanism in the Carpatho-Pannonian Region. Acta Vulcanologica, 7, 15-28.
PÓKA, T. 1985. Changes in petrochemical composition of the Miocene and Quaternary volcanism and the basin formation. Abstracts of VIIIth RCMNS Congress, Budapest, 472-473.
PÓKA, T. 1988. Neogene and Quaternary volcanism of the Carpathian-Pannonian Region: Changes in chemical composition and its relationship to basin formation. In: ROYDEN, L. H. & HORVÁTH, F. (eds) The Pannonian Basin. A study in basin evolution. American Association of Petroleum Geologists Memoir, 45, 257-277.
ROLLINSON, H. 1993. Using Geochemical Data: Evaluation, Presentation, Interpretation. Prentice Hall, Harlow.
SALTERS, V. J. M., HART, S. R. & PANTÓ, GY. 1988. Origin of Late Cenozoic volcanic rocks of the Carpathian arc, Hungary. In: ROYDEN, L. H. & HORVÁTH, F. (eds) The Pannonian Basin. A study in basin evolution. American Association of Petroleum Geologists Memoir, 45, 279-292.
SZABÓ, CS., HARANGI, SZ. & CSONTOS, L. 1992. Review of Neogene and Quaternary volcanism of the Carpathian-Pannonian region. Tectonophysics, 208, 243-256.
Log-ratios and geochemical discrimination of Scottish Dalradian limestones: a case study
C. W. THOMAS1 & J. AITCHISON2
1British Geological Survey, West Mains Road, Edinburgh EH9 3LA, UK (e-mail: cwt@bgs.ac.uk)
2Department of Statistics, University of Glasgow, Glasgow G12 8QQ, UK
Abstract: Geochemical data are used widely to help correlate lithostratigraphical sequences, particularly where they are unfossiliferous and/or affected by metamorphism and deformation. In this study, geochemical data for variably impure metamorphosed limestones from the Neoproterozoic-Cambrian Dalradian Supergroup of Scotland have been used to aid lithostratigraphical discrimination and correlation in a region where apparently similar sequences crop out in widely separated regions affected by major deformation. The key problem in the statistical analysis of geochemical data is that the data are constrained to a constant sum and cannot, therefore, be analysed in raw form by conventional statistical techniques. To overcome this problem, the data have been transformed into log-ratios. Subsequent statistical analysis of the transformed data focuses on determining the simplest subcomposition that effectively discriminates between the limestones. The results show that the Fe2O3-MgO-CaO subcomposition is sufficient in this regard, with the additional benefit that it reflects the composition of the carbonate component in the limestones. Mann-Whitney tests of the log-ratio Fe2O3/MgO show that limestones from most different lithostratigraphical levels are statistically different from each other at high levels of significance. The results have clarified interpretations of the lithostratigraphy and subsequent interpretations of the tectonic history of the region.
This case study presents the results of a statistical analysis of whole-rock geochemical data for limestones from the Dalradian Supergroup of Scotland. The aim was to establish a simple and robust quantitative means of discriminating the limestones as an aid to lithostratigraphical correlation of the host Dalradian successions. A central objective was that the discrimination should be statistically testable using standard techniques, but in a way that overcomes the difficulties imposed by the constant-sum constraint inherent in compositional data. Because compositional data sum to a constant (typically 100%), they have a number of properties that hamper their interpretation when standard statistical techniques are applied to 'raw' percentage data. It was Karl Pearson (Pearson 1897) who first highlighted the problems faced by analysts of such multivariate, compositional datasets. Aitchison (1986) summarized the key problems most pertinent to geochemical compositional data, which are presented in the following sections.
Spurious correlation coefficients between raw components. There is bias towards more negative correlations, and correlations are not free to range from -1 to +1. This is because, as the proportion of one component increases, the proportions of some other components must decrease. The trivial case is the two-component system, where the correlation coefficient must be -1; clearly, things become more complex with increasing numbers of components. The mineralogical controls on geochemical data for rocks, etc., complicate matters further, since the variation of other components may be partly or entirely independent of the components of interest, yet still affect their covariation. In 'open' datasets, such problems do not arise, since the variables can vary independently without inducing covariation in other components. Thus, in compositional data, the correlation between two components is affected by the variation of other components in the compositional dataset, whether or not there is a genetic link. In any analysis, it is precisely the genetic links, if any, that are of interest.
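The closure effect described above is easy to reproduce with simulated numbers. In the following minimal Python sketch, three independent positive 'open' variables are generated (the log-normal parameters are arbitrary assumptions, not geochemical data) and then closed to unit sum; the correlations before and after closure are compared, together with the trivial two-part case.

```python
import numpy as np

rng = np.random.default_rng(1)

# Three independent 'open' variables; log-scale means and spread are arbitrary
open_data = rng.lognormal(mean=[3.0, 2.0, 1.0], sigma=0.3, size=(2000, 3))

# Closure: express every sample as proportions that sum to 1
closed = open_data / open_data.sum(axis=1, keepdims=True)

corr_open = np.corrcoef(open_data, rowvar=False)
corr_closed = np.corrcoef(closed, rowvar=False)

# Trivial case: a two-part composition, where one part is 1 minus the other
two_part = closed[:, :2] / closed[:, :2].sum(axis=1, keepdims=True)
r_two = np.corrcoef(two_part, rowvar=False)[0, 1]

print("open data, r(1,2):   %.3f" % corr_open[0, 1])
print("closed data, r(1,2): %.3f" % corr_closed[0, 1])
print("two-part system, r:  %.3f" % r_two)
```

Although the open variables are generated independently, the first two closed parts are strongly negatively correlated, purely as a consequence of the constant-sum constraint.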
Lack of subcompositional coherence. There is a lack of coherence between the product moments (variance, covariance, etc.) of components in the full composition and those calculated between components in a selected subcomposition. The variance-covariance structure of a re-proportioned subcomposition of raw components will be radically different to that for the components in the full composition and there is no simple relationship between the two. This means that the covariances of
From: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 25-41. 0305-8719/06/$15.00
© The Geological Society of London 2006.
C.W. THOMAS & J. AITCHISON
'raw' components are effectively uninterpretable, especially in any strictly statistical sense. The statistical analysis of 'raw' components is thus fraught with seemingly intractable difficulties, and a statistical methodology applicable to such data is needed. The key issue here is the nature of the sample space. 'Open' data are free to vary from 0 to +∞ in positive real number space, and conventional statistical techniques are designed to work with such data. However, compositional data occupy a restricted space, known as the simplex, in which the components are free to vary only between 0 and 1; they are constrained by the unit sum. Standard statistical techniques are not designed to treat such data, so should not be used. Another approach is needed. In discussion of this problem, Aitchison (1986, p. 65) highlighted the key property of compositional data that underpins the new methodology: that 'ratios of components are the same within a subcomposition and the parent composition'. The study of compositional data should, therefore, be the study of relative magnitudes of components. Ratios are the natural way to study compositions and their analysis is made more tractable by being transformed to logarithmic form: log(xi/xj) (Aitchison 1986, p. 65). In many respects, of course, studying ratios of compositional components is nothing new - geochemists and geologists have long made use of ratios in their investigations of compositions, and T. H. Pearce's approach probably remains one of the most sophisticated (Pearce 1968, 1970, 1987). Such ratios are usually based on prior knowledge of the chemical relationships between elements in minerals. However, the value in using log-ratios is that they 'open up' the data, free analysis from the confines of the simplex and allow the plethora of well-established multivariate statistical techniques designed for unconstrained data to be applied to compositional data.
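The property quoted from Aitchison (1986), that ratios are unchanged when a subcomposition is formed, can be checked numerically. The following minimal sketch uses an invented limestone-like analysis (the oxide values are hypothetical and are not taken from the dataset discussed here); the helper names `close`, `subcomposition` and `log_ratio` are likewise illustrative.

```python
import math

def close(parts):
    """Rescale a dict of positive amounts so that the values sum to 1."""
    total = sum(parts.values())
    return {name: value / total for name, value in parts.items()}

def subcomposition(composition, names):
    """Select some parts and re-close them to unit sum."""
    return close({name: composition[name] for name in names})

def log_ratio(composition, numerator, denominator):
    """Natural log of the ratio of two parts."""
    return math.log(composition[numerator] / composition[denominator])

# Hypothetical whole-rock analysis (wt%), invented for illustration only.
full = close({"CaO": 52.0, "MgO": 1.2, "Fe2O3": 0.4, "SiO2": 4.0, "LOI": 42.4})
fmc = subcomposition(full, ["Fe2O3", "MgO", "CaO"])

# The raw proportion of MgO changes when the subcomposition is re-closed ...
print(full["MgO"], fmc["MgO"])
# ... but the log-ratio log(MgO/CaO) is identical in both.
print(log_ratio(full, "MgO", "CaO"), log_ratio(fmc, "MgO", "CaO"))
```

The raw proportions differ between the full composition and the subcomposition, which is why their variances and covariances are not coherent, whereas the log-ratios agree to within rounding error.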
The analysis in the rest of the paper is based on this premise and the examination of ratios in the search for a statistically rigorous and robust means of discriminating geochemically between various Dalradian limestones. In doing so, the aim is to demonstrate by example the efficacy of the log-ratio approach to compositional data analysis that has been developed in the last two decades, but which has been slow to have an impact on the treatment of compositional data in the geosciences.
Geology of the Dalradian Supergroup

Lithostratigraphy
The Dalradian Supergroup of Scotland and Ireland (Fig. 1) is a heterolithic sequence of metamorphosed and deformed siliciclastic, carbonate and
subordinate volcanic and volcaniclastic rocks. The rocks were deposited on continental crust on the eastern Laurentian margin during the Neoproterozoic through to at least the early Middle Cambrian (Harris et al. 1994; Stephenson & Gould 1995; Tanner 1995). Dalradian lithostratigraphy comprises four Groups, namely (oldest to youngest): Grampian, Appin, Argyll and Southern Highland. The Grampian Group is dominated by fluvial to marine siliciclastic rocks with very rare limestone units (Glover et al. 1995). The succeeding Appin Group is characterized by grossly cyclic sequences of arenites and quartzites, carbonates (chiefly limestones) and semipelitic rocks deposited in generally shallow-marine environments. This style of sedimentation continues into the lower part of the overlying Argyll Group, which is traditionally separated lithostratigraphically from the Appin Group by the glacigenic Port Askaig Formation (PAF). Middle to upper parts of the Argyll Group are dominated by coarse siliciclastic and semipelitic to pelitic rocks, with some limestones, deposited in rapidly subsiding, relatively deep-marine rift basins (Stephenson & Gould (1995) and references therein) related to the opening of the Iapetus Ocean. Volcaniclastic and intrusive mafic igneous rocks are characteristic of parts of the Argyll Group. Extensive intrusion of mafic sills into only partially lithified sediments occurred particularly in the SW Highlands (Graham 1976). The succeeding Southern Highland Group is characterized by coarse siliciclastic and pelitic units containing volcaniclastic horizons ('Green Beds'). Rare limestones include the graphitic Leny Limestone near the top of the Group (Tanner 1995), which contains an early Middle Cambrian Pagetides trilobite fauna (Pringle 1940; Cowie et al. 1972). Thus, Dalradian sedimentation continued well into the Cambrian, and may have continued on into the Ordovician until the onset of Grampian orogenesis at c. 480Ma. 
U-Pb zircon ages for mafic volcanic rocks in the uppermost Argyll Group date volcanism and sedimentation at the top of the Argyll Group at 595-600 Ma (Halliday et al. 1989; Dempster et al. 2002). Recent work suggests strongly that the maximum age of the Dalradian is unlikely to exceed c. 750 Ma and could be as young as c. 700-670 Ma (Thomas et al. 2004). Arc collision in the late Cambrian and Ordovician resulted in the Grampian (Taconic) Orogeny of Scotland and Ireland (Dewey & Mange 1999). Polyphase deformation was accompanied by greenschist to upper amphibolite facies metamorphism with local anatexis (Harte 1988; Stephenson & Gould 1995). The limestones discussed in this paper occur within the Appin Group, with the exception of the
LOG-RATIOS AND GEOCHEMICAL DISCRIMINATION
Fig. 1. Generalized geological map of the Dalradian outcrop in the Central and Northeast Grampian Highlands of Scotland, showing the distribution of the samples used in this study.
Ord Ban/Kincraig suite of limestones, which lie within the Grampian Group.
Lithostratigraphical correlation of Dalradian Supergroup successions in the Northeast and Central Grampian Highlands, using limestone geochemistry

Northeast Grampian Highlands. In the 1980s and early 1990s, the British Geological Survey (BGS) was engaged in systematic re-mapping of the Dalradian geology of the Northeast and Central Grampian Highlands of Scotland (Harris et al. 1994; Stephenson & Gould 1995) (Fig. 1). The key aim of this work was to establish the regional lithostratigraphy within areas of outcrop of the Dalradian Supergroup that had been relatively neglected, compared to the much better-known successions in the South Central and the Southwestern Grampian Highlands. One reason for the neglect is the often poor exposure, particularly in the Northeast Grampian Highlands. Because of the relatively poor exposure, a programme of geochemical analysis of the carbonate rocks (principally the limestones) was established to support the mapping by discriminating between the limestones geochemically, thereby providing
an additional framework for interpretation of the lithostratigraphy (Fig. 1; Region 1). The limestones are valuable in this respect because they are distinctive lithologies within the dominantly siliciclastic successions, sufficiently common to provide several lithostratigraphical 'keys' and because their outcrop is enhanced in many areas by pitting and quarrying. The results of the work in the Northeast Grampians demonstrated that the limestones had distinctive compositions at formation level and could be used effectively to assist in lithostratigraphical correlation (Thomas 1989). However, a key problem with this work was that the interpretations were subjective, relying on visual assessment of normalized, multi-element variation diagrams ('spider' diagrams) that were somewhat arbitrary and empirical. No levels of confidence, in the statistical sense, could be attached to these interpretations and the method in no way overcame the constraints of closure in the data.
Central Grampian Highlands. There have been similar problems with variable levels of exposure and difficulties in establishing lithostratigraphy in the Central Highlands, complicated by issues relating to putative basement-cover relationships (Piasecki & Tempereley 1988; Noble et al. 1996;
Highton et al. 1999; Smith et al. 1999). The lithostratigraphical affinities of the metasedimentary rocks in the Central Highlands have long been disputed, with the dominantly siliciclastic successions being correlated either with the Dalradian Supergroup, or with the Moine Supergroup north of the Great Glen. The problems are compounded by the lack of clearly definable and traceable marker horizons, and the presence of shear zones that have been considered to represent horizons of significant dislocation. The presence of a number of minor, but possibly key, carbonate rock units, suggested that limestone whole-rock geochemistry might help to provide additional control on the lithostratigraphy. The key carbonate rock-bearing successions include limestones in the Kinlochlaggan Syncline, southwest of Kinlochlaggan [NN537 897] - especially those associated with dark, pelitic rocks; thin limestones north of the River Findhorn WNW of Kyllachy House [NH786 259] and north of Dulnain Bridge [NH997 249] ('Grantown limestones') and limestones near Kincraig [NH831 056] and at Ord Ban [NH 891 085] (Fig. 1; Region 2). Also included are data for limestone strata from Strathfionan in the Schiehallion district (Fig. 1; Region 3) that belong to the Blair Atholl Subgroup (Treagus 2000); these are lithostratigraphically equivalent to those samples included in the Inchrory Limestone, discussed above (Table 1). The geochemistry of these limestones provides a useful check on the regional consistency in composition at this stratigraphical level. The limestones at Kincraig and Ord Ban are particularly important because they are interpreted to lie immediately above a putative unconformity at the boundary between basement and cover sequences (Piasecki & Van Breeman 1979; Smith et al. 1999). These limestones are thus deemed to be at or close to the local base of Grampian Group rocks in the two respective localities. 
To this end, the geochemical study of limestone units was continued in the Central Highlands, once again based on the use of 'spider' diagrams (Thomas 1995; Thomas et al. 1997). This work helped underpin the lithostratigraphy currently established for the Central Highlands (Thomas 1995; Thomas et al. 1997; Thomas & Aitchison 1998; British Geological Survey 2002), but, like the earlier work in the Northeast Grampian Highlands, the interpretations were essentially based on subjective interpretation of the variation diagrams. The lithostratigraphical correlation of the Central Grampian Highlands limestones is shown in Table 1. The correlations are based on the results of previous geochemical studies (Thomas 1995; Thomas et al. 1997), but significantly strengthened by the results of the work presented here.
Although the 'spider' diagram approach appeared to give geologically sensible results, there were key problems with the graphical analysis. These centred on the lack of objectivity and the absence of a means of testing statistically between compositions, within the constraints of the constant-sum problem. Problems are particularly acute for marginal compositions with ambiguous 'spider' patterns. For example, patterns for limestones from the Kincraig and Ord Ban localities indicated geochemical similarities and, therefore, possible correlation with limestones from the Kinlochlaggan area and Blair Atholl Subgroup limestones from the NE Grampian Highlands. Yet the interpreted lithostratigraphical position of the Kincraig and Ord Ban limestones makes such a correlation impossible without recourse to contrived and unreasonable structural and/or stratigraphical models. Without a more objective means of discrimination, it was unclear whether the limestones were just chemically similar, but unrelated, or whether there were clear differences that were not revealed by the 'spider' diagrams. A more rigorous and objective discrimination between the limestones has therefore been undertaken using the log-ratio approach advocated by Aitchison (1986). In doing so, the constraints imposed by closure in the data have been overcome, permitting objective identification and use of a subcomposition that provides for simple, but statistically rigorous, discrimination between limestones. The approach is detailed after discussion of the geochemical characteristics of the Dalradian limestones and their significance.
Limestone petrography and geochemistry

Samples and lithology
Whole-rock samples of limestone were collected from the lithostratigraphical units discussed above (Table 1) at localities across the Dalradian outcrop on the mainland of Scotland, shown in Figure 1 (Thomas 1989, 1995; Thomas & Aitchison 1998). The limestones generally vary from pale to very dark grey in colour and are commonly banded in appearance; some varieties are white. Grain size is variable, the purest limestones being the coarsest. Although a few are dolomitic, for most limestones, the low Mg content generally precludes the presence of dolomite at the observed metamorphic pressures and temperatures (Goldsmith & Newton 1969; Thomas 2000). Minute traces of dolomite found during electron microprobe analysis in a sample of the Torulian Limestone
are thought to reflect primary high-Mg calcite in the original sediment (Thomas 2000). Quartz, plagioclase feldspar (typically oligoclase-andesine) and white mica are the chief siliciclastic impurities, occurring in variable amounts (Thomas 2000); a number of limestones are graphitic (Thomas 1989). The lithological characteristics are summarized in Table 1.
Sample preparation and analysis

Whole-rock samples weighing 1-2 kg were collected from fresh outcrops, and veining and any weathered rock were removed. Samples were then cleaned and jaw-crushed, and homogeneous aliquots were ground in an agate mill. Aliquots were used for preparation of fused beads and pressed powder pellets for major and trace element analysis by standard XRF techniques, as discussed by Thomas (1989). Summary compositions for each of the limestone units are presented in Table 2.
Significance of geochemical signatures in Dalradian limestones

Comparison with published empirical limiting values for a range of chemical criteria used to assess the preservation of limestone isotope compositions shows that Dalradian limestones have not suffered significant post-diagenetic alteration, particularly due to metamorphism or metamorphic fluid infiltration (Thomas et al. 2004). The high Sr content of most Dalradian limestones (10²-10³ ppm) demonstrates that the primary carbonate mineralogy was chiefly aragonite (Bathurst 1975). Preservation of high Sr values in limestones which now only contain low-Mg calcite indicates that diagenesis was primarily within the marine environment, and that diagenetic fluids were dominated by coeval sea water. The limestones are unaffected by decarbonation reactions, as indicated by loss-on-ignition values consistent with virtually all Ca being sequestered in calcite. Low Mg and Al limit the formation of calc-silicates in all the limestones. At metamorphic temperatures (400-600 °C), calc-silicate phases consistent with limestone bulk compositions are stable only in the presence of very water-rich fluids (Spear 1993). The absence of calc-silicate phases indicates no significant infiltration of hydrous metamorphic fluid, consistent with the low permeability of calcite-rich matrices at most metamorphic temperatures, pressures and likely fluid compositions (Holness & Graham 1995). The compositions of the limestones thus reflect their geochemical character after diagenesis and can be interpreted accordingly.
Statistical analysis

Log-ratios and graphical exploratory data analysis

Before discussing the discriminatory statistical analysis of the limestone data, we digress briefly to make some general points concerning the initial examination of compositional data. The introduction to this paper discussed why log-ratios are the appropriate means of elucidating and analysing variation within compositional datasets. Many studies of compositional data begin (and commonly end!) with simple bivariate plots of data in the search for patterns of variability and relationships between components. Over the years, many specific diagrams have been developed for geochemical data acquired from various types of rocks - e.g. those delineating tectonic settings for suites of igneous rocks. Rollinson (1993) discussed the form and use of many of these. Use of most such diagrams effectively falls within the realm of exploratory data analysis (Tukey 1977; Velleman & Hoaglin 1981). It is argued that this is a vital step in getting to know one's data and may be all that is required in some studies. However, these diagrams, and the field boundaries that appear on them, are largely empirical. Furthermore, when large datasets are involved, the variability within the cloud of points is such that any clear patterns that may exist are obscured or confused. Many of these diagrams use raw variables but, for the reasons discussed above, such plots are likely to be of limited value. However, many others, especially those involving trace elements, use ratios. This is also true of multi-element variation or 'spider' diagrams used for rare earth and other elements, where individual components are normalized to values from some standard composition (e.g. chondrite or the North American Shale Composite). Furthermore, when the plot axes are scaled logarithmically, as they commonly are, the ratios on such diagrams are log-ratios.
Hence, these are log-ratio diagrams: from the current perspective, they fall within the compass of log-ratio analysis and therefore form an important part of the toolbox of techniques available to those studying compositions, especially at the exploratory stage. By way of a simple graphical example of the value of log-ratio transformation, the Fe2O3 and MgO data for four of the limestones in this study are plotted as raw data in Figure 2. Although there are hints of systematic patterns of variability, the data are very crowded where they are most numerous and converge on the origin; patterns of variability are obscure. In addition, because the raw data are
Fig. 2. A plot of MgO vs. Fe2O3 (wt%) for four of the limestones discussed in this study (Inchrory Limestone, Dufftown Limestone, Torulian Limestone and Pitlurg Limestone). Although there are hints of systematic patterns of variability, the data are crowded towards the origin, obscuring any patterns that might exist. Compare this with the same data plotted as log-ratios, using CaO as the denominator, in Figure 4. Any patterns that emerge on such simple bivariate plots of raw data cannot be analysed statistically because the data are constrained within the simplex (are 'closed'). See the text for detailed discussion.
closed, any apparent variation cannot be analysed by conventional statistical techniques applied to these data, as discussed in the introduction. Compare Figure 2 with Figure 4, where the same data are plotted as log-ratios, using CaO as the denominator. In the latter, the log-ratios provide clear evidence for significant compositional differences between the four limestones. The additional, fundamental advantage is that the log-ratio transformation removes the data from the constraints of the simplex, permitting, with due care, the application of standard statistical techniques. With 'spider' diagrams, the appeal is in the graphically revealed patterns of variability, especially in rare earth element diagrams, where variation resulting from the systematic change in chemical properties of the elements across the Lanthanides can be modelled with respect to element fractionation between minerals and melts, etc. However, statistical testing of differences between such patterns is difficult and, where components in the dataset are 'noisy', the necessarily subjective examination of the graphical patterns means that marginal compositions are often difficult to classify. This is the particular problem that faced Thomas (1989), as discussed above in the latter part
of the section on the geology of the Dalradian. The work leading to the following discussion arose from this problem. We show below how patterns of variability revealed or hinted at on log-ratio diagrams may be quantified, and how probabilities may be attached to hypotheses that we may wish to test about the variations within the data.
Determining discriminating subcompositions

Subcompositions are popular because they reduce the number of variables under consideration and can often be considered in terms of discrete petrological properties of rocks. Furthermore, full compositions are likely to contain components that are 'noisy' and ultimately contribute little useful information to the problem in hand; indeed they can often confuse matters. Using subcompositions in this case study is a reasonable approach because, as discussed above, the limestones comprise a carbonate component variably diluted by a siliciclastic component and hence are mixtures of chemically distinctive parts. The initial objective in this analysis is to examine
whether some particular subcomposition may be used to discriminate between the limestones at least as effectively as the full composition. In searching for a generally effective discriminating subcomposition, the first step determines the subcomposition which most effectively distinguishes between two limestones whose lithostratigraphical positions are well established and for which there are a good number of analyses. The two limestone units chosen for this are: (a) the lithostratigraphically equivalent limestone formations assigned to the Blair Atholl Subgroup in the Northeast Grampians, and grouped together here as the 'Inchrory Limestone Formation'; and (b) the Dufftown Limestone Member, which occurs in the lowermost part of the Ballachulish Subgroup (Table 1). The reader interested in more lithostratigraphical detail is referred to Stephenson & Gould (1995) and/or Harris et al. (1994). There are 160 analyses of limestones from the Inchrory Limestone Formation and 49 of limestones from the Dufftown Limestone Member. The problem of distinguishing between these two limestones is binomial, i.e. it is only necessary to distinguish between two types - Inchrory or Dufftown. For problems of this sort, binary logistic regression can be used with log-contrasts of the compositional components as the independent regressor (a log-contrast is a simple way of expressing a set of log-ratios in a linear form which is symmetric in the components). The zero sum of the coefficients guarantees that the log-contrast can be expressed as a linear combination of log-ratios. Thus, for a composition, $x$, consisting of $D$ parts, a log-contrast (lc) is defined as:

$$\mathrm{lc}(\alpha, x) = \alpha_1 \log x_1 + \cdots + \alpha_D \log x_D, \qquad (\alpha_1 + \cdots + \alpha_D = 0). \tag{1}$$

Note that the zero-sum constraint on the $\alpha$ coefficients allows the log-contrast to be expressed in terms of $D - 1$ log-ratios, for example as:

$$\mathrm{lc}(\alpha, x) = \alpha_1 \log(x_1/x_D) + \cdots + \alpha_{D-1} \log(x_{D-1}/x_D), \tag{2}$$

using the $D$th component as the denominator. Using log-contrasts in this way, for two 'types' (type 0 = Dufftown and type 1 = Inchrory), the binary logistic model is defined by:

$$\mathrm{pr}(t = 1 \mid x) = 1 - \mathrm{pr}(t = 0 \mid x) = \frac{\exp\{\alpha_0 + \mathrm{lc}(\alpha, x)\}}{1 + \exp\{\alpha_0 + \mathrm{lc}(\alpha, x)\}}. \tag{3}$$

That is, the model gives the probability (pr) that the type is 1 (Inchrory) or 0 (Dufftown), given the composition, $x$. Maximum likelihood estimation of the parameter $\alpha = [\alpha_0, \alpha_1, \ldots, \alpha_D]$ is straightforward. The beauty of this model is that the adequacy of a subcomposition of, say, components $[1, \ldots, C]$, can be tested readily, since this hypothesis can be expressed as

$$\alpha_{C+1} = \cdots = \alpha_D = 0.$$
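As a check on the algebra, the equivalence of the two forms of the log-contrast and the logistic probability of equation (3) can be evaluated directly. This is a minimal sketch only: the three-part composition and the coefficient values below are hypothetical, chosen so that the coefficients sum to zero, and are not estimates from the limestone data.

```python
import math

def log_contrast(alpha, x):
    """Equation (1): sum of alpha_i * log(x_i), with the alpha_i summing to zero."""
    if abs(sum(alpha)) > 1e-12:
        raise ValueError("log-contrast coefficients must sum to zero")
    return sum(a * math.log(xi) for a, xi in zip(alpha, x))

def log_contrast_from_ratios(alpha, x):
    """Equation (2): the same quantity from the D - 1 log-ratios against x_D."""
    return sum(a * math.log(xi / x[-1]) for a, xi in zip(alpha[:-1], x[:-1]))

def prob_type1(alpha0, alpha, x):
    """Equation (3): pr(t = 1 | x), written as 1 / (1 + exp(-eta))."""
    eta = alpha0 + log_contrast(alpha, x)
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical composition [Fe2O3, MgO, CaO] and coefficients (assumed values).
x = [0.01, 0.03, 0.96]
alpha = [1.5, -2.0, 0.5]   # sums to zero, as equation (1) requires
alpha0 = -1.0

print(log_contrast(alpha, x), log_contrast_from_ratios(alpha, x))
print(prob_type1(alpha0, alpha, x))

# Because the coefficients sum to zero, rescaling the composition
# (e.g. proportions to percentages) leaves the log-contrast unchanged.
print(log_contrast(alpha, [100.0 * v for v in x]))
```

The last line illustrates why the zero-sum constraint matters: the log-contrast depends only on the relative magnitudes of the parts, not on the units in which the composition is expressed.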
Testing for the best discriminating subcomposition is undertaken by setting up a lattice of hypotheses in which the discriminatory power of increasingly complex subcompositions is examined. Part of the full possible lattice for this example is shown in Figure 3, highlighting some geologically relevant subcompositions. The model at the top of the lattice retains the full 17-part composition as the explanatory variable; it is termed the 'maximum model' in the sense that it is based on the full geochemical data available for geochemical discrimination. At the bottom of the lattice, the hypothesis of 'no compositional effect' effectively says that the geochemical compositions provide no means of distinguishing between the Inchrory and Dufftown limestones. This effectively sets $\alpha_1 = \cdots = \alpha_D = 0$. At successive levels increasingly complex subcompositions are tested. Some of these are shown for a number of two- and three-part compositions, including the carbonate subcomposition [Fe, Mg, Ca], the siliciclastic subcomposition [Si, Al, Ti, K, Rb, V] and the major oxide and trace element subcompositions (Fig. 3). In such lattice testing, generalized likelihood ratio tests are used, always testing the hypothesis against the maximum model. Testing starts with the simplest hypothesis. A move up the lattice to the next level of complexity is only undertaken if there is strong evidence that all the hypotheses at the current lower, simpler level should be rejected. Testing proceeds thus until a hypothesis is found that cannot be rejected. In the event that several equally viable (statistically acceptable) hypotheses are found at a particular level, either the hypothesis with the smallest likelihood ratio is accepted, or one moves to the next level and accepts the next hypothesis that has a smaller likelihood ratio than those at the previous level.
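A single step of such lattice testing can be sketched as follows. This is an illustrative reconstruction rather than the software used in the study: it fits the binary logistic model by Newton-Raphson on synthetic, hypothetical four-part compositions in which the two types differ only in parts 0 and 1 relative to part 3, and it compares a three-part subcomposition and the 'no compositional effect' hypothesis against the maximum model. The function names, the seed, the sample sizes and the reference to an asymptotic chi-squared distribution are all assumptions made here.

```python
import math
import random

def solve(a, b):
    """Solve a linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def max_log_likelihood(rows, labels, iterations=25):
    """Fit a binary logistic model by Newton-Raphson; return the maximized log-likelihood."""
    design = [[1.0] + list(r) for r in rows]   # leading 1.0 carries the intercept alpha_0
    k = len(design[0])
    beta = [0.0] * k
    for _ in range(iterations):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for z, t in zip(design, labels):
            p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, z))))
            for i in range(k):
                grad[i] += (t - p) * z[i]
                for j in range(k):
                    hess[i][j] += p * (1.0 - p) * z[i] * z[j]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    loglik = 0.0
    for z, t in zip(design, labels):
        eta = sum(b * v for b, v in zip(beta, z))
        loglik += t * eta - math.log(1.0 + math.exp(eta))
    return loglik

def log_ratios(x, numerators, denominator):
    """Log-ratios of the chosen parts against a common denominator part."""
    return [math.log(x[i] / x[denominator]) for i in numerators]

rng = random.Random(1)  # arbitrary seed
compositions, labels = [], []
for t in (0, 1):
    for _ in range(200):
        # Hypothetical data: only log(x0/x3) and log(x1/x3) depend on the type t.
        z = [rng.gauss(1.0 * t, 1.0), rng.gauss(-1.0 * t, 1.0), rng.gauss(0.0, 1.0)]
        total = 1.0 + sum(math.exp(v) for v in z)
        compositions.append([math.exp(v) / total for v in z] + [1.0 / total])
        labels.append(t)

ll_full = max_log_likelihood([log_ratios(x, [0, 1, 2], 3) for x in compositions], labels)
ll_sub = max_log_likelihood([log_ratios(x, [0, 1], 3) for x in compositions], labels)
ll_null = max_log_likelihood([[] for _ in compositions], labels)

# Generalized likelihood ratio statistics, each against the maximum model.
lr_sub = 2.0 * (ll_full - ll_sub)     # subcomposition [0, 1, 3]: one coefficient set to zero
lr_null = 2.0 * (ll_full - ll_null)   # no compositional effect: three coefficients set to zero
print(lr_sub, lr_null)
```

If the statistics are referred to a chi-squared distribution (an assumption of this sketch), the 5% critical values are 3.84 for one degree of freedom and 7.81 for three. A hypothesis whose statistic falls well below its critical value cannot be rejected, whereas one with a very large statistic, as expected here for the 'no compositional effect' hypothesis, is rejected and testing moves up the lattice.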
The approach outlined here aims to determine the simplest hypothesis with fewest parameters that provides a satisfactory discriminating subcomposition. In this sense and with due regard to the problem in hand, simpler hypotheses with fewer parameters are preferred to more complex ones. It should be remembered that in the absence of any loss structure in a multiple hypothesis situation (whereby one can assign a 'loss' if the ith hypothesis is chosen when the jth hypothesis is true), all
Fig. 3. Partial graphical representation of the lattice of hypotheses used to test the effectiveness of various subcompositions in discriminating between two types of limestone: Inchrory and Dufftown. The maximum model is at the top of the lattice and includes all 17 components. The minimum model at the bottom implies that there is no difference between the two limestones. [Lattice levels, from top to bottom: full composition; three-part compositions, e.g. [Fe, Mg, Ca], [Al, Na, K], ..., [D-2, D-1, D]; two-part compositions, ..., [D-1, D]; no difference (essentially random variation).]
testing procedures are essentially ad hoc. Experience shows that this form of lattice testing usually leads to sensible inferences. With regard to later discussion on processes, an additional benefit of this approach is that it can help identify subcompositions which are geochemically or geologically significant with regard to processes that have operated in the formation of the limestones.
Results

The results of the above analysis show that three three-part subcompositions discriminate adequately between the Inchrory and Dufftown limestone samples: Fe2O3-MgO-CaO [FMC], Fe2O3-MgO-P2O5 and Fe2O3-MgO-Loss On Ignition (LOI; essentially a measure of CO2 in the limestones). Of note here is that all three contain Fe2O3 and MgO and that the [FMC] subcomposition effectively represents the composition of the carbonate component in these rocks. Furthermore, the [FMC] and the Fe2O3-MgO-LOI subcompositions are effectively the same. Thus, it is the [FMC] subcomposition that has been used as a discriminatory tool, as outlined below. In the analysis, just six of the 160 Inchrory limestone samples were misclassified as Dufftown limestone, and just two of the 49 Dufftown limestone samples were misclassified as Inchrory limestone. With regard to simple geochemical discrimination, the Fe2O3-MgO-P2O5 subcomposition could also be used. However, this subcomposition is less 'attractive' from a geological perspective, mainly because the mineralogical host for P2O5 in the limestones is unknown, making the results potentially more difficult to interpret. Using 'spider' diagrams, Thomas (1989) considered that compositional variation in Dalradian limestones resulted from variations in the composition of the siliciclastic component, typically represented by Si, Al, Ti, K, Rb, V and Zr. The analysis here revealed this not to be the case. In addition, the trace elements provide little that is of use in discriminating between these limestones.
Fig. 4. A plot of Fe2O3/CaO vs. MgO/CaO expressed as log-ratios for four key limestone units from the Northeast Grampian Highlands of Scotland (Pitlurg Formation limestones, Dufftown Limestone Member, Torulian Limestone and Inchrory Limestone Formation). Systematic patterns of variability are clearly revealed and the underlying data can be analysed by standard statistical techniques.
Discrimination of Dalradian limestones using the [FMC] subcomposition

The [FMC] subcomposition has been used as a basis for discrimination by calculating the log-ratios of Fe2O3/CaO and MgO/CaO for samples from the other Dalradian limestones discussed above. Initial discussion focuses on graphical interpretation of the log-ratios in bivariate plots. A non-parametric statistical technique has then been used with the log-ratios to examine correlations more quantitatively.

Discrimination of Appin Group limestones, Northeast Grampian Highlands. Log-ratios log(Fe2O3/CaO) and log(MgO/CaO) for the samples from four key Appin Group limestones from the Northeast Grampian Highlands are plotted in Figure 4. The clusters of points for each of the limestone units are generally very well differentiated on the plot. Notable is the clear separation of the more magnesian Dufftown and Pitlurg limestone samples from the less magnesian Inchrory limestones. The Dufftown and Pitlurg limestones lie on similar trajectories, but the latter is both more Fe- and more Mg-rich. The Torulian Limestone, one of several very pure, distinctive white limestones that occur around the top of the Appin Quartzite, is intermediate in composition, but shows a very wide range in its Fe content compared with the other limestones.

Discrimination of limestones from the Central Grampian Highlands. As discussed above, the lithostratigraphical correlation of limestones from the Central Grampian Highlands is important because, depending upon the correlation, radically different tectono- and lithostratigraphical models could be erected. Use of the [FMC] log-ratio is extended to these limestones without further testing, assuming that the [FMC] subcomposition is as effective at discriminating between these limestones as it is amongst limestones from the Northeast Grampian Highlands. The results are plotted in Figure 5. It is clear that many of the Central Grampian limestones (Findhorn, Grantown, Kyllachy) plot in the same areas as those from the northeast (Dufftown, Pitlurg). However, it is of interest that samples from the Kincraig and Ord Ban limestones cluster quite tightly and sit a little apart, as a group, from the Inchrory limestone samples. They have Inchrory-like Fe contents, but are marginally yet consistently more magnesian (compare Figs 4 and 5). Those from the Blair Atholl Dark Schist & Limestone Formation in the Schiehallion district match very closely those from the Inchrory Limestone Formation. This geochemical similarity is
C.W. THOMAS & J. AITCHISON
[Figure 5: bivariate plot with axes log(Fe2O3/CaO) and log(MgO/CaO). Symbols distinguish the Blair Atholl Dark Schist & Limestone Formation (Schiehallion), Kinlochlaggan limestones, Grantown A limestones, Grantown B limestones, Ord Ban & Kincraig limestones, Findhorn limestones and Kyllachy limestones.]
Fig. 5. A plot of Fe2O3/CaO vs. MgO/CaO expressed as log-ratios for limestones from the Central Grampian Highlands and limestones from the Blair Atholl Dark Schist & Limestone Formation in the Schiehallion district. The areas containing the majority of samples from the Northeast Grampian limestones shown in Figure 4 are outlined. The areas simply enclose the data by connecting the outermost points around the cloud, barring one or two extreme outliers. Technically, they are the convex hulls of each set of data.
consistent with their lithostratigraphical equivalence, despite an along-strike separation of some 70 km or more. The limestones from the Kinlochlaggan area also plot within the domain occupied by the Inchrory Limestone Formation samples, suggesting, on geochemical grounds at least, that they are lithostratigraphically equivalent.
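The outlined areas in Figure 5 are described as the convex hulls of each group of points. A minimal sketch of how such an outline could be computed is given below, using Andrew's monotone chain algorithm in plain Python. The choice of algorithm, the function name and the coordinates are assumptions made for illustration; the source does not say how the hulls were constructed, and the exclusion of extreme outliers mentioned in the caption is not attempted here.

```python
def convex_hull(points):
    """Return the convex hull vertices of 2-D points in counter-clockwise order.

    Andrew's monotone chain: sort the points, then build the lower and
    upper chains, discarding any point that would create a clockwise
    or collinear turn.
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of the cross product (a - o) x (b - o)
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


# Hypothetical (log(Fe2O3/CaO), log(MgO/CaO)) pairs for one limestone unit.
samples = [(-5.0, -4.0), (-4.0, -3.0), (-3.0, -4.0), (-4.0, -5.0), (-4.0, -4.0)]
hull = convex_hull(samples)
print(hull)
```

Only the outermost samples are returned; the interior point of the hypothetical group does not appear in the outline.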
Non-parametric statistical testing of the Fe/Mg log-ratio groupings. Although the plots of the [FMC] log-ratios in Figures 4 and 5 are visually very encouraging in discriminating various Dalradian limestones, the postulated correlations are still based on subjective interpretation of graphical plots. In order to examine these groupings of samples more objectively, the probability that various groups are drawn from the same population has been determined using non-parametric statistical testing. A non-parametric approach has been used because it is clear from the figures that the log-ratios for different sample groups have asymmetric and variable distributions. For example, the centre of density for the Inchrory limestone samples is skewed towards those samples with less Fe and Mg. Thus, any testing of the log-ratios using parameters based on symmetric distributions, such as the normal distribution, will violate basic assumptions about the population being tested. In addition, outliers will have undue influence on parametric measures of location and dispersion, especially in smaller sample groups. It therefore makes sense to use non-parametric techniques, which make few or no assumptions about the shape of any underlying distribution in the dataset and are more robust with regard to the effects of outliers (Siegel 1956).
Non-parametric tests are mainly univariate and based on rank-transformations of data. The bivariate log-ratio data presented in Figures 4 and 5 show that the variation between samples is a function of variation in Fe and Mg, since CaO is the common denominator. Thus, the single log-ratio, log(Fe2O3/MgO), which is simply the difference log(Fe2O3/CaO) - log(MgO/CaO), can be calculated for each sample population and used in statistical tests as the discriminant variable. Note here that transforming the problem under investigation to a univariate one is consistent with the search for the simplest discriminating model and involves no loss of information relative to the original multivariate dataset, given the questions the study set out to solve. Indeed, the analysis to this point has been to determine those components which provide the clearest discrimination. Thus, we can be confident that statistical testing of the Fe/Mg log-ratio is sufficient to place probabilities on geochemical correlations between groups of samples from the limestones. Note that this does not necessarily equate with lithostratigraphical correlation; the geochemical data and statistical results must still be interpreted in the round with all other available data before conclusions are reached regarding any lithostratigraphical assignments. The analysis of the geochemical data only deals with a balance of probabilities, not certainties.
Formally, the Mann-Whitney U-test has been used to examine the hypothesis that two independent groups of samples are random samples drawn from the same population. The null hypothesis (H0) is that two groups of samples have the same distribution, with the alternative (H1) that one group is stochastically larger or smaller than the other (the two-tailed form of the test). The test is undertaken as follows. The samples in the two groups are combined and ranked, the group from which each sample was drawn being identified in the ranking. The number of times samples from one group occur before samples from the other group is counted, the count forming the test statistic, U. This is compared to tables for an appropriate level of significance, permitting acceptance or rejection of H0.
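The counting procedure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions and not the authors' implementation: the function name and the log-ratio values are hypothetical, ties are counted as one half (a common convention that the text does not specify), and the comparison against tabulated critical values is left out.

```python
def mann_whitney_u(group_a, group_b):
    """Return (u_a, u_b) for two independent groups of values.

    u_a counts the (a, b) pairs in which the value from group_a lies
    before (is smaller than) the value from group_b; a tie counts 0.5.
    u_b is the complementary count, so u_a + u_b = len(a) * len(b).
    """
    u_a = 0.0
    for a in group_a:
        for b in group_b:
            if a < b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    u_b = len(group_a) * len(group_b) - u_a
    return u_a, u_b


# Hypothetical log(Fe2O3/MgO) values for two limestone units.
unit_x = [-1.9, -1.7, -1.6, -1.4, -1.2]
unit_y = [-0.9, -0.8, -0.6, -0.5, -1.5]

u_x, u_y = mann_whitney_u(unit_x, unit_y)
u = min(u_x, u_y)  # the smaller count is the one usually compared with tables
print(u_x, u_y, u)
```

A small value of U means that the two groups barely interleave in the combined ranking. For routine work a library routine such as `scipy.stats.mannwhitneyu` could be used instead of the explicit double loop.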
Results of the Mann-Whitney U-tests. The results of the Mann-Whitney tests are presented in Table 3. The Mann-Whitney test results for limestones from the Blair Atholl Subgroup in the Schiehallion district, those from the Inchrory Limestone and those from the Kinlochlaggan area show that, with regard to their Fe/Mg log-ratios, they are indistinguishable. This result corroborates the lithostratigraphical correlation of the Inchrory and Schiehallion limestones and strongly supports the interpretation that those from the Kinlochlaggan area are also lithostratigraphically equivalent. The corollary is that the geochemical composition of the limestones at this lithostratigraphical level in the Dalradian is consistent over a very wide area (Fig. 1). All other limestone groups have Fe/Mg log-ratios which are statistically significantly different (H0 rejected at α < 0.05) from the Inchrory limestones. With regard to the limestones from the Grantown and adjacent areas, and those from Ord Ban and Kincraig, the Mann-Whitney test results lead to important conclusions concerning the relationships amongst and between these groups of limestones from the Central Highlands and Dufftown and Pitlurg limestones from the Northeast Grampians.
1. There is no significant statistical difference between the Fe/Mg log-ratios from the Dufftown and Pitlurg samples and those from Grantown, Findhorn and Kyllachy. Thus, statistically, it is difficult to reject the idea that these groups of limestones are from uppermost Lochaber/lowermost Ballachulish Subgroup successions within the lower part of the Appin Group.
2. The groups of samples from Ord Ban and Kincraig have Fe/Mg log-ratios that are statistically very significantly different to all other limestones in the dataset. This is consistent with their position in Figure 5, noting again that they are set slightly apart from the majority of Inchrory limestone samples. The geochemical distinctiveness of these limestones is consistent with the tectono- and lithostratigraphical interpretations of Piasecki and others that these limestones lie within the Grampian Group and are not related to limestones and their host successions in the overlying Appin Group.
3. The Torulian Limestone has Fe/Mg log-ratios that are statistically indistinguishable from those of the Pitlurg Formation and some of the Grantown samples. Inspection of Figure 5 shows that the Torulian and Pitlurg samples lie in broadly the same area, and so correlate in the Mann-Whitney tests.
Lithostratigraphically, the Torulian Limestone sits some way above the Pitlurg Formation, separated by a succession of pelites, semipelites and a significant quartzite. Its position is very well constrained, so that the lithostratigraphical correlation of this limestone with those from the Pitlurg Formation is not possible from basic stratigraphical principles. However, the Pitlurg limestones lie at the top of the Lochaber Subgroup, whilst the Dufftown and Torulian limestones occur in the lowermost to middle parts of the succeeding Ballachulish Subgroup. The consistency in chemistry is therefore not incompatible with the relative proximity of these limestone units within the lower parts of the Appin Group. Indeed, the sedimentary record suggests that this part of the Dalradian is characterized by stable, cyclic marine sedimentation that would be consistent with limited tectonic activity and concomitant stability in sea-water chemistry. Thus, it may be that this geochemical equivalence marks such stability over a significant period of time.
Discussion
Application of log-ratios
Through the rational application of statistical methods applied to data transformed into log-ratio form, it has been shown that, ultimately, the Fe/Mg log-ratio is the most effective discriminant between the limestones in this study. The results are almost entirely free from ambiguity and the statistical test used to establish probabilities is appropriate and robust, given the nature of the data. There are several points to emphasize.
1. No form of systematic statistical analysis could have been undertaken without first transforming the data into log-ratio form because of the closure constraint and its consequences.
2. Resort to more complex multivariate techniques, such as principal components, etc., has not been made to achieve the aims.
3. An approach that defends simple models with few parameters has been adopted; that is, very strong evidence is required before a simple model is rejected in favour of a more complex one with more parameters. The approach has been naturally orientated towards the simplest solution that adequately explains the variation observed in the dataset (cf. the simplicity postulate of Jeffreys (1961)). To this end, it has been possible to establish systematically that just two components from the original complex, multivariate dataset have been sufficient for the purposes of the study.
The advantage of the approach outlined here over that used by Thomas (1989) is that it provides a rational and statistically sound framework within which to investigate questions about compositions. Such questions range from simple classification and discrimination, as discussed here, to the elucidation of processes which may have affected the limestones (Aitchison & Thomas 1998). In this sense the approach outlined here is generic. At its foundation is the transformation of raw constant-sum data into log-ratios free to range over -∞ < x < ∞, free from the strictures of the constant-sum constraint and the confines of the simplex. The use of ratios, even in logarithmic form, is not new; geochemists have plotted geochemical ratios on log graph paper for many years. What is novel to many (though now of at least 20 years vintage) is the understanding that log-ratios provide the means to overcome basic problems in compositional data analysis that have troubled many since geochemical data first became routinely available to geologists and petrologists (Pearson 1897; Chayes 1960; Pearce 1968; Butler 1979a, b, 1981; Aitchison 1986; Rollinson 1992). The link between graphical plots and statistical analysis of log-ratios should be emphasized here; together they provide a range of techniques for analysing compositions that overcome the constraints of the simplex sample space within which compositions are confined.
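The move from the simplex to unconstrained log-ratio space, and back again, can be illustrated with a short Python sketch. The text does not name a specific transformation at this point; the additive log-ratio (alr) transformation of Aitchison (1986) is assumed here, and the function names and oxide values are hypothetical.

```python
import math


def closure(parts, total=100.0):
    """Rescale positive parts so that they sum to `total` (constant-sum data)."""
    s = sum(parts)
    return [total * p / s for p in parts]


def alr(parts):
    """Additive log-ratio transform: log of each part over the last part.

    A D-part composition maps to D - 1 coordinates, each of which is
    free to take any real value.
    """
    denom = parts[-1]
    return [math.log(p / denom) for p in parts[:-1]]


def alr_inverse(coords, total=100.0):
    """Map unconstrained log-ratio coordinates back to a closed composition."""
    expanded = [math.exp(c) for c in coords] + [1.0]
    return closure(expanded, total)


# Hypothetical [Fe2O3, MgO, CaO] subcomposition, closed to 100.
composition = closure([0.4, 1.2, 52.0])
coords = alr(composition)        # two unconstrained log-ratios
recovered = alr_inverse(coords)  # back on the simplex

print(coords)
print(recovered)
```

Any pair of real numbers passed to `alr_inverse` yields positive parts with the prescribed sum, which is the sense in which the log-ratios are free of the constant-sum constraint while the compositions remain confined to the simplex.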
Geochemical significance of the [FMC] subcomposition
The results show that limestones from the same or closely similar lithostratigraphical positions have statistically similar Fe/Mg ratios and that these are regionally extensive. Furthermore, these components are effectively entirely sequestered in the carbonate component. The implication is that the factors that controlled the geochemistry of Fe and Mg in these limestones operated similarly over the region at a given time. It is most probable that the Fe and Mg contents were controlled by the extant sea-water chemistry when the original carbonate sediment was precipitated and that this has varied through time. The limestones from the older Lochaber and Ballachulish subgroups (Grantown, Dufftown, Pitlurg, Findhorn, Kyllachy) are all more magnesian, relative to their Fe contents for a given Ca content, than those from the younger Blair Atholl Subgroup (Inchrory, Kinlochlaggan and Schiehallion). The samples from the Torulian Limestone have Mg contents intermediate between these two groups, consistent with its lithostratigraphical position in the uppermost part of the Ballachulish Subgroup. These data show that the Mg content of coeval sea water, relative to the Ca and Fe contents, declined over the period from Lochaber to Blair Atholl times. Temporal variation in the Mg content of the limestones (and, by corollary, that of the sea water from which they were precipitated) is consistent with variations in the Sr and C isotope compositions of the limestones (Thomas 2000; Thomas et al. 2004).
Conclusions
Log-ratio transformation of raw compositional data for limestones from the Central and Grampian Highlands of Scotland has provided the foundation for a sound statistical treatment of the data, free from the effects of the constant-sum constraint. This transformation provided the basis for a systematic search for the simplest subcomposition that discriminated effectively between the various limestone units within the dataset. The results showed that the Fe2O3-MgO-CaO [FMC] subcomposition adequately discriminated between the limestones. Subsequent use of bivariate plots and the application of a simple univariate, non-parametric statistical technique to the [FMC] log-ratio data permitted straightforward elucidation of the various limestone groups, and their correlation, or otherwise, with each other.
Limestones considered on lithostratigraphical grounds to occur at the same stratigraphical level have essentially the same [FMC] compositions. Limestones at Ord Ban and Kincraig in the Central Scottish Highlands have been shown to have [FMC] compositions statistically significantly different to all other limestones in the dataset. This is consistent with their proposed lithostratigraphical position close to the local base of the Grampian Group where they crop out. Within the Appin Group, limestones from the Lochaber and Ballachulish subgroups are chemically similar or the same, consistent with their adjacent positions in the stratigraphical column. Limestones from the Blair Atholl Subgroup have identical [FMC] compositions. The [FMC] subcomposition essentially represents the chemistry of the carbonate component in Dalradian limestones. Geochemical factors in the marine environment controlled the proportions of these elements in the primary carbonate component of limestone, principally via sea-water chemistry. The results of the analysis show that there were temporal changes in the [FMC] composition, mirroring parallel changes known in Sr and C isotopes in these and other Neoproterozoic limestone-bearing sequences world-wide.
Dr G. Leslie and Dr M. Cave in BGS, and Prof. H. Rollinson and Dr R. Reyment are thanked for constructive reviews. The authors also wish to thank Prof. R. Howarth for constructive discussion of previous work which influenced elements of this study. CWT publishes with the permission of the Executive Director, British Geological Survey, NERC.
References
AITCHISON, J. 1986. The Statistical Analysis of Compositional Data. Chapman & Hall, London and New York.
AITCHISON, J. & THOMAS, C. W. 1998. Differential perturbation processes: a tool for the study of compositional processes. In: BUCCIANTI, A., NARDI, G. & POTENZA, R. (eds) Proceedings of the Fourth Annual Conference of the International Association for Mathematical Geology. International Association for Mathematical Geology, Ischia, 2, 499-504.
BATHURST, R. G. C. 1975. Carbonate Sediments and their Diagenesis. Elsevier, Amsterdam.
BRITISH GEOLOGICAL SURVEY. 2002. Dalwhinnie. Scotland Sheet 63E. Solid Geology. 1:50000 Series. Keyworth, Nottingham.
BUTLER, J. C. 1979a. The Effects of Closure on the Moments of a Distribution. Journal of the International Association for Mathematical Geology, 11, 75-84.
BUTLER, J. C. 1979b. Trends in ternary petrologic variation diagrams - fact or fantasy? American Mineralogist, 64, 1115-1121.
BUTLER, J. C. 1981. Effect of Various Transformations on the Analysis of Percentage Data. Journal of the International Association for Mathematical Geology, 13, 53-68.
CHAYES, F. 1960. On correlation between variables of constant sum. Journal of Geophysical Research, 65, 4185-4193.
COWIE, J. W., RUSHTON, A. W. A. & STUBBLEFIELD, C. J. 1972. A correlation of Cambrian rocks in the British Isles. Geological Society, London, Special Reports, 2.
DEMPSTER, T. J., ROGERS, G., TANNER, P. W. G. ET AL. 2002. Timing of deposition, orogenesis and glaciation within the Dalradian rocks of Scotland: constraints from U-Pb zircon ages. Journal of the Geological Society, London, 159, 83-94.
DEWEY, J. F. & MANGE, M. A. 1999. Petrography of Ordovician and Silurian sediments in the western Irish Caledonides: tracers of short-lived Ordovician continent-arc collision and the evolution of the Laurentian-Appalachian-Andean margin. In: MAC NIOCAILL, C. & RYAN, P. D. (eds) Continental Tectonics. Geological Society, London, Special Publications, 164, 55-107.
GLOVER, B. W., KEY, R. M., MAY, F., CLARK, G. C., PHILLIPS, E. R. & CHACKSFIELD, B. C. 1995. A Neoproterozoic multi-phase rift sequence: the Grampian and Appin groups of the southwestern Monadhliath Mountains of Scotland. Journal of the Geological Society, London, 152, 391-406.
GOLDSMITH, J. R. & NEWTON, R. C. 1969. P-T-X relations in the system CaCO3-MgCO3 at high temperatures and pressures. American Journal of Science, 267-A, 160-190.
GRAHAM, C. M. 1976. Petrochemistry and tectonic significance of Dalradian metabasic rocks of the SW Scottish Highlands. Journal of the Geological Society, London, 132, 61-84.
HALLIDAY, A. N., GRAHAM, C. M., AFTALION, M. & DYMOKE, P. 1989. The depositional age of the Dalradian Supergroup: U-Pb and Sm-Nd isotopic studies of the Tayvallich Volcanics, Scotland. Journal of the Geological Society, London, 146, 3-6.
HARRIS, A. L., HASELOCK, P. J., KENNEDY, M. J. & MENDUM, J. R. 1994. The Dalradian Supergroup in Scotland, Shetland and Ireland. In: GIBBONS, W. & HARRIS, A. L. (eds) A revised correlation of Precambrian rocks in the British Isles. Geological Society, London, Special Reports, 22, 33-53.
HARTE, B. 1988. Lower Palaeozoic metamorphism in the Moine-Dalradian belt of the British Isles. In: HARRIS, A. L. & FETTES, D. J. (eds) The Caledonian-Appalachian Orogen. Geological Society, London, Special Publications, 38, 123-134.
HIGHTON, A. J., HYSLOP, E. K. & NOBLE, S. R. 1999. U-Pb zircon geochronology of migmatisation in the northern Central Highlands: evidence for pre-Caledonian (Neoproterozoic) tectonometamorphism in the Grampian block, Scotland. Journal of the Geological Society, London, 156, 1195-1204.
HOLNESS, M. B. & GRAHAM, C. M. 1995. P-T-X effects on equilibrium carbonate-H2O-CO2-NaCl dihedral angles: constraints on carbonate permeability and the role of deformation during fluid infiltration. Contributions to Mineralogy and Petrology, 119, 301-313.
JEFFREYS, H. 1961. Theory of Probability. Oxford University Press, Oxford.
NOBLE, S. R., HYSLOP, E. K. & HIGHTON, A. J. 1996. High precision U-Pb monazite geochronology of the c. 806 Ma Grampian Slide and the implications for the evolution of the Central Highlands. Journal of the Geological Society, London, 153, 511-514.
PEARCE, T. H. 1968. A contribution to the theory of variation diagrams. Contributions to Mineralogy and Petrology, 19, 142-157.
PEARCE, T. H. 1970. Chemical variations in the Palisade Sill. Journal of Petrology, 11, 15-31.
PEARCE, T. H. 1987. The identification and assessment of spurious trends in Pearce-type ratio variation diagrams. A discussion of some statistical arguments. Contributions to Mineralogy and Petrology, 97, 529-534.
PEARSON, K. 1897. Mathematical Contributions to the Theory of Evolution. On a Form of Spurious Correlation which may arise when Indices are used in the Measurement of Organs. Proceedings of the Royal Society, 60, 489-498.
PIASECKI, M. A. J. & TEMPERLEY, S. 1988. Central Highland Division. In: WINCHESTER, J. A. (ed.) Later Proterozoic Stratigraphy of the Northern Atlantic Regions. Blackie, Glasgow, 46-53.
PIASECKI, M. A. J. & VAN BREEMAN, O. 1979. The 'Central Highland Granulites': cover-basement tectonics in the Moine. In: HARRIS, A. L., HOLLAND, C. H. & LEAKE, B. E. (eds) The Caledonides of the British Isles - Reviewed. Geological Society, London, Special Publications, 8, 139-144.
PRINGLE, J. 1940. The discovery of Cambrian trilobites in the Highland Border rocks near Callander, Perthshire (Scotland). Report of the British Association for the Advancement of Science, London.
ROLLINSON, H. R. 1992. Another look at the constant sum problem in geochemistry. Mineralogical Magazine, 56, 469-475.
ROLLINSON, H. R. 1993. Using geochemical data: evaluation, presentation, interpretation. Longman Scientific and Technical, Harlow.
SIEGEL, S. 1956. Nonparametric statistics for the behavioral sciences. McGraw-Hill, New York.
SMITH, M., ROBERTSON, S. & ROLLIN, K. E. 1999. Rift basin architecture and stratigraphical implications for basement-cover relationships in the Neoproterozoic Grampian Group of the Scottish Caledonides. Journal of the Geological Society, London, 156, 1163-1174.
SPEAR, F. S. 1993. Metamorphic phase equilibria and pressure-temperature-time paths. Mineralogical Society of America, Washington, DC.
STEPHENSON, D. & GOULD, D. 1995. The Grampian Highlands. Her Majesty's Stationery Office, London.
TANNER, P. W. G. 1995. New evidence that the Lower Cambrian Leny Limestone at Callander, Perthshire, belongs to the Dalradian Supergroup, and a reassessment of the 'exotic' status of the Highland Border Complex. Geological Magazine, 132, 473-483.
THOMAS, C. W. 1989. Application of geochemistry to the stratigraphic correlation of Appin and Argyll Group carbonate rocks from the Dalradian of northeast Scotland. Journal of the Geological Society, London, 146, 631-647.
THOMAS, C. W. 1995. The geochemistry of metacarbonate rocks from the Monadhliath Project area. British Geological Survey Technical Report WA/95/40R.
THOMAS, C. W. 2000. The petrology and isotope geochemistry of Dalradian carbonate rocks. PhD thesis, University of Edinburgh.
THOMAS, C. W. & AITCHISON, J. 1998. The use of logratios in subcompositional analysis and geochemical discrimination of metamorphosed limestones from the Northeast and Central Scottish Highlands. In: BUCCIANTI, A., NARDI, G. & POTENZA, R. (eds) Proceedings of the Fourth Annual Conference of the International Association for Mathematical Geology. International Association for Mathematical Geology, Ischia, 2, 549-554.
THOMAS, C. W., SMITH, M. & ROBERTSON, S. 1997. The geochemistry of Dalradian metacarbonate rocks from the Schiehallion District and Blargie, Laggan: implications for stratigraphical correlations in the Geal Charn - Ossian Steep Belt. British Geological Survey Technical Report WA/97/81.
THOMAS, C. W., GRAHAM, C. M., ELLAM, R. A. & FALLICK, A. E. 2004. 87Sr/86Sr chemostratigraphy of Neoproterozoic Dalradian limestones of Scotland and Ireland: constraints on depositional ages and timescales. Journal of the Geological Society, London, 161, 229-242.
TREAGUS, J. E. 2000. Solid Geology of the Schiehallion District. Memoir for 1:50000 Geological Sheet 55W (Scotland). The Stationery Office, London.
TUKEY, J. 1977. Exploratory Data Analysis. Addison-Wesley, Reading, Massachusetts.
VELLEMAN, P. F. & HOAGLIN, D. C. 1981. Applications, Basics and Computing of Exploratory Data Analysis. Duxbury Press, Boston, Massachusetts.
Discriminating geodynamical regimes of tin ore formation using trace element composition of cassiterite: the Sikhote'Alin case (Far Eastern Russia)
N. GORELIKOVA1, R. TOLOSANA-DELGADO2, V. PAWLOWSKY-GLAHN2, A. KHANCHUK3 & V. GONEVCHUK3
1Institute of Geology of Ore Deposits, Petrography, Mineralogy and Geochemistry RAS, Staromonetny Per., 35, 119017 Moscow, Russia (e-mail: [email protected])
2Departament d'Informàtica i Matemàtica Aplicada, Universitat de Girona, Campus Montilivi, P4, E-17071 Girona, Spain (e-mail: [email protected], [email protected])
3Far East Geological Institute of Far East Branch of RAS, Prospect Stoletia, 159, 690022, Vladivostok, Russia (e-mail: [email protected], gonevchuk@hotmail.com)
Abstract: A possible interpretation of the Sikhote'Alin accretion system (Asian margin of the Pacific Ocean) assumes that this region underwent an alternation of subduction and transform tectogenesis (here called the tectogenetic switch hypothesis). This palaeotectonic model fits well with the observed complexity of ore districts and deposits of the region. In this contribution, several statistical analyses are applied to a compositional dataset of trace elements in cassiterite obtained from this area. The goal is to assess the reliability of the tectogenetic switch hypothesis, based solely on cassiterite compositional information. First, biplots are used to get an insight into the variability of the data. Secondly, cluster analysis is applied to detect the existence of natural groups of samples, without using the existing geological information. Finally, discriminant analysis uncovers the main differences in the composition of cassiterite from the different groups obtained. Results highlight the contrast between areas formed under different tectogenetic environments, with subduction-related cassiterite being richer in siderophile elements (In, Fe, Sc, W, Cr) and transform-related cassiterite richer in lithophile elements (Mn, Zr). Further natural groups discriminate cassiterite samples depending on their V/Be ratio, which might be related to the age of the deposit. These results suggest that sources of ore magmas and fluids within the region might have a mixed or varied mantle-crust origin and support the tectogenetic switch assumption.
The Sikhote'Alin accretion folded system is located in the easternmost part of Asian Russia, north of the city of Vladivostok. It is a very complex system, containing several terranes of ages ranging from the Jurassic to the Neogene (Fig. 1), and its formation is open to debate. Some authors (e.g. Faure et al. 1995) consider this area as the result of a series of collisions of island-arc systems on the Asian main foreland. A newer interpretation of this area, following Khanchuk (2000), suggests that a better explanation of the observed characteristics of this structure is achieved by considering a switch between geodynamic regimes. According to this model of geodynamic evolution, a shift in the tectogenetic conditions occurred in the interval 100-45 Ma in this part of the Asian margin of the Pacific Ocean (Khanchuk et al. 2003). From a transform continental margin of Californian type (between the Early Cretaceous and the beginning of the Late Cretaceous), the margin changed to an active regime of Andean type (from the Late Cretaceous to the Paleocene), and switched back to a transform margin type again (during the Eocene). Such a process is called here a tectogenetic switch. Under continental transform margin conditions, active magmatism is connected with the asthenospheric diapiric penetration into slab-windows of the subducted lithosphere, able to produce the formation of major tin deposits with polymineral compositions. The magmatism over slab-window zones is characterized by an evolution of the fluids in the magmatic chamber from basic to acid compositions, by low partial pressure of H2O and H+ and high partial pressure of B and F, and also by the prolonged development of these magmatic chamber systems. This might give rise to large polygenetic deposits of tin and tungsten. On the contrary, subduction-related magmatism within active margins is characterized by the inverse sequence of magmatic series (from acid to basic), by high partial pressure of H2O and H+ and by tin-polymetal mineralizations
From: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 43-57. 0305-8719/06/$15.00 © The Geological Society of London 2006.
N. GORELIKOVA ET AL.
Fig. 1. Geological map of the Russian Far East (after Khanchuk & Kemkin 2003). Tin regions: 1, Komsomol'sk (35 x 25 km2); 2, Kavalerovo (35 x 50 km2).
ORE FORMATION GEODYNAMICS BY CASSITERITE
Fig. 2. Geodynamic model of ore-magmatic systems for the Russian Far East. Legend: 1, volcanics of Sikhote'Alin belt; 2, accretion complex; 3, suboceanic complex; 4, turbidite complex; 5, accretion complex; 6, accretion prism; 7, basalt layer of the 'transition' crust; 8, basalt layer of the oceanic crust; 9, upper mantle; 10, lower mantle (asthenosphere); 11, mantle diapirs.
(Khanchuk et al. 2003) (Fig. 2). This Andean-type environment is related to poly-formational systems, which may have undergone many periods of magmatic activity. Some experimental evidence supports the tectogenetic switch hypothesis. Isotope geochemical data (Gonevchuk 2002), for example, show that sources of magmas and associated fluids within the region have a heterogeneous mantle-crust origin. Moreover, this model offers a natural explanation for the observed multi-stage character of the polymineral ore bodies present in this region, as well as for the geochemical differences observed in their several stages: the formation of some deposits occurred during several (2-3) mineralization stages connected with different geodynamic
settings. In fact, the proposed model was suggested primarily by Khanchuk (2000) to explain the complexity of the many tin- and tungsten-bearing deposits of this region, which are of capital economic importance. The goal here is to search for statistical evidence for this theory, by studying the compositional differences in cassiterite from some of the tin-bearing deposits of the region. Why cassiterite? This choice is determined by two factors: (1) cassiterite, being the main mineral of various genetic groups of tin deposits, crystallizes during an ore-forming stage and therefore reflects physico-chemical conditions of mineralization - providing the most important genetic information; (2) geochemical trace element associations in cassiterite are studied in
detail for many tin deposits, thus the approach may be extended easily in the future to other tin districts. Using trace element paragenesis on a mineral species to complement classical petrologic mineral paragenesis is an old idea (Fersman 1953; Shcheka et al. 1987), since both are considered essential indicators of metallogenic provinces and of its genetic type. In this line, this work focuses on the trace element composition of cassiterite, instead of on the mineral paragenesis accompanying it. The studied database involves 600 analyses of trace elements (In, Sc, W, Nb, Zr, Be, Fe, Ti, Cr, V, Mn). To obtain them, cassiterite grains were selected from tin ores under binocular microscope - possible because of the coarse-grained nature of both cassiterite and the accompanying minerals. Cassiterite grains were powdered in an agate mortar and leached with acids to remove other mineral traces. Samples of cassiterite were analysed by quantitative spectral analysis using spectrograph Qu-24 Zeiss Jena and Adam Hilger with quartz optics (Harrison 1949). For the analysis, samples weighting up to 10 mg were admitted into coal electrodes and burnt down. Spectra of atoms were recorded on photo plates and spectrograms were deciphered using atlases of analytical lines (Gossler 1942; Brode 1943; Crook 1935). Spectral response is 1-3 ppm, duplication is characterized by rms. error in the 10-25% area. Accuracy of an analysis is represented by a relative error in the 10-20% area. Based on a classical multivariate statistical analysis of an extended version of this dataset, Gorelikova & Tchijova (1997) argued that the available trace elements appear to give reliable criteria for genetic type identification and mineralization productivity assessment. According to them, the trace element paragenesis is conditioned by different factors, such as the deep crust structure, the associated magmatism type and the genetic type of the deposit. 
Their results indicate that the geochemical associations of minerals formed in the continental basement are characterized by a predominant lithophile composition (Be, Nb, Zr, Mn), while siderophile trace elements (Fe, Ti, V, Cr, Ni) are preferentially linked to potassic series. Such potassic series are typical for intermediate zones between ocean and continent (Gorelikova 1988). Using classical factor analysis, the authors found that: (1) W-Be-Sc or Be-W-In-Fe associations appear to be typical for cassiterite from rare-metal ores; (2) the Ti-V-Cr association might be typical for cassiterite-quartz ores; (3) the Sc-Nb-Ti-V association might be typical in cassiterite-tourmaline and cassiterite-chlorite ores; and (4) cassiterite-sulphide ores would behave in a way similar to the silicate paragenesis, showing a preferential enrichment in Ti-V-Nb-Cr.
Setting
Within the Sikhote'Alin accretion folded system, there are several metallogenic belts: from them, the Khingan-Okhotsk and the Sikhote'Alin (Fig. 1) belts were selected for this study. Within the Khingan-Okhotsk metallogenic belt, the Komsomol'sk ore-magmatic system was considered and, within this system, the present study analyses the deposits of Festival'noe and Pereval'noe. Within the Sikhote'Alin metallogenic belt, the Kavalerovo ore region was selected, represented by the tin deposits of Vysokogorskoe and Arsen'evskoe. These regions and deposits, jointly with the zones in which they are divided, are included in Table 1, which describes their main geological characteristics. According to the Khanchuk (2000) model, the formation of the tin deposits of the whole Komsomol'sk ore region (Cretaceous) and Vysokogorskoe deposit (Eocene) in Kavalerovo is connected to a Californian-type environment, and Arsen'evskoe deposit (Late Cretaceous-Paleocene), also in the Kavalerovo region, to an active continental Andean margin. These differences in the assumed tectonic environment justify the choice of these deposits. Moreover, Arsen'evskoe is one of the most complex, largest and most important deposits from an economic point of view. The main characteristics of these regions and deposits, as well as their internal zones, are described in Table 1. It includes the geodynamic regime assumed to govern the formation of each ore body, the ages of the mineralization and of the associated magmatic complex, and the main mineral paragenesis. An extended description of the ore bodies is compiled in Appendix A, due to the lack of references outside the Russian literature.

Dataset description
For each sample of cassiterite a composition of trace elements included in the crystalline structure of this mineral is recorded, and they are reported in percentages. Once complemented with the rest up to 100 (approximately representing the amount of tin), they form a closed vector and fall within the scope of compositional data analysis. Therefore, log-ratio techniques, introduced by Aitchison (1986) and described in section three of this volume, are applied. First, the clr coefficients of the compositional vectors $x = [x_1, x_2, \ldots, x_{12}]$ are computed using the clr transformation,

$$\mathrm{clr}(x) = \left[\ln\frac{x_1}{g(x)},\ \ln\frac{x_2}{g(x)},\ \ldots,\ \ln\frac{x_{12}}{g(x)}\right], \quad \text{where } g(x) = \sqrt[12]{x_1 \cdot x_2 \cdots x_{12}}, \qquad (1)$$

given that the dataset has 12 parts (11 trace elements and the rest). Using the clr-transformed observations, some descriptive statistics can be computed, and those which are interpretable in terms of the original parts can be back-transformed. This is the case of the mean, computed first as the average of the clr-transformed dataset, then taking exponentials of each component and finally closing it to 100%. Table 2 reports the global mean and the means by zones, including also the number of samples in each zone and other global descriptive statistics. To explore the dataset in more detail, the biplot is used (Gabriel 1971; Aitchison & Greenacre 2002; Daunis-i-Estadella et al. 2006). It is a joint graphical representation of both variables and individuals projected onto the plane of the first two principal components obtained from the clr-transformed variables. Being a projection, this plot displays only part of the variability of the data, here 68% of the total variability (Fig. 3). Therefore, conclusions based on visual perceptions should be considered with care. To assess visual differences between the zones, in Figure 3 each sample is plotted using a different symbol for its ore body. At first sight, there is a clear separation to the right of the plot of most of the samples in the Yuzhnaya zone. Following a circular pattern clockwise, the mixture of some samples of Yuzhnaya and Induktzionnaya can be recognized. The next group of samples is a mixture of 8th Vostok, Kulisnaya, Severnaya, Tektonicheskaya, and the two Fel'zitovaya zones. Finally, the last group of samples corresponds to a sequence going from Podruzhka through Tourmalinovaya to Geophysicheskaya. However, there is no clear separation of the two regions, Komsomol'sk and Kavalerovo. Nevertheless, at the level of the deposits, and considering the regions separately, some trends can be recognized.
For example, the two deposits in the Komsomol'sk region, Festival'noe and Pereval'noe, are quite well separated. Within Kavalerovo, separating Vysokogorskoe from Arsen'evskoe, the separation of the 8th Vostok and Kulisnaya from the Tektonicheskaya samples can be recognized. But it is the overlapping of some zones, such as the similarity of the latitudinal zones (Tourmalinovaya and Podruzhka) with those of the Komsomol'sk region, which suggests that the geodynamic environment exerts a primary control over the mineralization regime. At least, the pure geographical control appears not to be sufficient to define different groups.
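The clr transformation and the back-transformed mean described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function names (`close`, `clr`, `compositional_mean`) and the three-part sample values are hypothetical, and all parts are assumed to be strictly positive, since the logarithm of a zero part is undefined.

```python
import math

def close(parts, total=100.0):
    """Rescale positive parts so that they sum to `total`."""
    s = sum(parts)
    return [total * p / s for p in parts]

def clr(parts):
    """Centred log-ratio: ln(x_i / g(x)), with g(x) the geometric mean."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def compositional_mean(samples, total=100.0):
    """Average in clr space, exponentiate, then close to `total`."""
    clr_rows = [clr(row) for row in samples]
    n = len(clr_rows)
    avg = [sum(col) / n for col in zip(*clr_rows)]
    return close([math.exp(v) for v in avg], total)

# Hypothetical 3-part compositions in percent (not data from the study):
# two trace elements plus the rest up to 100.
samples = [
    [0.02, 0.08, 99.90],
    [0.05, 0.05, 99.90],
    [0.01, 0.20, 99.79],
]
centre = compositional_mean(samples)
```

The clr coefficients of any sample sum to zero, and `centre` coincides with the closed vector of column-wise geometric means, which is why this mean remains interpretable in terms of the original parts.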
Fig. 3. Biplots of cassiterite trace elements from the studied tin deposits (Far Eastern Russia), distinguishing samples according to their zone of origin (variables use the upper and right scales, data use the lower and left scales; horizontal scale is the first principal component direction, and vertical scale the second principal component one). Legend: Geophysicheskaya, Severnaya, Fel'zitovaya1, Fel'zitovaya2, Induktzionnaya, Podruzhka, Yuzhnaya, Tourmalinovaya, 8th Vostok, Kulisnaya, Tektonicheskaya.

Methodology
Using the clr-transformed dataset, Ward cluster analysis, binary logistic discriminant analysis and
classical linear regression were applied (Fahrmeir & Hamerle 1984; Krzanowsky 1988). Cluster analysis works with a measure of similarity and looks for an optimal splitting of the sample according to a given criterion (Fahrmeir & Hamerle 1984). It looks for natural groups without using information about pre-existing groups. Ward's method takes as similarity measure the homogeneity (computed as the sum of squared distances between the centre of a group and all individuals within the group) and looks for the splitting which produces the smallest loss of homogeneity. Generally, it results in compact, spherical clusters. Martín-Fernández (2001) discussed drawbacks and advantages of several clustering techniques applied to compositional data, as well as the influence of the distance used. Here, a Ward cluster analysis is applied to the standardized, clr-transformed 12-part compositional dataset. Binary logistic discriminant analysis looks for a linear function which predicts the log-odds of allocation of an individual to one of two groups. This linear function is constructed as a projection of the available explanatory variables onto a direction, which is obtained by regressing a binary variable (valued 0 for one group, and 1 for the other) against the explanatory variables (Krzanowsky 1988). Barceló-Vidal (1996) introduced and discussed equivalent parametric discrimination techniques for compositional data. Here, the logistic technique is applied to the clr-transformed dataset. A classical measure of goodness of discrimination is obtained by computing misclassification rates: the proportion of samples from each group which are classified erroneously by the discrimination rule.
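As an illustration of the binary logistic discrimination and of the misclassification rates just described, the following sketch fits the log-odds as a linear function of the explanatory variables and then counts wrongly allocated samples per group. It is an assumption-laden toy: the fitting routine (plain gradient ascent on the log-likelihood), the step size, the function names and the two-variable data are all hypothetical and are not taken from the study.

```python
import math

def fit_logistic(X, y, steps=2000, rate=0.1):
    """Fit log-odds = w[0] + sum_j w[j+1] * x_j by plain gradient ascent."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)                      # w[0] is the intercept
    for _ in range(steps):
        grad = [0.0] * (p + 1)
        for row, target in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], row))
            prob = 1.0 / (1.0 + math.exp(-z))
            err = target - prob              # gradient of the log-likelihood
            grad[0] += err
            for j, xj in enumerate(row):
                grad[j + 1] += err * xj
        w = [wj + rate * g / n for wj, g in zip(w, grad)]
    return w

def log_odds(w, row):
    """Value of the fitted linear discriminating function for one sample."""
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], row))

def misclassification_rates(w, X, y):
    """Return (global rate, {group: rate}); positive log-odds means group 1."""
    wrong = {0: 0, 1: 0}
    count = {0: 0, 1: 0}
    for row, target in zip(X, y):
        predicted = 1 if log_odds(w, row) > 0 else 0
        count[target] += 1
        wrong[target] += predicted != target
    per_group = {g: wrong[g] / count[g] for g in count}
    return sum(wrong.values()) / len(y), per_group

# Hypothetical clr-like scores for six samples in two groups (0 and 1).
X = [[-1.0, 0.2], [-0.8, -0.1], [-1.2, 0.0], [0.9, 0.1], [1.1, -0.2], [0.7, 0.3]]
y = [0, 0, 0, 1, 1, 1]
w = fit_logistic(X, y)
global_rate, per_group = misclassification_rates(w, X, y)
```

With real clr-transformed data one coordinate is redundant, because clr coefficients sum to zero; a dedicated statistics package would handle that constraint and the convergence checks that this sketch omits.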
Results
Table 3. Misclassification rates of groups, defined according to regions, deposits and zones, as well as the several discrimination analyses performed with groups defined by the Ward clustering method

Criterion                  Global misclassification  Highest misclassification  Worst classified
                           rate (%)                  rate in a group (%)        group
Region                     22                        31                         Komsomol'sk
Deposit                    22                        100                        Vysokogorskoe
Zone                       35                        87.5                       Fel'zitovaya1
Ward groups (full)         3                         17                         group 3
Ward groups (simplified)   5                         14                         group 2

Note: 'Full' uses the discrimination rules defined in Table 5, whereas 'simplified' uses the discrimination rules of equations (2)-(4).

As a first approach, discriminant analysis was applied to existing groups defined by regions, deposits and zones (Table 1). In spite of the small number of samples (Table 2) compared with the number of variables, misclassification rates were clearly unsatisfactory (more than 20% of samples are classified badly), thus supporting the idea that the formation process of these ore bodies was
highly complex. Global misclassification rates and maximal misclassification rate for a single group are included in Table 3. Given this unsuccessful direct approach, the Ward clustering technique was applied to the standardized clr-transformed dataset to look for natural associations between observations, independently of the original geotectonic information. Results suggested the existence of, at most, four different groups, which were a posteriori cross-tabulated with the existing zones of the dataset (Table 4). The first group is identified with the Vysokogorskoe deposit, Fel'zitovaya zones (Arsen'evskoe deposit) and Pereval'noe deposit (Komsomol'sk district). The second group includes Tourmalinovaya and Podruzhka zones from the Arsen'evskoe deposit, as well as the Festival'noe deposit from Komsomol'sk district. The third group corresponds to a subgroup of samples of Yuzhnaya zone and to Induktzionnaya zone from the Arsen'evskoe deposit, while the remaining samples of Yuzhnaya zone form group 4.
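The clustering and cross-tabulation steps can be illustrated with a small, self-contained sketch. It implements Ward's criterion directly (at each step, merge the two clusters whose union gives the smallest increase in within-cluster sum of squares) and then counts samples per zone and cluster. The function names, the toy coordinates and the zone labels "A" and "B" are hypothetical, and assigning each zone to its majority cluster is a simplifying assumption: in the study one zone (Yuzhnaya) was deliberately kept split between two groups.

```python
from collections import Counter

def standardize(rows):
    """Centre each column and scale it to unit standard deviation."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [(sum((v - m) ** 2 for v in c) / (n - 1)) ** 0.5
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]

def ward_clusters(rows, k):
    """Agglomerate until k clusters remain, using Ward's merging cost."""
    clusters = [([i], list(r)) for i, r in enumerate(rows)]  # (members, centroid)
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                (ma, ca), (mb, cb) = clusters[a], clusters[b]
                d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
                # Increase in within-cluster sum of squares if a and b merge.
                cost = len(ma) * len(mb) / (len(ma) + len(mb)) * d2
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        (ma, ca), (mb, cb) = clusters[a], clusters[b]
        na, nb = len(ma), len(mb)
        merged = (ma + mb, [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)])
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    labels = [0] * len(rows)
    for g, (members, _) in enumerate(clusters, start=1):
        for i in members:
            labels[i] = g
    return labels

def crosstab(zones, labels):
    """Count samples per (zone, cluster) and find the majority cluster per zone."""
    counts = Counter(zip(zones, labels))
    majority = {}
    for zone in set(zones):
        per_cluster = {g: c for (z, g), c in counts.items() if z == zone}
        majority[zone] = max(per_cluster, key=per_cluster.get)
    return counts, majority

# Hypothetical clr-like scores for six samples from two zones.
rows = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
zones = ["A", "A", "A", "B", "B", "B"]
labels = ward_clusters(standardize(rows), k=2)
counts, majority = crosstab(zones, labels)
```

The pairwise search makes this sketch cubic in the number of samples, which is acceptable for a few hundred analyses; cluster numbers are arbitrary labels and carry no geological meaning by themselves.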
Table 4. Classification of tin deposits by Ward's cluster analysis

Region        Deposit         Zone               Final Ward group
Komsomol'sk   Festival'noe    Geophysicheskaya   2
Komsomol'sk   Pereval'noe     Severnaya          1
Kavalerovo    Arsen'evskoe    Fel'zitovaya1      1
Kavalerovo    Arsen'evskoe    Fel'zitovaya2      1
Kavalerovo    Arsen'evskoe    Induktzionnaya     3
Kavalerovo    Arsen'evskoe    Podruzhka          2
Kavalerovo    Arsen'evskoe    Yuzhnaya           3 and 4
Kavalerovo    Arsen'evskoe    Tourmalinovaya     2
Kavalerovo    Vysokogorskoe   8th Vostok         1
Kavalerovo    Vysokogorskoe   Kulisnaya          1
Kavalerovo    Vysokogorskoe   Tektonicheskaya    1

[Sample counts per Ward cluster (columns 1 to 4), in the order in which they survive in the source; their cell positions are not recoverable: 2 4 8 8 1 1 1; 10; 4; 5 4 10; 4 7 5 5; 3; 69.]

The number of samples of each zone falling in each group obtained by cluster analysis is reported. The final column reports to which group the whole zone was finally attached. For example, for the rest of the analysis, Geophysicheskaya was considered wholly in group 2.

Fig. 4. Biplots of cassiterite trace elements from the studied tin deposits (Far Eastern Russia), distinguishing samples according to the groups obtained from Ward's cluster analysis. Note that groups tend to have a similar spread (as expected from a Ward cluster) and that the biplot is the same as in Figure 3 (since grouping does not change the covariance structure of the variables).

Table 5. Coefficients of the linear discriminating rules between groups, following a hierarchical scheme: Andean (3-4) vs. Californian (1-2) margin type, main Yuzhnaya (4) vs. Induktzionnaya (3) zones in Arsen'evskoe deposit, and Geophysicheskaya, Podruzhka and Tourmalinovaya zones (2) vs. the rest (1)

            Discriminating loadings                Correlation coefficients
            1 & 2 vs. 3 & 4   3 vs. 4   1 vs. 2    1 & 2 vs. 3 & 4   3 vs. 4   1 vs. 2
clr(In)     0.567             1.956     -0.043     0.863*            0.801*    0.706*
clr(Sc)     0.195             0.315      0.459     0.610*            0.540     0.431
clr(W)      0.432             0.045     -0.030     0.678*            0.283     0.373
clr(Nb)     0.093             0.099      0.604     0.237             0.135     0.035
clr(V)      0.149             1.381     -0.147     0.384             0.782*    0.608*
clr(Cr)     0.121             0.120      0.020     0.477             0.074     0.200
clr(Be)     0.251             0.988      0.316     0.558             0.845*    0.702*
clr(Ti)     0.049             0.527     -2.199     0.552             0.604*    0.871*
clr(Zr)     0.375             0.121     -0.080     0.813*            0.461     0.572
clr(Fe)     0.243             1.948      0.469     0.756*            0.314     0.495
clr(Mn)     0.539             0.217     -0.383     0.817*            0.544*    0.568*
clr(Rst)    0.200             0.536      1.014     0.568             0.613     0.219

Highest correlation coefficients have been marked with an asterisk (*). [Minus signs are legible only in the '1 vs. 2' loadings column; in the other columns they are detached from their values in the source and only magnitudes are shown.]

The distribution pattern in the original biplot (Fig. 4) shows a clear visual difference between groups 1-2 versus groups 3-4. Such a difference can be related to the fact that both groups 1 and 2 are considered to be formed under a transform margin regime, whereas formation of groups 3 and 4 is regarded to be linked to an active margin (Table 1). The good separation of the obtained groups suggests that a hierarchical binary logistic discriminant analysis could be applicable. First, zones from Andean versus Californian regimes were discriminated (groups 1 and 2 vs. groups 3 and 4). Afterwards, discrimination inside these regimes was conducted. The obtained discriminating rules among the groups produce misclassification rates below 4%. The coefficients in Table 5 correspond to the discriminating directions expressed as compositions. Table 3 reports misclassification rates of this hierarchical discriminating system. The obtained discriminating functions were not easy to interpret. Therefore, the next step was to
look for balances with the highest correlation with each discriminating rule. Balances are log-ratios of the geometric mean of the parts of two subcompositions, multiplied by a normalizing constant (Egozcue & Pawlowsky-Glahn 2005, 2006). To determine these balances, those elements which had clr-transformed values with the highest correlation with each discrimination function (Table 5) were selected first, afterwards all balances between these elements were computed, and finally those with the highest correlation with the discriminating functions were selected. The first discriminating functions, expressed as a linear function of a balance and afterwards rounding the coefficients, resulted in

$$f_1 \approx 0.96 \ln\frac{(\mathrm{In}\cdot\mathrm{Fe}\cdot\mathrm{Sc}\cdot\mathrm{W}\cdot\mathrm{Cr})^2}{(\mathrm{Mn}\cdot\mathrm{Zr})^5} - 0.115 \approx \ln\frac{(\mathrm{In}\cdot\mathrm{Fe}\cdot\mathrm{Sc}\cdot\mathrm{W}\cdot\mathrm{Cr})^2}{(\mathrm{Mn}\cdot\mathrm{Zr})^5} - 0.12, \qquad (2)$$

which has positive values in groups 3 and 4 (Andean regime) and negative values in groups 1 and 2 (Californian regime). Inside the Andean group, groups 3 and 4 are discriminated using the simplified second rule,

$$f_2 \approx 3.91\,[\ldots]\;1.16 \approx 4\ln\frac{\mathrm{Be}}{\mathrm{V}} + [\ldots]\ln\frac{\mathrm{In}}{\mathrm{V}}\;[\ldots]\;1.2, \qquad (3)$$

characterized by negative values in group 4 and positive values in group 3. Finally, discrimination inside the Californian group is based on using the simplified third rule,

$$f_3 \approx 1.06\ln\,[\ldots] + 0.07 \approx \ln\frac{\mathrm{Be}\cdot\mathrm{Nb}}{\mathrm{V}\,[\ldots]}, \qquad (4)$$

with negative values in group 2 and positive values in group 1. The similarity between the simplified second and third functions suggests that the ratio Be/V is very informative. The scatter-plot (Fig. 5) of the first simplified discriminant function ($f_1$) and ln(Be/V) clearly separates the four groups. Note that in this figure, the borders between the groups in the vertical direction do not coincide with the zero level. This is due to the means of ln(In/V) and ln(Nb/V), which were subtracted from equations (3) and (4), respectively. Misclassification rates of this balance discriminating system are also reported in Table 3.

Fig. 5. Scatter-plot of the linear discriminating functions approximated by balances among groups defined by Ward cluster analysis, with indication of mean values for each group (horizontal axis: simplified first discriminant function). The legend is the same as in Figure 4.
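A balance between two groups of parts, and the sign-based hierarchical allocation described in the text, can be sketched as follows. The normalizing constant sqrt(rs/(r+s)), with r and s the numbers of parts in each group, is assumed here to be the one meant by the cited definition of balances. The function names, the sample values and the use of zero as the threshold for every rule are hypothetical choices for illustration only.

```python
import math

def balance(numerator, denominator):
    """sqrt(r*s/(r+s)) * ln(g(numerator) / g(denominator)), g = geometric mean."""
    r, s = len(numerator), len(denominator)
    log_g_num = sum(math.log(v) for v in numerator) / r
    log_g_den = sum(math.log(v) for v in denominator) / s
    return math.sqrt(r * s / (r + s)) * (log_g_num - log_g_den)

def assign_group(f1, f2, f3):
    """Hierarchical allocation following the sign conventions of the text."""
    if f1 > 0:                        # Andean regime: groups 3 and 4
        return 3 if f2 > 0 else 4
    return 1 if f3 > 0 else 2         # Californian regime: groups 1 and 2

# Hypothetical part values in percent for one sample (not data from the study).
sample = {"In": 0.002, "Fe": 0.15, "Sc": 0.004, "W": 0.03, "Cr": 0.001,
          "Mn": 0.01, "Zr": 0.02}
num = [sample[k] for k in ("In", "Fe", "Sc", "W", "Cr")]
den = [sample[k] for k in ("Mn", "Zr")]

b = balance(num, den)
# The same balance written as a single log-ratio with integer exponents.
log_ratio = math.log(math.prod(num) ** 2 / math.prod(den) ** 5)
```

For five parts against two, the constant is sqrt(10/7), so `b` equals `math.sqrt(10 / 7) / 10 * log_ratio`, roughly 0.12 times the logarithm of the ratio with exponents 2 and 5. `math.prod` requires Python 3.8 or later.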
Discussion
Balance (2) compares two groups of parts: (In, Fe, Sc, W, Cr) vs. (Mn, Zr). The ratio of the first against the second group is greater in those samples from the Andean type vs. the Californian type. These groups have a relative resemblance to the siderophile (Fe, Ti, V, Cr, Ni) and lithophile (Be, Nb, Zr, Mn) associations of Gorelikova & Tchijova (1997). They suggest that Andean-regime fluids have a higher proportion of siderophile trace elements, while Californian-regime fluids are relatively richer in lithophile trace elements. Another interesting difference between groups 1-2 vs. 3-4 is the presence/absence of tourmaline (Table 1). In the Californian group, group 1 (Fel'zitovaya zones in the Arsen'evskoe deposit, jointly with Vysokogorskoe and Pereval'noe deposits) is characterized by higher Be/V ratios than group 2 (an early stage represented by Tourmalinovaya, and Podruzhka in the Arsen'evskoe deposit, plus Geophysicheskaya). In general terms, group 2 zones are older than those integrated in group 1, and intense metamorphism is reported for some of these older zones. Therefore, one could tentatively relate equation (4) with age. Thus, this rule discriminates zones formed during the early Cretaceous episode from those formed during the Eocene. Alternatives would be to relate equation (4) with the influence or superposition of metamorphism, as well as with a balance between tourmaline parageneses, dominant in group 2 (Table 1), against sulphide and quartz parageneses, dominant in group 1. Under an Andean regime, group 4 (main part of the Yuzhnaya zone in the Arsen'evskoe deposit) is characterized by higher V (or lower Be and In) content than group 3 (formed by a small part of the Yuzhnaya zone, and the whole Induktzionnaya zone, both in the Arsen'evskoe deposit). No mineral difference may be associated with these groups, according to Table 1.
Finally, note that the results of Gorelikova & Tchijova (1997) are not inconsistent with those obtained here, although theirs were obtained using a classical approach and those presented here using a log-ratio one, showing that they obtained a reasonable approximation. Here, siderophile fluids would be indicated by the association (In, Fe, W, Sc, Cr), whereas lithophile fluids would be linked to the association (Mn, Zr). Mineral parageneses are not so clearly linked to trace element parageneses.
Conclusions
Geochemical associations in cassiterite from tin-bearing zones of Far Eastern Russia allow the classification of the zones in groups corresponding
to probable geodynamic settings: an early Cretaceous transform margin (Pereval'noe deposit and some metamorphosed zones of the Arsen'evskoe deposit), an active margin (zones Yuzhnaya and Induktzionnaya from the Arsen'evskoe deposit) and, finally, a late Eocene transform margin (Vysokogorskoe deposit and more recent zones of Arsen'evskoe). On the basis of these results, it is plausible that the differences in cassiterite compositions are governed primarily by the fluids involved, which are ultimately determined by the geodynamic regime. Geochemical differences could then be used to identify tin deposits formed under various settings and to give additional support to the geodynamic model of ore deposits of Khanchuk (2000). Complementary results show that those deposits or zones formed under a Californian regime are almost always marked by the presence of tourmaline in their mineral paragenesis, whereas this silicate does not appear in those ore bodies formed under an Andean regime. A further division of both the Californian and the Andean groups is achieved by the ratio Be/V. In the Californian group, this ratio separates acceptably well ore bodies formed in the Cretaceous and those formed in the Eocene. The interpretation of this division within the Andean group needs further investigation.

The work has been financially supported by the Russian Basic Research Foundation through the project N 04-0565270 and the Spanish Ministry of Education and Science through the project BEM2003-05640/MATE. The original manuscript greatly benefited from reviews by O. Vaselli and G. Bonham-Carter, to whom the authors express their gratitude.
Appendix A: Geological characterization of studied ore bodies
Komsomol'sk region
The formation of the Khingan-Okhotsk metallogenic belt, including the Myao-Chan zone from the Cretaceous stage (to which the tin mineralization is related), happened under both the early transform margin regime and the subduction margin (Gonevchuk et al. 2000; Khanchuk 2000; Khanchuk & Kemkin 2003), covering the whole Cretaceous. In the Myao-Chan zone (Komsomol'sk ore region) tin deposits are represented by mineral associations of cassiterite-silicate-sulphide type composed of thick quartz-tourmaline zones. The main tin-bearing magmatic association of this region is dated in the interval 102-85 Ma (Gonevchuk 2002). The most probable age of tin mineralization is 95-84 Ma, and some isotope dating fixes it at 103-75 Ma. Petrochemical and geochemical features of magmatic and ore assemblages define the Komsomol'sk ore-magmatic system as a heterogeneous mantle-crust one. From the several deposits of this region, Festival'noe and Pereval'noe are included in the study.
Festival'noe deposit. The Festival'noe deposit is located in the southern linear Perevalnensk ore structure within the Jurassic terrigenous-volcanic rocks of Mesozoic age. The main ore zones are Yagodnaya and Geophysicheskaya, dominated by copper-tin-sulphide paragenesis. Hosting structures are south-strike fractures, with related quartz-feldspar metasomatites regarded as an early mineralization stage. Mineralization of quartz-cassiterite, quartz-sulphide and quartz-carbonate composition occurred in the following stages. The mineral composition of these ore stages is rather complex: cassiterite, arsenopyrite, wolframite, scheelite, chalcopyrite, pyrrhotite, stannite, magnetite, galena, sphalerite and pyrite, among others. Besides, there are mainly: quartz, tourmaline, chlorite, micas, fluorite, calcite and siderite, with an accessory presence of wolframite and arsenopyrite (particularly with quartz). Sulphides are distributed widely in thick mineralization veins. Chalcopyrite and pyrrhotite are widespread. Cassiterite is represented by meso-crystalline and coarse-crystalline varieties and its maximal abundance is related to the structural discordance between sedimentary and volcanic rocks. The formation of ore zones from the Festival'noe deposit occurred over a long period under the influence of intensive tectonic deformations. This study considers the mineralization of the zone Geophysicheskaya.

Pereval'noe deposit. The Pereval'noe deposit is located in the northern part of the Komsomol'sk region within the large Amut syncline composed of the thick volcanic Choldamy and Amut series of Cretaceous age. The strata of the Early Cretaceous volcanic-sedimentary rocks are represented by conglomerates, breccia, sands, gritstone, tuffs, and tuff breccia of quartz porphyries which are placed above pyroxene-plagioclase porphyrites.
Intrusive rocks of the Amut syncline are dykes and stocks of diorites and gabbro-diorites, intruded within the Jurassic sedimentary rocks with a sharp angle. Severnaya zone is the most important one in this deposit. It is composed of quartz-tourmaline-sulphide ores. It has a complicated morphology: a thick vein of quartz-tourmaline metasomatites is cut by cassiterite-quartz veins with sulphides, among which arsenopyrite, pyrrhotite, chalcopyrite and pyrite prevail. In the upper part of the porphyrite section there are series of imbricate zones and veins with a Pb-Zn mineralization stockwork. In the deposit, a mineralogical zonation is observed as a change in the quartz-sulphide mineralization at the upper horizons by the cassiterite-quartz one at the middle level, and the quartz-tourmaline at the lower horizons. Based on a study of the structural-morphological features of ores, it is believed that the Severnaya zone was formed in several stages: (1) the quartz-tourmaline stage; (2) the quartz-cassiterite stage, with formation of wolframite, arsenopyrite and sericite; (3) the quartz-carbonate-sulphide stage, with formation of chlorite, pyrrhotite, chalcopyrite, sphalerite, galena and bournonite; and (4) the calcite stage, with formation of chalcedony, pyrite and marcasite.
Kavalerovo region
Within the Sikhote'Alin metallogenic belt, tin deposits from the Kavalerovo ore region, the largest and most studied, are considered. In accordance with the 'tectogenetic switch' hypothesis, the Kavalerovo region is located within the Taukhinsky terrane of the Early Cretaceous accretion prism and the Zhuravlevsky terrane of the Jurassic-Early Cretaceous turbidite basin accreted to the continent in the Late Albian (Golozubov & Khanchuk 1995). The tin-bearing magmatism of this region is dated in the interval 115-45 Ma and, according to Khanchuk (2000) and Khanchuk et al. (2003), it is related to both the geodynamic regimes of transform margin (Early Cretaceous) and subduction (Middle-Late Cretaceous), although the Paleocene transform margin significantly altered it. Most deposits of this region suffered two stages of tin mineralization, namely, the Late Cretaceous and the Paleocene ones, both with a varied composition (Finashin 1986; Tomson et al. 1996). This work considers tin zones of the Vysokogorskoe and Arsen'evskoe deposits.
Vysokogorskoe deposit. The Vysokogorskoe deposit (Kokorin et al. 2001) is located in the eastern part of the region and is related to a tectonic block of terrigenous-carbonaceous rocks of the Taukhinsky terrane (Cretaceous-Paleocene volcanic materials). Isotope dating shows that the ore-magmatic association of this deposit was formed in the interval 95-45 Ma. However, some results fix the age of the early granodiorites at 105 Ma. In the opinion of most researchers, three different mineralization stages related to several magmatic stages occurred in this deposit. About one hundred ore veins have been identified, as well as stockwork zones and mineralizations in explosion breccia. The main age of mineralization is 55-45 Ma, using K/Ar data (Finashin et al. 1978), and the formation of the main ore zones corresponds to the regime of a young transform margin. The main mineralization stage is a tin ore, formed by metasomatic veins of complex morphology and mineralization, as well as breccia zones of about 1 km width, related to the Silinsky jointing zone. At approximately 500 m depth, mineralization in explosive breccia replaces the veins. These explosive breccia, made of fragments of siltstones and silica, are cemented by tourmaline, sericite, chlorite, quartz, cassiterite and sulphides. The early formation stage of this explosive breccia is dominated by a quartz-tourmaline paragenesis, with fragments of microquartzites cemented by a fine-grained aggregate. The later quartz-sulphide assemblage is formed by quartz, pyrrhotite, pyrite, sphalerite, chalcopyrite and Ag-galena. The final stage is characterized by a quartz-fluorite-carbonate paragenesis, with accessory pyrite, galena and sphalerite. The vein series of quartz-tourmaline-chlorite ores, according to Khanchuk & Kemkin (2003), essentially corresponds to the late Cretaceous-Paleocene mineralization stage. The present study considers ore associations of vein zones Tektonicheskaya, Kulisnaya and 8th Vostok.
Arsen'evskoe deposit. The Arsen'evskoe deposit is located in the western part of the ore region in rocks of the Zhuravlevsky turbidite terrane. It is related to the central Sikhote'Alin fault, a major tectonic structure controlling some large intrusions in the region. Magmatic rocks are represented by trachymonzonites and andesite-diorites formed during the Albian-Turonian, the Cenomanian and the Paleocene periods (Popovichenko 1989). From the Albian-Turonian, there are stocks of monzonite-porphyries and rare dykes of trachyandesite-basalts of the Ararat-Beresovsky complex (114-90 Ma), characterized in detail by Gladkov (1982, 1988) and Gonevchuk (2002). In the Cenomanian period, formation of widespread dykes of granodiorite-porphyry and rare granites and rhyolites occurred. Granodiorites and granites are present as fragments in explosive breccia of the Paleocene period. Their age, using K/Ar dating, is 80 ± 5 Ma (Gonevchuk 2002; using biotites) and 76 ± 4 Ma (Tomson et al. 1996; using the bulk rock). This igneous complex might be related to the Uglovsky volcanic-plutonic complex (100-80 Ma), which is traditionally related in this region with ore mineralization of the Late Cretaceous stage (predominant within the central part of the deposit). The Paleocene magmatic stage is present as highly aluminiferous andesites and andesite-basalts in dykes and subvolcanic bodies of ultrapotassic rhyolites. Most researchers distinguish two stages of tin mineralization within this deposit: the earliest Early-Late Cretaceous one (95-93 ± 8 Ma; Tomson et al. 1996) is represented by the so-called latitudinal zones; the later Upper Cretaceous stage is characterized by the formation of the main vein series from the economically interesting tin ores (60-70 Ma), dipping approximately to the south. There are different views about the formation of Fel'zitovaya 1 and Fel'zitovaya 2 zones, both connected with felsite dykes from the final stage.
Some researchers connect these zones with the main ore stage, considering felsite dykes as intrametalliferous (Nekrasov & Popov 1990), others consider it as an independent stage (Khanchuk et al. 2004). The early stage (Tourmalinovaya zone) is characterized by metasomatic tourmaline-cassiterite-sulphosalt ores. It is related to the Late Cretaceous Berezovsky-Ararat volcanic-plutonic complex. Ore bodies of this stage are linked with biotite metasomatites, and are represented by zones of fine-grained tourmalinites and streaky-disseminated sulphide-sulphosalt ores (Tourmalinovaya and Podruzhka zones, also known as latitudinal zones). They cut porphyry dykes and they are, in turn, cut and metamorphosed by other felsitic porphyry dykes and cassiterite-chlorite-silicate-sulphide veins. The mineralization of the early stage, in vein-shaped bodies with unclear contacts, was formed as a result of metasomatic substitution of tectonic jointing zones. Ore zones are composed of silicified rocks, together with sericite-quartz paragenesis and quartz-tourmaline metasomatites (Finashin 1986). Ore textures are massive, compact breccia and disseminated veins. The zones of the second stage form thick vein series of cassiterite-quartz composition (Yuzhnaya and Induktzionnaya zones). Both Fel'zitovaya zones are formed by a thick fracture structure of substitution, which occurred after the felsite dyke intrusion, and have a dominant quartz-cassiterite composition, with calcite and fine impregnation of sulphides. Metasomatic ores present segregations of massive sulphides involving pyrrhotite, chalcopyrite, sphalerite, stannite and impregnation of Bi-Ag-galena, arsenopyrite, pyrite, marcasite, wolframite, tetrahedrite, freibergite, sulphosalts of Pb and Sb, minerals of the series lillianite-gustavite, and native Bi and Sb. Non-metalliferous minerals are: quartz, tourmaline, chlorite, sericite and Mn-bearing carbonates. The main vein series of the second stage is represented by thick quartz-cassiterite-sulphide ore bodies placed along NW-SE fractures. They occur in weakly metamorphosed sedimentary rocks of the Jurassic-Early Cretaceous stage. The age of intrametalliferous dykes and metasomatites is 58-53 Ma, using K/Ar dating (Tomson et al. 1983). Filling veins have banded and zonal textures, resulting from successive deposition of cassiterite, chlorite and quartz. The deposition of the main mass of cassiterite occurred at the early stage of the ore process, associated with early quartz and chlorite. In zones of the main vein series, the following mineralization stages have been identified: (1) quartz-chlorite-cassiterite, (2) quartz-sulphide-sulphosalt, (3) quartz-fluorite-carbonate. Each stage cuts all the preceding stages.
Samples were collected from the cassiterite-chlorite-quartz vein zones called Yuzhnaya and Induktzionnaya. The mineralization of the third stage is represented by quartz-cassiterite-sulphide ores, connected with potassic rhyolitic dykes posterior to the basic magmatic rocks (Nekrasov & Popov 1990). They might be related to a young transform margin of Eocene age. The spatiotemporal association of these K-rhyolitic dykes with high-aluminiferous basalts suggests that they should be interpreted as a genetic association, of age 60-48 Ma (Finashin 1986). These rhyolites can be considered as a subvolcanic analogue of the leucogranites of the Uglovsky complex. The mineralization is composed of two veins, Fel'zitovaya, which forms a thick fracture structure with a vertical dip parallel to the main vein series, going from the volcano Samovar to the Intermontane fault. Fel'zitovaya is observed at 800 m depth and it crops out poorly as quartz-sulphide veins. Veins form a stock-shaped substitution deposit, and essentially have a quartz-cassiterite composition, with calcite and fine impregnation of arsenopyrite, pyrrhotite, galena, chalcopyrite and pyrite.
References
AITCHISON, J. 1986. The statistical analysis of compositional data. Chapman & Hall, London (2nd edn 2003, The Blackburn Press, Caldwell, NJ, USA).
AITCHISON, J. & GREENACRE, M. 2002. Biplots for compositional data. Applied Statistics, 51 (4), 375-392.
BARCELÓ-VIDAL, C. 1996. Mixturas de datos composicionales [Mixtures of Compositional Data]. PhD thesis, Universitat Politècnica de Catalunya, Spain [in Spanish].
BRODE, R. 1943. Chemical spectroscopy. Chapman and Hall, London.
CROOK, H. Y. 1935. Metallurgical spectrum analysis with visual atlas. Oxford University Press.
DAUNIS-I-ESTADELLA, J., BARCELÓ-VIDAL, C. & BUCCIANTI, A. 2006. Exploratory compositional data analysis. In: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 161-174.
EGOZCUE, J. J. & PAWLOWSKY-GLAHN, V. 2005. Groups of parts and their balances in compositional data analysis. Mathematical Geology, 37 (7), 795-828.
EGOZCUE, J. J. & PAWLOWSKY-GLAHN, V. 2006. Simplicial geometry for compositional data. In: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 145-159.
FAHRMEIR, L. & HAMERLE, A. 1984. Multivariate statistische Verfahren [Multivariate statistical analyses]. Walter de Gruyter, Berlin [in German].
FAURE, M., NATAL'IN, B. A., MONIÉ, P., VRUBLEVSKY, A. A., BORUKAIEV, CH. & PRIKHODKO, V. 1995. Tectonic evolution of the Anuy metamorphic rocks (Sikhote Alin, Russia) and their place in the Mesozoic geodynamic framework of East Asia. Tectonophysics, 241, 279-301.
FERSMAN, A. E. 1953. Izbrannye Trudy. Publishing House of the Academy of Sciences USSR, Moscow [in Russian].
FINASHIN, V. K. 1986. Tin deposits of Primorye. Far Eastern Scientific Centre of RAS, Vladivostok, Monograph [in Russian].
FINASHIN, V. K., LITAVRINA KOSENKO, W. I., OVCHAREK, E. S., GRACHEVA, A. A. & ARAKELYANTS, M. M. 1978. About geological age of tin mineralization from the Kavalerovo region. In: KOROSTELEV, P. G. & RADKEVICH, E. A. (eds) The ore mineralization of the Far East. Far East Scientific Centre, Vladivostok, 71-80 [in Russian].
GABRIEL, K. R. 1971. The biplot - graphic display of matrices with application to principal component analysis. Biometrika, 58, 453-467.
GLADKOV, N. G. 1982. Late Cretaceous and Paleocene tungsten-bearing and tin-bearing magmatic associations from the western part of the Kavalerovo region. In: BORSUK, A. M. (ed.) The tin-bearing and tungsten-bearing granitoids from some regions of the USSR. Nauka, Moscow, 202-232 [in Russian].
GLADKOV, N. G. 1988. Early - Late shoshonite latite - monzonite volcano-plutonic complex. In: KOVALENKO, V. I. & BOGATIKOV, O. A. (eds) Ore mineralization of magmatic associations. Nauka, Moscow, 98-105 [in Russian].
GOLOZUBOV, V. V. & KHANCHUK, A. I. 1995. The Taukhinsky and Zhuravlevsky terranes (southern Sikhote-Alin') are the fragments of the Early Cretaceous Asia margin. Far East Geology, 14, 13-25.
GONEVCHUK, V. G. 2002. Tin-bearing systems of the Far East: Magmatism and ore genesis. Dal'nauka, Vladivostok [in Russian].
GONEVCHUK, V., SEMENUAK, B. I. & KOROSTELEV, P. G. 2000. Khingano-Okhotsk metallogenic belt in the concept of terranes. In: KHANCHUK, A. I. (ed.) Ore deposits of continental margins. Far Eastern Branch of RAS, Russian Far East IAGOD group, Dal'nauka, Vladivostok, Issue 1, 35-54 [in Russian].
GORELIKOVA, N. V. 1988. Parageneses of trace elements in tourmalines from tin formations. Far Eastern Branch of RAS, Vladivostok, 127 p. [in Russian].
GORELIKOVA, N. V. & TCHIJOVA, I. A. 1997. Application of mathematical methods in creating forecasting recognition model for tin deposits and placers. In: PAWLOWSKY-GLAHN, V. (ed.) Proceedings of the IAMG'97. Part 2, Barcelona, 997-1008.
GOSSLER, F. 1942. Boden und Funkenspectrum des Eisens. Jena, Germany [in German].
HARRISON, G. R. 1949. Practical spectroscopy. Prentice Hall, New York.
KHANCHUK, A. I. 2000. Paleogeodynamic analysis of ore deposit formation in the Russian Far East. In: KHANCHUK, A. I. (ed.) Ore deposits of continental margins. Far Eastern Branch of RAS, Russian Far East IAGOD group, Dal'nauka, Vladivostok, Issue 1, 5-34 [in Russian].
KHANCHUK, A. I. & KEMKIN, I. V. 2003. The evolution of Japan Sea area in the Mesozoic period. The Herald of FEB RAS, 6, 94-108 [in Russian].
KHANCHUK, A. I., GONEVCHUK, V. G., BORTNIKOV, N. S. & GORELIKOVA, N. V. 2003. Paleogeodynamic model of the Sikhote-Alin tin-bearing system (Russia). In: ELIOPOULOS, D. G. ET AL. (eds) Mineral Exploration and Sustainable Development. Proceedings of 7th SGA meeting, Athens, 295-298.
KHANCHUK, A. I., GORELIKOVA, N. V., PAWLOWSKY, V. & TOLOSANA, R. 2004. New Data on Trace Element Distribution in Cassiterite from Tin Deposits of the Russian Far East. Doklady Earth Sciences, 399, 1146-1149.
KOKORIN, A. M., GONEVCHUK, V. G., KOKORINA, D. K. & OREKHOV, A. A. 2001. The Vysokogorskoe tin deposit: peculiar genesis and mineralization. In: KHANCHUK, A. I. (ed.) Ore deposits of continental margins. Far Eastern Branch of RAS, Russian Far East IAGOD group, Dal'nauka, Vladivostok, Issue 2, 156-170 [in Russian].
KRZANOWSKI, W. J. 1988. Principles of Multivariate Analysis: A user's perspective. Clarendon Press, Oxford.
MARTÍN-FERNÁNDEZ, J. A. 2001. Medidas de diferencia y técnicas de clasificación no paramétrica para datos composicionales [Measures of difference and non-parametric classification techniques for compositional data]. PhD thesis, Universitat Politècnica de Catalunya, Spain [in Spanish].
NEKRASOV, A. Ya. & POPOV, V. K. 1990. About stepped mechanism of ore matter concentration by the example of Arsen'evskoe tin deposit. Doklady of Academy of Sciences USSR, 315, 1437-1440 [in Russian].
POPOVICHENKO, V. V. 1989. The relation of the magmatism and ore mineralization at the Kavalerovo ore region. In: GONEVCHUK, V. G. & KOKORIN, A. M. (eds) The genetic models of deposits and forecasting at tin regions. Far Eastern Branch of RAS, Vladivostok, 45-57 [in Russian].
SHCHEKA, S. A., GORELIKOVA, N. V., NAUMOVA, V. V. & VRZHOSEK, A. A. 1987. The typomorphism of minerals as a prospecting criterion. In: FINASHIN, V. K. & GORELIKOVA, N. V. (eds) New data on the mineralogy of the Far East. Far Eastern Scientific Branch of RAS, Vladivostok, 9-33 [in Russian].
TOMSON, I. N., KAZANSKY, V. I. & DYUZHIKOV, O. A. 1983. The deep structure of the earth crust and the distribution of endogenic ore regions, fields, and deposits. In: CHUKHROV, F. V. (ed.) The deep structure and the formation conditions of endogenic regions, fields, and deposits. Nauka, Moscow, 25-47 [in Russian].
TOMSON, I. N., TANANAEVA, G. A. & POLOKHOV, V. P. 1996. The relation of various types of tin mineralization at the Southern Sikhote Alin (Russia). Geology of ore deposits, 38, 357-372 [in Russian].
On stability of compositional canonical variate vector components
R. A. REYMENT
Palaeozoology Section, Swedish Museum of Natural History, Box 50007, S-10405, Stockholm, Sweden (e-mail: [email protected])
Abstract: Canonical variate analysis (aka discriminant coordinates) is viewed from the aspect of Aitchisonian compositional data analysis, and the concept of stability in canonical vectors is examined in relation to their reification (i.e. providing canonical vector components with a practical interpretation). The log-ratio transformation was found to have computational and interpretational advantages over the centred log-ratio transformation. N. A. Campbell's application of the method of shrinkage estimators to multiple discrimination, with optimal retention of discriminatory power, is exemplified ad hoc by two cases, one drawn from quantitative sedimentology, the other from biomolecular palaeontology, with the intention of probing the effect of instability on interpreting the relative importance of standardized canonical variate coefficients in relation to the suppression of near-redundant directions of within-groups variation. Pronounced instability in canonical vectors may endanger the validity of an analysis.
Campbell & Reyment (1978) considered the problem of instability in canonical vectors in palaeontology using data on the Nigerian Cretaceous foraminiferal species Afrobolivina afra Reyment for exemplifying procedures. It was reported that where instability occurs, stability of the coefficients within the framework of repeated sampling can be achieved by suppression of near-redundant directions of the within-groups variation. The identification of redundant information in multivariate analysis has been slow to enter into the praxis of statistical methods (Campbell 1979, 1980a; Seber 1984). So far, interest in investigating reliability in canonical vectorial components has been centred on full-space data, with a preference for biological material. The question of stability for multivariate models in simplex space, such as occur in geostatistics, has been given scant attention, despite its potential significance for good analytical practice. In this note, the problem is illustrated for geochemical data in stratigraphical sedimentology, and for amino acids in biomolecular palaeontology. Both cases are typically compositional in nature. The presentation followed here is made at two levels: one for the interested geologist with little experience of multivariate statistics; the second, in Appendix A, for anybody desiring to undertake his/her own investigations and with a basic understanding of multidimensional analysis. It should be understood that this article does not profess to have the status of mathematical novelty because it is no more than an exemplification of existing techniques, rather well known in biometrics, to geological materials.
An important issue arising in canonical variate analysis concerning stability in canonical vector coefficients was studied by Campbell (1979, 1980a,b) who demonstrated that shrunken estimators in discriminant and canonical variate analysis (discriminant coordinates; multiple discrimination) may lead to improved stability of the resulting coefficients when the between-groups sum of squares for a particular principal component (more stringently, latent roots and vectors), defined by the within-groups covariance or correlation matrix, is small and the corresponding latent root is small. Campbell's engagement in the problem arose out of collaboration with marine biologists concerned with practical aspects of the commercial utilization of marine crustaceans and gastropods in Australian and New Zealand waters. The ideas underlying shrinkage constants in multivariate analysis derive from practical problems of stability that occur in multiple regression analysis (Goldstein & Smith 1974; Campbell & Furby 1994; Gui 1999). Granted that the same mathematical structure exists in discriminant functions, the application of ridge-type estimators to that field is a logical step and one that can be tentatively extended to multiple discriminant analysis.
From: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 59-66. 0305-8719/06/$15.00 © The Geological Society of London 2006.

The provenance of the data
In this paper two examples are considered. The first is a canonical variate analysis of the geochemistry of Lithuanian Silurian sediments using data kindly made available by Dr Donata Kaminskas, Geology Department, University of Vilnius. The second example is concerned with a biomolecular study of fossil and living brachiopods (Endo et al. 1995). A presentation of the observations used here for the Lithuanian material is given by Kaminskas & Malmgren (2003) in a statistical analysis of three sequences of Silurian sediments dominated by carbonates and mudrocks; in that paper, the compositional nature of the data was taken into account adequately. The brachiopod data concern intra-crystalline molecules isolated from the shells of four species collected from horizons extending over the last 1.47 million years. The material was analysed by immunoassay and amino-acid analysis. The species studied were Japanese occurrences of Coptothyris grayi (Davidson), Terebratalia coreanica (Adams & Reeve), Pictathyris picta (Dillwyn) and Laqueus rubellus (Sowerby). A detailed account of the methods of analysis is provided in Endo et al. (1995). The original statistical analysis was made on 14 amino acids determined on 52 specimens. For present purposes the dataset was reduced to eight parts, none of which has zero entries.
Overview of statistical procedures
Aitchison (1986) showed that a canonical variate analysis of compositions can be made using either the log-ratio covariance matrix or the centred log-ratio covariance matrix, in both cases for the within-groups and between-groups sums of squares and cross-products. The advantage of the former formulation is that the within-groups matrix W and the between-groups matrix B are positive definite and hence possess a normal inverse. The awkward aspect here is that a common divisor is required. The latter has the attractive property that all parts are represented in the formulation, but the centred log-ratio covariance matrix is singular and therefore requires a generalized inverse. Further reservations have been noted by Egozcue et al. (2003). For detailed discussions of Aitchison's statistical results in compositional data analysis, the reader is referred to specialist papers appearing in the theoretical section of this issue. It is recommended that the vector-stability analysis proceeds in the following steps.
1. Make a preliminary (graphical) inspection of the data for redundancy and other singularities.
2. Transform the frequencies for m parts to the desired log-ratio form (Aitchison 1986, 1997). This may be a simple log-ratio transformation or, where appropriate, a centred log-ratio transformation.
3. Perform a canonical variate analysis of k groups.
4. If the results indicate there could be a likelihood of marked instability in the canonical vectors, apply the method of ridge regression (shrunken estimators) in orientational mode, as described in Campbell (1979, 1980a) and Campbell & Reyment (1978), and examine the coefficients of the canonical vectors for significant change.
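The two transformations named in the steps above can be sketched in a few lines. This is a minimal illustration only, assuming NumPy; the function names `alr` and `clr` and the synthetic Dirichlet sample are assumptions made for the example and do not come from the paper. It also shows why the centred log-ratio covariance matrix needs a generalized inverse: its rows sum to zero, so the matrix loses one rank.

```python
import numpy as np

def alr(x, divisor=-1):
    """Simple (additive) log-ratio: log of each part over a common divisor part.

    The divisor column itself is dropped, so D parts give D - 1 log-ratios.
    """
    logx = np.log(np.asarray(x, dtype=float))
    ratios = logx - logx[..., [divisor]]
    return np.delete(ratios, divisor, axis=-1)

def clr(x):
    """Centred log-ratio: log of each part over the geometric mean of its row."""
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean(axis=-1, keepdims=True)

# Synthetic 5-part compositions (illustrative only, not data from the paper).
rng = np.random.default_rng(1)
comp = rng.dirichlet(np.ones(5), size=50)

y = alr(comp, divisor=0)   # 4 log-ratios, first part as common divisor
z = clr(comp)              # 5 centred log-ratios, each row sums to zero

rank_alr = np.linalg.matrix_rank(np.cov(y, rowvar=False))  # full rank (4 of 4)
rank_clr = np.linalg.matrix_rank(np.cov(z, rowvar=False))  # singular (4 of 5)
```

Both transformations are unaffected by rescaling the rows (percentages versus proportions), which is why the choice between them turns on the divisor and on the singularity rather than on units.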
Canonical variate analysis
The method of canonical variate analysis is usually considered to be a suitable multivariate statistical procedure for treating the problem posed by the simultaneous analysis of several sampling levels in, say, observations made over a stratigraphical sequence. In many applications of canonical variate analysis, the relative magnitudes of the coefficients for the variables (or parts in the case of compositions) standardized to unit variance by the pooled within-groups standard deviations often prove useful indicators of those coefficients that are likely to be influential for discrimination. The success of such an operation presupposes that the coefficients are stable over repeated sampling (Campbell 1980b; Campbell & Atchley 1981; Seber 1984). In accordance with accepted methodology, and adhering to Aitchison's (1986) formulation, the computation of canonical variates may be regarded as a two-stage rotational procedure. The first step rotates to orthogonal variables, which may be referred to as the principal components of the pooled samples. The second rotation corresponds to a principal component analysis of the group means in the space of the orthogonal variables. The first step transforms the within-groups dispersion ellipsoid into a concentration spheroid by scaling each latent vector by the square root of the corresponding latent root. Consider the variation between groups along each orthogonalized variable (i.e. each principal component). Where there is slight variation between groups along a particular direction, and the corresponding latent root is small, marked instability can be expected in some of the coefficients of the canonical variates, granted that the instability is under the sway of small changes in the properties of the data.
An approach that often serves to overcome the problem of instability in canonical coefficients is to add shrinkage, or ridge-type, constants (Goldstein & Smith 1974) to the latent roots before they are used to standardize the corresponding principal component. When an 'infinitely large' constant is added, this confines the solution to the subspace orthogonal to the vector, or vectors, affected by the addition. Experience dictates that when the between-groups sum of squares for a particular principal component is small (say, less than 5% of the total between-groups variation) and the corresponding latent root is also small (less than 2-3%) then shrinking of the principal component will often prove useful (Campbell 1980a). It is often observed that although some of the coefficients of the canonical vectors corresponding to the canonical variates of interest change magnitude and, moreover, often sign, shrinkage has little effect on the corresponding canonical roots, thus betokening that little discriminatory information has been lost. This indicates that those variables contributing most to the shrunken principal component have little influence for discrimination and may be considered for removal. Furthermore, variables (parts) with small standardized canonical variate coefficients can be vetted for exclusion. A simple account of applications of canonical variate analysis in biological (including palaeontological) studies is given in Reyment et al. (1984, chapter 7). A brief account of the method of shrinkage, as presented by Campbell (1979, 1980a), is given in Appendix A.
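The two-stage rotation with ridge-type constants described above can be sketched as follows. This is an illustrative reading of the procedure and not the author's program: the function names `within_between` and `shrunken_cva`, the use of raw sums-of-squares matrices in correlation mode, and the synthetic three-group data are all assumptions made for the example. Passing `np.inf` for a component plays the role of the 'infinitely large' constant and removes that direction from the solution.

```python
import numpy as np

def within_between(groups):
    """Within-groups (W) and between-groups (B) sums of squares and cross-products."""
    grand_mean = np.vstack(groups).mean(axis=0)
    p = groups[0].shape[1]
    W, B = np.zeros((p, p)), np.zeros((p, p))
    for x in groups:
        m = x.mean(axis=0)
        d = x - m
        W += d.T @ d
        dm = (m - grand_mean)[:, None]
        B += x.shape[0] * (dm @ dm.T)
    return W, B

def shrunken_cva(groups, k=None):
    """Canonical variate analysis in correlation mode with shrinkage constants k."""
    W, B = within_between(groups)
    s = np.sqrt(np.diag(W))
    W_star = W / np.outer(s, s)          # S^-1 W S^-1
    B_star = B / np.outer(s, s)          # S^-1 B S^-1
    e, U = np.linalg.eigh(W_star)        # first rotation: principal components
    order = np.argsort(e)[::-1]          # largest latent root first
    e, U = e[order], U[:, order]
    k = np.zeros_like(e) if k is None else np.asarray(k, dtype=float)
    U_star = U / np.sqrt(e + k)          # column i scaled by (e_i + k_i)^(-1/2)
    G = U_star.T @ B_star @ U_star       # between-groups matrix for the components
    f, A = np.linalg.eigh(G)             # second rotation: canonical roots and vectors
    order = np.argsort(f)[::-1]
    f, A = f[order], A[:, order]
    return {"latent_roots": e, "diag_G": np.diag(G), "roots": f,
            "a": A, "c": U_star @ A}

# Synthetic data: three groups, four log-ratio variables (illustrative only).
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, size=(30, 4)) for mu in (0.0, 0.5, 1.0)]

full = shrunken_cva(groups)                          # no shrinkage
shrunk = shrunken_cva(groups, k=[0, 0, 0, np.inf])   # suppress the smallest component
```

Comparing `full["roots"]` with `shrunk["roots"]`, and `full["c"]` with `shrunk["c"]`, mirrors the comparison made in the text: a small drop in the canonical roots together with a marked change in coefficients points to a near-redundant direction.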
The Lithuanian Silurian sediments
The samples derive from three boreholes drilled in connexion with a chemo-stratigraphical study of the Silurian sedimentary sequences of Lithuania, located geographically as outlined below. A detailed account of the geological aspects of the problem is given in Kaminskas & Malmgren (2003).
1. The deepest part of the sedimentary basin was penetrated by the Kurtuvėnai-161 borehole in northwestern Lithuania.
2. The intermediate beds were drilled by the Ledai-179 borehole in central Lithuania.
3. The uppermost beds of the sedimentary basin were encountered by the Jocionys-299 borehole in southeastern Lithuania.
The oxides determined were (Paskevicius 1997): SiO2, Al2O3, Fe2O3, MnO2, MgO, CaO, Na2O, K2O, TiO2, P2O5. A preliminary appraisal of the data array indicated that two of the oxides were not contributing anything essential to the analysis and they were therefore deleted, after which the constant row-sum constraint was re-established. The reduced set of eight parts then lacked the columns for manganese and phosphate. Recall that the constituent categories of the geochemical array are referred to as 'parts' since they are not variables in the accepted statistical sense. This usage is to underline the fact that deletion of one or more of the proportions necessitates reinstating the constant-sum condition, which automatically alters the covariance relationships between the parts of each of the constituent rows. Should this step be neglected, then the relationships between parts are rendered spurious. In the case of what may be referred to as true variables, this restriction does not apply. In the ensuing analysis, the common divisor for the log-ratios was taken to be SiO2. The geochemical analysis of these data is given in Kaminskas & Malmgren (2003), who also report multivariate statistics for comparisons. The within-groups and between-groups matrices of sums of squares and cross-products of the log-ratios in standardized form (correlation mode) are listed in Table 1. There are several very high correlations which, according to the results of Campbell (1979, 1980a), can set the stage for instability in canonical vector coefficients. The latent roots and vectors for W* are given in Table 1. There are two large latent roots. The two smallest latent roots do not differ greatly from zero. Comparison of these smallest roots with the appropriate values of diag G shows that the smallest latent root is connected to a larger value than is the sixth latent root. The corresponding latent vector (principal component) represents mainly a bipolar relationship between parts 1 and 2. The third latent vector, which connects to the smallest value of diag G, weighs parts 3 and 5 against part 4. Both of these directions could well be candidates for shrinkage. In order to test this, the two smallest latent vectors were suppressed, in turn (Table 2). The coefficients for a_1^U highlight the contributions from principal components 2 and 4 to the first canonical variate. The main contributing principal component of a_2^U to the second canonical variate is the seventh (Table 2).
Shrinking the smallest principal component has a notable effect on the sum of the canonical roots, and the canonical vectors are perturbed rather strongly. Shrinking the third principal component has virtually no influence on the sum of the canonical roots, which implies that discrimination power is undiminished as a result of the shrinkage exercise. This is even more marked for shrinkage of the sixth principal component. The effect of shrinking on the matching canonical vectors is slight, with the exception of part 1 in the first canonical vector. The conclusion that presents itself here is that shrinkage of the sixth principal component can be expected to improve the statistical quality of the analysis with respect to reification.
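The re-closure applied in this section after the manganese and phosphate columns were deleted can be sketched as below. This is a minimal sketch assuming NumPy; the helper name `subcomposition` and the oxide percentages are hypothetical and are not taken from the Lithuanian dataset. The example also checks that log-ratios against SiO2 are the same before and after re-closure, which is what makes the log-ratio form safe to use on the reduced set of parts.

```python
import numpy as np

def subcomposition(x, keep):
    """Select the parts listed in `keep` and re-close each row to a unit sum."""
    sub = np.asarray(x, dtype=float)[..., keep]
    return sub / sub.sum(axis=-1, keepdims=True)

oxides = ["SiO2", "Al2O3", "Fe2O3", "MnO2", "MgO",
          "CaO", "Na2O", "K2O", "TiO2", "P2O5"]

# Hypothetical oxide percentages for one sample (illustrative values only,
# NOT taken from the Lithuanian data).
sample = np.array([55.0, 14.0, 6.0, 0.1, 3.0, 12.0, 1.5, 4.0, 0.8, 0.2])

# Drop the two deleted oxides and restore the constant-sum constraint.
keep = [i for i, name in enumerate(oxides) if name not in ("MnO2", "P2O5")]
reduced = subcomposition(sample, keep)

# Log-ratios with SiO2 as divisor, computed before and after re-closure.
before = np.log(sample[keep] / sample[0])
after = np.log(reduced / reduced[0])
```

The proportions themselves change when the row is re-closed, whereas `before` and `after` agree to rounding error.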
Brachiopod biomolecules study
Several species of brachiopods (cf. Introduction) were made the object of this study. Using modern
Table 1. Log-ratio matrices for the Lithuanian data and spectrum of W* (n = 221) in correlation mode

The input within-groups matrix, W* (upper triangle) and between-groups matrix B* (lower triangle)
Part         1         2         3         4         5         6         7
1       0.0190    0.9917    0.1333   -0.0583    0.7437    0.9422    0.9870
2       0.0251    0.0331    0.1365   -0.0446    0.7412    0.9332    0.9905
3      -0.1453   -0.1936    1.2941    0.6040    0.0533   -0.0362    0.1904
4      -0.0939   -0.1233    0.6723   -0.4742   -0.1712   -0.2762   -0.0060
5       0.0778    0.1033   -0.6628   -0.3675    0.3438    0.7276    0.7543
6       0.0183    0.0224   -0.1593   -0.0859    0.0822    0.0197    0.9065
7       0.0180    0.0238   -0.1481   -0.0860    0.0775    0.0185    0.0176

Latent roots of W*
             1         2         3         4         5         6         7
        4.5300    1.6563    0.4047    0.3377    0.0573    0.0081    0.0060

Latent vectors of W*
Parts       P1        P2        P3        P4        P5        P6        P7
1      -0.4634    0.0437    0.1813   -0.1090   -0.1149    0.8359    0.1631
2      -0.4626    0.0519    0.1956   -0.0933   -0.2430   -0.4779    0.6700
3      -0.0513    0.6919   -0.5580   -0.4522    0.0460   -0.0014    0.0266
4       0.0595    0.6984    0.4818    0.4863    0.1984   -0.0129   -0.0231
5      -0.3893   -0.0567   -0.5947    0.6984    0.0528    0.0083    0.0315
6      -0.4487   -0.1308    0.1139   -0.2161    0.8186   -0.1487   -0.1721
7      -0.4607    0.0924    0.1372   -0.0632   -0.4619   -0.2248   -0.7018

Diag G = between-groups sums of squares for principal components
                       P1        P2        P3        P4        P5        P6        P7
Diag G             0.0506    0.9417    0.1043    1.0064    0.4938    0.0287    0.3715
Percentage of trace 1.687    31.420     3.480    33.580    16.477     0.959    12.397
Trace G = 2.9970

Bold italics denote values of interpretational consequence. Key for SiO2 log-ratios: Al(1), Fe(2), Mg(3), Ca(4), Na(5), K(6), Ti(7).
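The rule of thumb quoted earlier (between-groups sum of squares below about 5% of trace G, and a latent root below about 2-3%) can be applied to the Table 1 values with a short sketch. The function name `shrinkage_candidates` is hypothetical, and reading the 2-3% threshold as a percentage of the sum of the latent roots is an assumption made here; the latent roots and diag G values are those listed in Table 1, in the column order P1 to P7.

```python
import numpy as np

def shrinkage_candidates(latent_roots, diag_g, root_limit=3.0, between_limit=5.0):
    """Flag principal components whose latent root and between-groups
    sum of squares are both small, as percentages of their totals."""
    e = np.asarray(latent_roots, dtype=float)
    g = np.asarray(diag_g, dtype=float)
    e_pct = 100.0 * e / e.sum()
    g_pct = 100.0 * g / g.sum()
    flagged = np.flatnonzero((e_pct < root_limit) & (g_pct < between_limit)) + 1
    return flagged.tolist(), e_pct, g_pct

# Values from Table 1 (Lithuanian data, correlation mode).
lith_roots = [4.5300, 1.6563, 0.4047, 0.3377, 0.0573, 0.0081, 0.0060]
lith_diag_g = [0.0506, 0.9417, 0.1043, 1.0064, 0.4938, 0.0287, 0.3715]

candidates, roots_pct, between_pct = shrinkage_candidates(lith_roots, lith_diag_g)
```

Under these assumed thresholds only the sixth principal component is flagged, in line with the conclusion drawn in the text; the seventh has the smallest latent root but carries about 12% of trace G.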
techniques of molecular biology, it was possible to obtain detailed information on amino acids contained in preserved shell material (Endo et al. 1995). The results reported briefly below form part of the imaginative work in molecular palaeontology being carried out by Professor Kazuyoshi Endo and his associates at Tsukuba University, Japan. The log-ratio covariance matrix W for the amino acids is listed in Table 3. Inspection of the entries shows that w_{6,6} and w_{7,7} are by far the largest. The palaeobiological and ecostratigraphical
Table 2. Standardized log-ratio canonical vectors for the Lithuanian data, including shrunken estimates: vectors adjusted to standard deviations

Principal component
             PC1      PC2      PC3      PC4      PC5      PC6      PC7   Canonical roots
a_1^U      0.145    0.626   -0.065   -0.597    0.413    0.063   -0.223   2.397
a_2^U     -0.026   -0.052   -0.396   -0.505   -0.397    0.178    0.641   0.600

Adjusted canonical variate vectors, log-ratios for parts
                               1        2        3        4        5        6        7   Canonical roots
c_1^U                     -0.009   -2.600    0.855    0.196   -0.706    1.931    1.179   2.397
c_2^U                      3.177    4.946    0.858   -1.281   -0.035   -2.822   -5.561   0.600
c_1^GI (0, ..., 0, ∞)      0.565   -0.727    0.981    0.080   -0.639    1.384   -0.894   2.288
c_2^GI (0, ..., 0, ∞)      2.301   -0.624    0.622   -1.477   -0.246   -2.278    0.549   0.337
c_1^GI (0, ..., ∞, 0)     -0.602   -2.352    0.854    0.210   -0.713    2.049    1.359   2.388
c_2^GI (0, ..., ∞, 0)      1.534    5.954    0.888   -1.273   -0.063   -2.598   -5.176   0.581
Table 3. The within-groups matrix W for log-ratio data, its latent roots and vectors and diag G for the brachiopod amino acids (n = 52)

Part          1          2          3          4          5          6          7
1        3.8060     2.2520     0.2660     0.9430     3.5450     4.9210     0.0120
2        2.2520     1.9600     1.0190     0.7190     1.8670     3.7130    -0.1410
3        0.2660     1.0190     3.7660     1.0980    -1.4410    -1.0780     1.1460
4        0.9430     0.7190     1.0980     4.4320     0.5170     0.2730     5.8970
5        3.5450     1.8670    -1.4410     0.5170     9.8330    12.3510    -1.7170
6        4.9210     3.7130    -1.0780     0.2730    12.3510    63.8340   -11.5440
7        0.0120    -0.1410     1.1460     5.8970    -1.7170   -11.5440    41.0870

Latent roots of W
              1          2          3          4          5          6          7
        71.7529    37.7762     8.8180     5.4903     2.6884     1.9037     0.2885

Latent vectors of W
Parts         1          2          3          4          5          6          7
1       -0.0773    -0.0679    -0.4435    -0.3022    -0.2873     0.5924    -0.5177
2       -0.0561    -0.0451    -0.2274    -0.3549    -0.2843     0.2226     0.8287
3        0.0239    -0.0231     0.1003    -0.6952    -0.3017    -0.6087    -0.2098
4        0.0249    -0.1701    -0.1044    -0.4734     0.8513     0.1000     0.0262
5       -0.1973    -0.1057    -0.8140     0.2497     0.0780    -0.4675     0.0167
6       -0.9068    -0.3358     0.2533     0.0071    -0.0074     0.0197    -0.0150
7        0.3583    -0.9165     0.0620     0.1138    -0.1214    -0.0113     0.0016

Diag G = between-groups sums of squares for principal components
                         1          2          3          4          5          6          7
Diag G              0.0941     0.2361     0.1007     0.2450     0.3718     0.0417     0.3345
Percentage of trace  6.607     16.580      7.071     17.206     26.111      2.930     23.495
Trace G = 1.42405

Here and elsewhere in the text, the amino acids designated 1-8 (8 is the divisor) correspond to the designations D, E, G, T, A, P, Y, I for amino acids in Endo et al. (1995).
reasons for this are that the amino acids involved are much affected by degeneration over time. The latent roots and vectors of W are presented in Table 3 together with the diagonal for the between-groups sums of squares matrix for the principal components, diag G. The first latent vector is dominated entirely by amino acid P and the second latent vector by amino acid Y. The smallest value of diag G, d6, is just 2.93% of the trace of G, whereas the seventh entry is actually the greatest of all, being 23.5% of the trace of G. The principal component corresponding to d6 is mainly an expression of bipolar covariation in the log-ratio vector component amino acid D (0.6) and vector component amino acid G (-0.6). These several relationships of small between-groups sums of squares
Table 4. Standardized canonical vectors for the seven log-ratios, and shrunken estimates, for the brachiopod data: vectors adjusted to standard deviations

Principal component
             PC1      PC2      PC3      PC4      PC5      PC6      PC7   Canonical roots
a_1^U     -0.283   -0.524   -0.210   -0.062    0.391   -0.074    0.662   0.723
a_2^U      0.263    0.268    0.352    0.465    0.693   -0.190    0.049   0.436

Adjusted canonical variate vectors, log-ratios of parts
                               1        2        3        4        5        6        7   Canonical roots
c_1^U                     -0.691    0.974   -0.286    0.264    0.131    0.020    0.032   0.723
c_2^U                     -0.368   -0.176   -0.189    0.236    0.041   -0.019   -0.048   0.436
c_1^GI (0, ..., ∞, 0)     -0.663    0.991   -0.312    0.261    0.110    0.021    0.034   0.720
c_2^GI (0, ..., ∞, 0)     -0.339   -0.103    0.288    0.231   -0.011   -0.012   -0.043   0.424
corresponding to principal component 6 and the small latent root for W tend to be amenable to shrinking of the principal component. The coefficients for a_1^U highlight the contribution from principal components 2 and 7 to the first canonical variate (Table 4). The main contributing principal component of a_2^U to the second canonical variate is the fifth. The effect of shrinking the smallest principal component (which is connected to the largest value in diag G) is to perturb the canonical variates and to occasion a serious loss of information in the canonical roots. Shrinkage of the sixth principal component (Table 4) has little effect on the components of all of the canonical vectors and the sums of the canonical roots are likewise little influenced. This indicates that the principal component 6 represents a redundant direction and can be eliminated with very little loss of discriminatory power. The practical effect of such a step is generally to achieve greater stability in the canonical vectors and an improvement with respect to their interpretation.
Comments
The two examples drawn from published investigations presented briefly in this paper serve to exemplify the application of the shrinkage technique for stabilization in the canonical variate analysis of compositional data. One example is for the correlation mode (the Lithuanian sediments), the other uses the theoretically perhaps more faithful representation in covariance mode (the brachiopods). In both examples it could be demonstrated that redundant directions of variation have a negative influence on the reification of the canonical vectors and hence the reliability of an analysis. It could also be shown that where there is a marked stratigraphical component in a dataset, such as occurs in long borehole sequences, this can influence the outcome of an analysis because of trend-weighting. The log-ratio transformation was found to be more reliable for the purposes of the shrinkage stabilizing technique. The centred log-ratio covariances are less easy to work with because of the practical difficulty of overcoming and deciphering the effect of the singularity of the input matrices, as well as other factors (cf. Egozcue et al. 2003). Campbell (1979) was careful to point out that shrinkage is not the only way in which stability in discrimination can be approached under the circumstances applying in this paper and he provided alternatives. From the aspect of effective data analysis, selection of parts based on relative magnitudes and stability of coefficients may be preferable to a stepwise procedure. Finally, the successful application of the shrinkage technique often necessitates a considerable amount of work. It may therefore be of value to point out that the author's experience has been that instability in canonical vectors due to redundant directions of variation is not very common in all aspects of applied canonical variate analysis and this seems to be true of geology. This notwithstanding, advice to the analyst is that a trial test of the data is a safety investment that should not be neglected if one of the main goals of a study is the practical problem of the interpretation of the coefficients of the canonical vectors.

Dr D. Kaminskas is thanked for making the Lithuanian data available for analysis. Professor B. A. Malmgren helpfully directed attention to these data. The Trustees of the Swedish Museum of Natural History generously provided working facilities. The author is grateful to Professor K. Endo for continued advice concerning problems in the sphere of molecular palaeontology. A special expression of gratitude is extended to Professor Vera Pawlowsky-Glahn for constructive comments, advice and information concerning ongoing research.

Appendix A: Review of the method of shrinkage
It is assumed here that multivariate procedures available for full space carry over, at least approximately, to simplex space (Aitchison 1986, p. 202). The within-groups sums of squares and cross-products matrix W on n_w degrees of freedom and the between-groups matrix of sums of squares and cross-products B are computed in the usual manner. Here, however, the data matrix is composed preferably of the log-ratio transformed observations (Aitchison 1983, 1986). Campbell (1979, 1980a) recommended that the within-groups matrix be expressed in correlation form with similar scaling for B. This is not mandatory, however. Aitchison (1997) proved that the correlation coefficient is not defined in simplex space, at least from the aspect of statistical interpretation of the correlation coefficient. In the present connexion, the use of correlations is strictly geometric for achieving spherical distributions with the end in view of bettering analytical stability (Campbell 1980a,b). The following review is in terms of the correlation mode. Hence:
$$W^* = S^{-1} W S^{-1} \tag{A1}$$
and
$$B^* = S^{-1} B S^{-1} \tag{A2}$$
where $S$ denotes $(\operatorname{diag} W)^{1/2}$ and the asterisks indicate that the correlation mode applies. Here, and elsewhere in the text, an apostrophe denotes a transposed matrix. The latent roots $e_i$ and latent vectors $u_i$ of $W^*$ are then found. The corresponding orthogonalized variables are the principal components, with $E = \operatorname{diag}(e_1, \ldots, e_v)$ and $U = (u_1, \ldots, u_v)$. Consequently,
$$W^* = U E U'. \tag{A3}$$
Usually, the latent vectors are scaled by the square root of their latent roots; this is a transformation for achieving within-groups sphericity, which is a manipulation for promoting stability. Shrunken estimators are constructed by adding shrinkage constants $k_i$ to the latent roots $\operatorname{diag} E$ before scaling the latent vectors (Goldstein & Smith 1974; Campbell 1980a). Let the shrunken estimators be denoted as
$$K = \operatorname{diag}(k_1, \ldots, k_v). \tag{A4}$$
In a full-scale study, one would wish to test the effect of using different values of $k$. In the examples, a very large value has been used (in effect, one that is functionally infinite). Smaller values can be tried to test the consequence of reducing the between-group contribution from a particular component. Define now $U^*$, the matrix of latent vectors scaled by the latent roots inflated by the chosen shrunken estimators:
$$U^* = U(E + K)^{-1/2} = U^*_{(k_1, \ldots, k_v)} \tag{A5}$$
Next form, in the usual manner, the between-groups matrix of sums of squares and cross-products in the space of the within-groups principal components, that is, form
$$G_{(k_1, \ldots, k_v)} = U^{*\prime}_{(k_1, \ldots, k_v)} B^* U^*_{(k_1, \ldots, k_v)} \tag{A6}$$
Set $d_i$ equal to the $i$th diagonal element of $G$. This diagonal element, which is the between-groups sum of squares for the $i$th principal component, is an important diagnostic tool. The usual canonical vectors $c^U$ of canonical variate analysis are yielded by
$$c^U = U^*_{(0, \ldots, 0)} a^U \tag{A7}$$
Generalized shrunken (i.e. generalized ridge) estimators are determined directly from the latent vectors $a^s$ of $G_{(k_1, \ldots, k_v)}$ with
$$c^s = U^*_{(k_1, \ldots, k_v)} a^s \tag{A8}$$
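The algebra of equations (A1) to (A8) can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the program of Reyment & Savazzi (1999) or Campbell's CSIRO code: the function name `shrunken_canonical_vectors` and its return values are hypothetical, the latent roots are assumed to be sorted in decreasing order so that the shrinkage constants line up with them, and the returned coefficients apply to variables standardized by $S$ (correlation mode).

```python
import numpy as np

def shrunken_canonical_vectors(W, B, k):
    """Illustrative sketch of Eqs (A1)-(A8); names are hypothetical.

    W, B : (v, v) within- and between-groups SSCP matrices.
    k    : length-v shrinkage constants (np.inf suppresses a component),
           ordered to match the latent roots sorted in decreasing order.
    """
    W = np.asarray(W, dtype=float)
    B = np.asarray(B, dtype=float)
    k = np.asarray(k, dtype=float)

    # (A1), (A2): scale to correlation mode with S = (diag W)^(1/2).
    s_inv = 1.0 / np.sqrt(np.diag(W))
    W_star = W * np.outer(s_inv, s_inv)
    B_star = B * np.outer(s_inv, s_inv)

    # (A3): latent roots e and latent vectors U of W*, largest root first.
    e, U = np.linalg.eigh(W_star)
    order = np.argsort(e)[::-1]
    e, U = e[order], U[:, order]

    # (A5): U* = U (E + K)^(-1/2); column i is divided by sqrt(e_i + k_i).
    U_star = U / np.sqrt(e + k)

    # (A6): between-groups matrix in the space of the principal components.
    G = U_star.T @ B_star @ U_star
    d = np.diag(G).copy()          # diagnostic d_i

    # Latent roots f and vectors A of G, then (A7)/(A8): c = U* a.
    f, A = np.linalg.eigh(G)
    order = np.argsort(f)[::-1]
    f, A = f[order], A[:, order]
    C = U_star @ A
    return e, d, f, C
```

With all constants set to zero the roots `f` coincide with the latent roots of $W^{-1}B$, which gives a convenient check on the scaling; setting the last constant to `np.inf` reproduces the total suppression of the smallest component used in the examples.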
The coefficient $a^U_i$ involves $d_i / e_i^{1/2}$. This implies that where the latent root is small, the value of $a^U_i$ is given by the ratio of two small quantities and hence can be expected to fluctuate widely from sample to sample. A generalized inverse solution results when $k_i = 0$ for $i \le r$ and $k_i = \infty$ for $i > r$ (i.e. the first $r$ columns of $U^*$). This gives $a^{GI}_i = a^U_i$ for $i \le r$ and $a^{GI}_i = 0$ for $i > r$. The generalized inverse solution results from forming
$$G_{(0, \ldots, 0; \infty, \ldots, \infty)} = U^{*\prime}_r B^* U^*_r \tag{A9}$$
where $U^*_r$ corresponds to the first $r$ columns of $U^*_{(0, \ldots, 0)}$. The generalized canonical vectors are given by $c^{GI} = U^*_r a^{GI}$, where $a^{GI}$, of length $r$, corresponds to the first $r$ elements of $a^U$. (Note that $c^{GI} = c^s_{(0, \ldots, 0; \infty, \ldots, \infty)}$.)
Practical considerations
In practice it is frequently found that marked instability in vector coefficients is associated with a small value of a latent root $e_v$ and a correspondingly small diagonal element $d_v$ of $G$. A useful rule of thumb is to examine the contribution of $d_v$ to the total group separation, to wit, trace($W^{-1}B$). In cases where the ratio of $d_i$ to trace $G$ is small, and/or the corresponding ratio of canonical roots is small (<0.05), then little loss of the power of discrimination will result from excluding one or more of the smallest latent vectors or, equivalently, from suppressing the corresponding principal component. Total suppression is not always a necessity and some smaller value may be chosen such that the effect of the principal component is merely reduced (Campbell & Reyment 1978; Campbell 1979). Campbell (1979) reported that a generalized inverse solution with $r = v - 1$ frequently yields stable estimates. Reyment & Savazzi (1999) provide a compiled computer program for carrying out the computations encompassed by the foregoing algebra. Weihs (1995) has taken up the subject of the graphical representation of canonical variate results. As a complement to his treatise, Campbell constructed a very comprehensive interlocking computer program for the Department of Mathematics and Statistics, CSIRO, Wembley, Western Australia, for canonical variate analysis, which includes shrinkage, robust estimation when the covariances are not homogeneous, M-estimators, graphical procedures and much more. Recent results by Egozcue et al. (2003) on isometric log-ratio transformations are clearly worth extension within the context of stability of latent vector coefficients in compositional data analysis.
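The rule of thumb on the contribution of each $d_i$ to trace $G$ can be written as a small screening helper. This is a sketch only: the function name is hypothetical, and applying the 0.05 cut-off (which the text attaches to the ratio of canonical roots) to the shares $d_i / \operatorname{trace} G$ is an assumption made for illustration rather than a prescription from the source.

```python
def components_with_small_contribution(d, threshold=0.05):
    """Flag principal components whose share d_i / trace(G) is below threshold.

    d         : diagonal elements d_i of G (between-groups sums of squares).
    threshold : illustrative cut-off; 0.05 is an assumption, not a fixed rule.
    Returns (indices of flagged components, list of shares).
    """
    total = sum(d)                      # trace(G)
    shares = [di / total for di in d]   # contribution of each component
    flagged = [i for i, s in enumerate(shares) if s < threshold]
    return flagged, shares
```

A flagged component is only a candidate for shrinkage; the text recommends also inspecting the size of the corresponding latent root and the stability of the resulting coefficients before suppressing it.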
References
AITCHISON, J. 1983. Principal component analysis of compositional data. Biometrika, 70, 57-65.
AITCHISON, J. 1986. The statistical analysis of compositional data. Chapman & Hall, London.
AITCHISON, J. 1997. The one-hour course in compositional data analysis or compositional data analysis is easy. In: PAWLOWSKY-GLAHN, V. (ed.) Proceedings of the Mathematical Geology third Annual Conference, Barcelona (September, 1997), 3-35. Vera.
CAMPBELL, N. A. 1979. Canonical variate analysis: some practical aspects. PhD thesis, University of London.
CAMPBELL, N. A. 1980a. Shrunken estimators in discriminant and canonical variate analysis. Applied Statistics, 29, 5-14.
CAMPBELL, N. A. 1980b. Robust procedures in multivariate analysis. I: robust covariance estimation. Applied Statistics, 29, 231-237.
CAMPBELL, N. A. & ATCHLEY, W. R. 1981. The geometry of canonical variate analysis. Systematic Zoology, 30, 268-280.
CAMPBELL, N. A. & FURBY, S. L. 1994. Variable selection along canonical vectors. Australian Journal of Statistics, 36, 177-183.
CAMPBELL, N. A. & REYMENT, R. A. 1978. Discriminant analysis of a Cretaceous foraminifer using shrunken estimators. Cretaceous Research, 1, 207-211.
EGOZCUE, J. J., PAWLOWSKY-GLAHN, V., MATEU-FIGUERAS, G. & BARCELO-VIDAL, C. 2003. Isometric log-ratio transformations for compositional data-analysis. Mathematical Geology, 35, 279-300.
ENDO, K., WALTON, D., REYMENT, R. A. & CURRY, G. B. 1995. Fossil intra-crystalline biomolecules of brachiopod shells: diagenesis and preserved geo-biological information. Organic Geochemistry, 23, 661-673.
GOLDSTEIN, M. & SMITH, A. F. M. 1974. Ridge type estimators for regression analysis. Journal of the Royal Statistical Society, B36, 284-291.
GUI, Q. 1999. Generalized shrunken-type robust estimation. Journal of Surveying Engineering, 125, 177-184.
KAMINSKAS, D. & MALMGREN, B. A. 2003. Comparison of pattern-recognition techniques for classification of Silurian sedimentary rocks from Lithuania based on geochemical data. Norwegian Journal of Geology, 84, 117-124.
PASKEVICIUS, J. 1997. Geology of the Baltic Republics. Vilnius University and The Geological Survey of Lithuania.
REYMENT, R. A. 1991. Multidimensional palaeobiology. Pergamon Press, Oxford (Appendix by L. F. MARCUS).
REYMENT, R. A. & SAVAZZI, E. 1999. Aspects of multivariate statistical analysis in geology. Elsevier, Amsterdam.
REYMENT, R. A., BLACKITH, R. E. & CAMPBELL, N. A. 1984. Multivariate morphometrics (2nd edn). Academic Press, London.
SEBER, G. A. F. 1984. Multivariate observations. Wiley and Sons, New York.
WEIHS, C. 1995. Canonical discriminant analysis: comparison of resampling methods and convex hull approximations. In: KRZANOWSKI, W. J. (ed.) Recent advances in descriptive multivariate analysis. Oxford Science Publishing, Oxford, 35-50.
Compositional changes in a fumarolic field, Vulcano Island, Italy: a statistical case study
A. BUCCIANTI, F. TASSI & O. VASELLI
Dipartimento di Scienze della Terra, Università degli Studi di Firenze, Via G. La Pira 4, 50121 Firenze, Italy (e-mail: [email protected])
Abstract: The identification of compositional changes in fumarolic gases of active volcanic areas
is one of the most important objectives in monitoring programmes. Together with information from seismic data and deformation, it provides key data for the formulation and management of emergency plans for populations living near active volcanoes. Chemical data obtained from different fumaroles collected at Vulcano Island (Sicily, southern Italy) between 2000 and 2004 have been analysed statistically. The methodology has identified parameters able to elucidate the structure of the complex fumarolic field of Vulcano in the investigated span of time, notwithstanding the high data variability. The southern portion of the Tyrrhenian Sea was affected by an earthquake (M = 5.8, 40 km NE of Palermo) in December 2002. Abrupt outgassing on the island of Panarea occurred in November 2002 and Stromboli was significantly active from December 2002 to July 2003. The great geological instability of the area is thought to have had an influence on the variability shown by the data. The time-dependent variations in the components of the data have been investigated using log-ratios. The H2O-HCl-SO2 subcomposition, for a limited set of fumaroles, has been used to check whether log-contrast principal components can be considered as a monitoring tool of volcanic activity. Results obtained indicate that the compositional changes are a complex function of time, chemistry, temperature and space. ANOVA (analysis of variance) of log-ratios for which there is no time dependence has elucidated components subject to significant spatial variations across the fumarole field, due to changes in redox conditions, and components dominated by random variations.
The present study was designed to apply statistical methodologies in the investigation of the chemistry of a complex fumarolic field as exemplified by Vulcano Island. The involvement of several possible sources in the chemical composition of volcanic gases and their modification in time and space often requires the use of explorative and inferential statistical tools. In this way systematic behaviours related to time and/or space may be distinguished and significant geochemical parameters able to characterize the evolution of the volcanic system identified. Furthermore, statistical methodologies are useful in characterizing the high chemical variability that potentially affects these systems, where several phenomena tend to combine and overlap, groups of observations with similar compositional behaviour are present, or the compositional patterns depend on time and/or temperature (Mazor et al. 1988; Chiodini et al. 1995; Montegrossi et al. 2001; Leeman et al. 2005). Adequate models of the compositional evolution of the fluids need to take into account the features of the sample space. The special and intrinsic feature of compositional data is that the proportions of a composition are
naturally subject to a unit-sum constraint. Volcanic gas chemistry is typically a composition where the parts are expressed in μmol mol⁻¹. In the investigation of volcanic gas chemistry the unit-sum constraint has been widely ignored or wished away. Consequently, inappropriate standard statistical methods, devised for unconstrained data, have been used, with possible negative consequences for the interpretation of the information contained in a gas chemical composition. Here, the original, largely intuitive, approach to compositional data analysis formulated by Aitchison (1986), based on the use of log-ratios, has been followed. Recently, several theoretical developments point to a mathematically correct statistical approach to managing constrained data (Aitchison 1999; Pawlowsky-Glahn & Egozcue 2001; Von Eynatten et al. 2002; Egozcue et al. 2003). The power of the log-ratio approach has, furthermore, been demonstrated in the investigation of several problems in the Earth Sciences (Reyment & Savazzi 1999; Pawlowsky-Glahn & Buccianti 2002; Von Eynatten et al. 2003a,b; Buccianti & Esposito 2004; Buccianti & Pawlowsky-Glahn 2005).
From: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 67-77. 0305-8719/06/$15.00 © The Geological Society of London 2006.
Volcanological and geochemical background
The Aeolian archipelago (Sicily, southern Italy) is situated in the southeastern part of the Tyrrhenian Sea, some tens of kilometres off the northern coast of Sicily. It is related to the complex geodynamic situation of the Mediterranean area, characterized by the collision between the African and Eurasian plates, with N-S to NNW-SSE trending convergence over the past 70 Ma (Frepoli et al. 1996). Vulcano is the southernmost of the seven volcanic islands that form the Aeolian archipelago and it is one of Italy's active volcanoes (Barberi et al. 1974; Beccaluva et al. 1985; DeAstis et al. 1977). Vulcano, together with Lipari, is situated along a NW-SE tectonic alignment which transversely intersects the arc, coinciding with the structural line known as the Tindari-Letojanni lithospheric fault, with right-lateral strike-slip movements. The last eruption of Vulcano dates back to 1888-1890, with the main event located in the crater named La Fossa. Since then, the main fumarolic field, located on the northern inner and outer flanks of the La Fossa crater and covering an area of about 9000 m², has exhibited particularly high variability with regard to total gas emission rates and the location and composition of the gas discharges (Chiodini et al. 1992). During the last 100 years, the outlet fumarole temperatures have varied from 600 °C in 1923 (Sicardi 1940) to less than 250 °C at the end of 1970 (Chiodini et al. 1992). Since 1978 fumarolic activity has been modified strongly, with temperatures increasing up to 700 °C (Chiodini et al. 1993; Martini 1993). At the present time outlet fumarole temperatures are around 400 °C. The first chemical data for fumarolic discharges were reported by Sainte-Clare Daville (1856) and Fouqué (1856).
Over the last three decades a number of studies have yielded large sets of chemical data from fumarole gas emissions, thermal waters and soil gases of the island (Martini & Tonani 1970; Shinohara & Matsuo 1986; Badalamenti et al. 1988; Baurbon et al. 1990; Minissale 1992; Panichi & Noto 1992; Bolognesi & D'Amore 1993; Martini 1996; Montalto 1996; Capasso et al. 1999, 2001; Di Liberto et al. 2002). Only since 1980 have geochemical models of the physical and chemical properties of the crater gases been proposed and discussed (Martini 1980, 1983), considering in addition seasonally induced variations that may perturb the primary volcanic signals. One of the main features of the different models is the role of a shallow aquifer, located between the magma chamber (at a depth of about 4 km) and the surface, in controlling the thermal output and the chemical composition of the fumaroles. Carapezza et al. (1981) proposed a pressure
cooker model, inferring the presence of pressurized, two-phase (liquid-vapour) saline fluids in a reservoir at a depth of about 2 km, able to evolve along the boiling trend of a liquid with variable content of NaCl and CO2 at about 350 °C. The dry model by Cioni & D'Amore (1984) proposed that the fumarolic discharges result from the variable mixing of (1) a deep magmatic component and (2) a shallower component (called marine hydrothermal) formed from the total evaporation of hydrothermal fluids (mostly water) of marine origin, the latter entering the low pressure, high temperature zone which surrounds the uprising conduits of magmatic fluids affected by evaporation phenomena. The great thermal variability of the Vulcano fumarole field has recently been proven by the results of detailed investigations of a large number of fumaroles carried out in May (end of the rainy season) and September (dry season) 2001 (Pirillo et al. 2002; Vaselli et al. 2003). In general, the fumarole discharges can be classified by their location and chemistry into outer and inner fumaroles with respect to the crater rim and fumaroles located on the crater rim itself. The outer gas discharges are located mainly along fractures that extend outwards from the rim and can reach temperatures up to 100 °C; their composition is generally water-dominated. The gases from the crater-rim fumaroles vary widely in temperature (100-400 °C). Steam is still the main component, up to 98 × 10⁴ μmol mol⁻¹, but high temperature gases (SO2, HCl, HF, CO) become important. On the inner part of the crater rim, fumaroles display extremely high thermal and chemical variability, with temperatures ranging from 100 °C to 420 °C, indicating the presence of a complex situation. The wide-ranging behaviour suggests that only the monitoring of a large number of discharges over time may give information on the whole system.
Consequently, seven fumaroles were selected as representative of the field and have been monitored from 2000 as follows: two from the outer part of the crater rim (labelled FNB and FZ), two from the rim itself (FNA and F5) and three from the inner part of the crater rim (F14, F27, F202). The aim was to identify geochemical parameters able to give simple indications about the chemical structure of the data and how the latter can be affected by temperature, time and/or space. The results can be used in subsequent plans of investigation focused on following changes in the behaviour of selected parameters and/or sampling sites.
Statistical analysis and results
The identification of significant changes in compositional data may be performed by taking into account the properties of the simplex, the appropriate sample space for constrained data (data in which each of the components of the composition is a proportion of a fixed total). The chemical composition of volcanic gases (expressed in volume percentages, ppm, μmol mol⁻¹ or other related units) shows its variability in this sample space, a subset of real space, in which the application of standard statistical methods can lead to spurious results, confusing the underlying structure of the data, as well as producing wrong inferences. The chemistry of the gases for the fumaroles includes the following components: H2O, CO2, SO2, H2S, S, HCl, HF, N2, O2, H2, Ar, Ne, CO, CH4, C2H4, C2H6. Data were collected for 115 samples obtained from the fumaroles discussed above, during the period 2000-2004. Log-ratios of components were used to study patterns of variability (Aitchison 1982, 1986). The part placed in the denominator is H2O, since it is the most important component of the composition and the main constituent of volcanic gases. It is involved in several key chemical reactions (e.g. H2 + CO2 ⇌ CO + H2O, 4H2 + 2SO2 ⇌ S2 + 4H2O and CO2 + 4H2 ⇌ CH4 + 2H2O), all of which are highly dependent on temperature. The temporal variation of both temperature and log-ratios has been studied for each fumarole. Time-dependence was examined using the non-parametric Wald-Wolfowitz Runs Test, which evaluates random variation in sequences of data. The variable tested must be dichotomous, that is, the values have to be below or above the mean or median values. This test is important since the assumption of mutual independence of successive values may not be sustainable in this case. Such lack of independence would preclude the use of several statistical inferential methods, which require this property in the data.
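The two steps described here, forming log-ratios with H2O in the denominator and checking a time-ordered sequence for randomness with a runs test, can be sketched as follows. The function names are hypothetical, the series is dichotomized about its median (the text allows mean or median), values equal to the median are dropped, and the large-sample normal approximation is used for the test statistic; for the short series of a single fumarole, exact tables may be preferable.

```python
import math

def log_ratio(part, water):
    """ln(part / H2O) for paired concentrations in the same units."""
    return [math.log(p / w) for p, w in zip(part, water)]

def runs_test(values):
    """Wald-Wolfowitz runs test about the median (normal approximation).

    values : observations in time order.
    Returns (number of runs, z statistic); |z| > 1.96 suggests a
    non-random sequence at the 0.05 level.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n % 2:
        median = ordered[n // 2]
    else:
        median = 0.5 * (ordered[n // 2 - 1] + ordered[n // 2])

    # Dichotomize: True above the median, False below; ties are dropped.
    signs = [v > median for v in values if v != median]
    n1 = sum(signs)
    n2 = len(signs) - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("need observations on both sides of the median")

    # A new run starts whenever the sign changes.
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))

    # Mean and variance of the number of runs under randomness.
    m = n1 + n2
    mean = 2.0 * n1 * n2 / m + 1.0
    var = 2.0 * n1 * n2 * (2.0 * n1 * n2 - m) / (m ** 2 * (m - 1))
    return runs, (runs - mean) / math.sqrt(var)
```

Too few runs (a strongly negative z) point to a trend or slow drift in the log-ratio, whereas too many runs point to rapid oscillation; in both cases successive values cannot be treated as independent.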
Temperature analysis
Probability plots of the outlet gas temperatures from the seven fumaroles sampled between 2000 and 2004 are shown in Figure 1. Runs Test results indicate that successive data for each fumarole are independent of time (α = 0.05). The Kolmogorov-Smirnov test for goodness of fit indicates that the data have a Gaussian distribution (α = 0.05). Use of the Student t-test to compare means and variances shows that there are significant differences between all the fumaroles (α = 0.05) with regard to the outlet gas temperatures. Consequently, outlet temperatures can be considered as being drawn from different populations with different mean values, increasing from 100 °C (FZ, n = 10) to 211 °C (F5, n = 13), 247 °C (FNB, n = 12), 285 °C (FNA, n = 15), 306 °C (F14, n = 15), 363 °C (F27, n = 13) and 403 °C (F202, n = 13).
Fig. 1. Normal probability plots for outlet temperatures (°C) of the seven fumaroles (FZ, FNA, FNB, F14, F5, F27 and F202).
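The goodness-of-fit check mentioned above can be illustrated with a bare Kolmogorov-Smirnov statistic against a normal distribution fitted to the sample. This is a minimal sketch with a hypothetical function name; it returns only the statistic D and no p-value. Because the mean and standard deviation are estimated from the same data, the standard Kolmogorov-Smirnov critical values are conservative, so a Lilliefors-type correction would be needed for a formal test.

```python
import math

def ks_statistic_normal(x):
    """Kolmogorov-Smirnov distance between the empirical distribution of x
    and a normal distribution with the sample mean and standard deviation."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
    d = 0.0
    for i, v in enumerate(sorted(x)):
        # Normal cumulative probability at v.
        cdf = 0.5 * (1.0 + math.erf((v - mean) / (sd * math.sqrt(2.0))))
        # Compare with the empirical step function just after and just before v.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d
```

Small values of D indicate that the empirical cumulative curve stays close to the fitted Gaussian curve, which is the same information conveyed visually by the near-linear patterns of a normal probability plot.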
Log-ratio analysis
The log-ratios ln(SO2/H2O) and ln(HCl/H2O) show time-dependent patterns, and only for three fumaroles, as follows: F14 for ln(HCl/H2O), FNA for both ln(SO2/H2O) and ln(HCl/H2O), and FZ for ln(SO2/H2O) (α = 0.05). The temporal patterns of these log-ratios are shown in Figures 2 and 3. Other log-ratios show no time dependence and, consequently, random variation is the dominant feature governing compositional changes. Analysis of variance (ANOVA) can be used to determine whether time-independent log-ratios are able to distinguish fumaroles. Kolmogorov-Smirnov tests have been carried out to verify the fit to the Gaussian distribution for each log-ratio (α = 0.05). In addition, the value of the Levene statistic for the homogeneity of variance was calculated in order to test whether all the group (fumarole) variances are homogeneous. The results indicate that this hypothesis can be accepted for α = 0.01. ANOVA indicates that only the log-ratios ln(H2/H2O), ln(CO/H2O), ln(CH4/H2O) and ln(C2H4/H2O) vary significantly among the fumaroles, independently of time. Thus, the spatial location appears to be the dominant feature governing the differences among the F5, F27, F202 and FNB fumaroles for the values of these log-ratios. Box-plots of the values of the log-ratios for the four fumaroles are shown in Figure 4. Significant differences are present for: (1) ln(H2/H2O), comparing F27 with F202 and FNB; (2) ln(CO/H2O), comparing F27 with F5 and FNB; (3) ln(CH4/H2O), comparing F202 with F5, F27 and FNB; (4) ln(C2H4/H2O), comparing F27 with FNB, in all cases for α = 0.05. Fumarole F27 consistently differs from the others for all the mentioned log-ratios. For all the other log-ratios [ln(CO2/H2O), ..., ln(C2H6/H2O)] ANOVA results indicate that these gas components show no variations among fumaroles (α = 0.05).
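The comparison of fumaroles by one-way ANOVA reduces to the ratio of the between-group to the within-group mean square of a given log-ratio. The sketch below, with a hypothetical function name, computes only the F statistic and its degrees of freedom; the significance level would then be read from the F distribution, and the Gaussian and homogeneity-of-variance checks described in the text are assumed to have been carried out beforehand.

```python
def one_way_anova_f(groups):
    """One-way ANOVA F statistic.

    groups : list of lists, one list of log-ratio values per fumarole.
    Returns (F, df_between, df_within).
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Sum of squares between groups (spatial differences among fumaroles).
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Sum of squares within groups (variation inside each fumarole).
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, means) for v in g)

    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

A large F only says that at least one fumarole differs; the pairwise statements in the text (for example F27 against F202 and FNB) require a follow-up multiple-comparison procedure, which is not shown here.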
Fig. 2. Time behaviour of the log-ratio ln(SO2/H2O) for fumaroles F14, FNA and FZ (x-axis: time in months from January 2000).
Fig. 3. Time behaviour of the log-ratio ln(HCl/H2O) for fumaroles F14, FNA and FZ (x-axis: time in months from January 2000).
Fig. 4. Box-plots of the log-ratios ln(H2/H2O), ln(CO/H2O), ln(CH4/H2O) and ln(C2H4/H2O) for fumaroles F5, F27, F202 and FNB.
Subcompositional analysis
For the general form of a chemical reaction the corresponding equilibrium constant has an equivalent logarithmic form. The logarithmic version encourages the view that a sensible way to identify patterns in compositional datasets is to search for natural log-contrasts of the components of the composition (Aitchison 1983, 1999). The time-dependent log-ratios, ln(SO2/H2O) and ln(HCl/H2O), identify the subcomposition H2O-HCl-SO2, for which the features of the log-contrast principal components can be checked. The aim is to find simple relationships containing information about the evolution of the volcanic system. The centred log-ratio covariance matrix $\Gamma$, as a means of specifying a compositional covariance structure, and the concept of the log-contrast of a composition were used by Aitchison (1983) to provide a practical form of compositional principal component analysis. In this context, a log-contrast of a $D$-part composition $x$ is any log-linear combination $a_1 \ln x_1 + \cdots + a_D \ln x_D = a' \ln x$ with $a_1 + \cdots + a_D = a'j = 0$. The first log-contrast is given by the equation:
$$0.47 \ln(\mathrm{H_2O}) - 0.81 \ln(\mathrm{HCl}) + 0.34 \ln(\mathrm{SO_2}) = k_1. \tag{1}$$
The proportion of the total subcompositional variability which is retained by the first log-contrast principal component is about 64% ($\lambda_1 / \sum_{i=1}^{D-1} \lambda_i$, where $\lambda_i$ are the eigenvalues of $\Gamma$ and $D$ is the number of parts of the composition). Dividing each coefficient by 0.81 and approximating the coefficients of H2O and SO2 by 0.5 gives:
$$\frac{(\mathrm{H_2O})^{1/2} (\mathrm{SO_2})^{1/2}}{\mathrm{HCl}} = \exp k_1 \tag{2}$$
for which $k_1$ can be calculated for each sample. A probability plot of the $k_1$ values of each fumarole is reported in Figure 5. The position of the different cumulative curves clearly reveals the presence of spatial variation from the inner part of the crater to the rim and the outer part. Furthermore, the increase in $k_1$ values is related approximately to time and to the decrease in temperature. These results clearly indicate that the first log-contrast principal component is a complex function of chemistry, temperature, time and space. It can represent a statistical tool to monitor processes that account for about 2/3 of the subcompositional variability affecting the system. The processes involve H2O, SO2 and HCl in the proportion 0.5 : 0.5 : 1. The equation for the second log-contrast principal component (explaining about 36% of the total subcompositional variability) is given by:
$$0.67 \ln(\mathrm{H_2O}) + 0.07 \ln(\mathrm{HCl}) - 0.74 \ln(\mathrm{SO_2}) = k_2. \tag{3}$$
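Log-contrast principal components of the kind given in equations (1) and (3) can be obtained from the eigen-decomposition of the centred log-ratio covariance matrix. The following NumPy sketch illustrates that general procedure (Aitchison 1983) and is not the authors' own computation; the function name is hypothetical and the input is assumed to be an n × D array of strictly positive parts.

```python
import numpy as np

def log_contrast_pca(X):
    """Log-contrast principal components of an (n, D) array of positive parts.

    Returns (eigenvalues, coefficient vectors as columns, explained proportions)
    for the D - 1 non-trivial components.
    """
    L = np.log(np.asarray(X, dtype=float))
    # Centred log-ratio transform: subtract each sample's mean log.
    clr = L - L.mean(axis=1, keepdims=True)
    # Centred log-ratio covariance matrix (singular, rank at most D - 1).
    gamma = np.cov(clr, rowvar=False)

    lam, A = np.linalg.eigh(gamma)
    order = np.argsort(lam)[::-1]
    lam, A = lam[order], A[:, order]

    # Drop the last eigenvalue, which is zero because of the singularity.
    D = L.shape[1]
    lam, A = lam[:D - 1], A[:, :D - 1]
    return lam, A, lam / lam.sum()
```

Each retained column of coefficients sums to zero, so it defines a log-contrast, and the score of a sample is `np.log(x) @ a`; for a three-part subcomposition such as H2O-HCl-SO2 there are two such components whose explained proportions add to one, in line with the 64% and 36% reported in the text.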
Fig. 5. Probability plot of the first log-contrast values k1 discriminated by fumaroles: F14 (inner), FNA (rim) and FZ (outer).
Fig. 6. Probability plot of the second log-contrast values k2 discriminated by fumaroles.
Given that the coefficient for HCl is so low, equation (3) can be expressed as:
$$\frac{(\mathrm{H_2O})^{0.67}}{(\mathrm{SO_2})^{0.74}} = \exp k_2. \tag{4}$$
A probability plot of the $k_2$ values of each fumarole is reported in Figure 6. In this case too, the position of the different cumulative curves clearly reveals the presence of spatial variation from the inner part of the crater to the rim and to the outer part. However, the F14 and FNA fumaroles show more similar behaviour than in the previous case, and this appears to be related to the lack of HCl. These results indicate that the second log-contrast principal component is a simple function of chemistry (only two variables), space (inner and rim parts of the crater discriminated from the outer one), temperature and time. The increase in its values is related approximately to time and to the decrease in temperature. Consequently, equation (4) can represent a useful tool to follow processes explaining about 1/3 of the data variability and related to H2O and SO2.
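As a worked illustration of how the monitoring parameters might be evaluated for a single gas sample, the sketch below computes $k_1$ from the full log-contrast of equation (1), from the simplified form of equation (2), and $k_2$ from equation (4). The function names are hypothetical and any concentrations passed in are the user's own data; none are taken from the paper.

```python
import math

def k1_full(h2o, hcl, so2):
    """k1 from equation (1); the coefficients sum to zero (a log-contrast)."""
    return 0.47 * math.log(h2o) - 0.81 * math.log(hcl) + 0.34 * math.log(so2)

def k1_simplified(h2o, hcl, so2):
    """k1 from equation (2): (H2O)^(1/2) (SO2)^(1/2) / HCl = exp(k1)."""
    return math.log(math.sqrt(h2o) * math.sqrt(so2) / hcl)

def k2_simplified(h2o, so2):
    """k2 from equation (4): (H2O)^0.67 / (SO2)^0.74 = exp(k2)."""
    return 0.67 * math.log(h2o) - 0.74 * math.log(so2)
```

Because the coefficients in equation (1) sum to zero, `k1_full` is unchanged when all three parts are multiplied by the same factor, so the choice of concentration units does not matter. The simplified equation (4) drops the small HCl term, so its exponents no longer sum exactly to zero and `k2_simplified` should be evaluated in consistent units from sample to sample.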
Discussion
The geochemical-statistical investigation of the compositional data of fumaroles related to the field of Vulcano Island in terms of log-ratios with H2O as the denominator reveals the presence of: (1) a time-space dependency for ln(SO2/H2O) and ln(HCl/H2O); (2) a space dependency for ln(H2/H2O), ln(CO/H2O), ln(CH4/H2O) and ln(C2H4/H2O), associated with the discriminant behaviour of the F27 fumarole; and (3) randomness in time and space for ln(CO2/H2O), ln(H2S/H2O), ln(S/H2O), ln(HF/H2O), ln(N2/H2O), ln(O2/H2O), ln(Ar/H2O), ln(Ne/H2O) and, finally, ln(C2H6/H2O). In the first case, two acid gases (SO2 and HCl) are involved. These species, along with HF, can be highly affected by scrubbing as the magmatic gases rise to the surface (Symonds et al. 2001). The ln(SO2/H2O) log-ratio changes significantly with time for FNA (rim) and FZ (outer), while the F14 (inner) trend is, instead, not significant from a statistical point of view, even if quite similar to that of FNA (Runs Tests, Fig. 2); one should notice here the wide oscillations characterizing the behaviour of FZ before 2003. By considering these features it may be concluded that either the influence of the shallow environment has decreased in time from 2000 to 2004 or the magmatic contribution has increased (both in a significant manner), and that statistically significant changes in ln(SO2/H2O) occur only on the rim and the outer part of the crater. The behaviour of ln(HCl/H2O) over time (Fig. 3) is similar for the F14, FNA and FZ fumaroles at the beginning of 2002 and at the end of 2004. It diverges markedly at the end of 2002 for FZ and less markedly for F14 and FNA. However, it may be useful to remember that for FZ the revealed time pattern is not statistically significant. The behaviour of the log-ratio in the three fumaroles appears to reveal the presence of some general mechanism that is able to homogenize the data in time, independently of their spatial location. This type of phenomenon occurred at the beginning of 2002 and at the end of 2004. Data of different times have changed following more or less different evolutionary paths, with respect to their spatial location. Fluctuations may be due to the different sources contributing HCl and, in particular, to secondary non-magmatic processes such as interactions of sea water with hot rocks leading to the formation of volatile HCl (Mazor et al. 1988). It is known that Vulcano fumaroles show a marine signature due to sea water that directly enters the hydrothermal system at shallow levels (Cortecci et al. 1996, 2001). Similarity among data drawn from different fumaroles can be related to periods of time in which an increase in fluid circulation smoothed the differences. The similar behaviour of the log-ratio ln(HF/H2O) appears to confirm this hypothesis. The investigation of the subcomposition H2O-HCl-SO2 yielded two relationships that link the chemical species in a form similar to the Law of Mass Action.
The first relationship, called the first log-contrast principal component (2/3 of the total subcompositional variability), models the behaviour of chemical species able to give information on important phenomena such as the magmatic contribution (SO2), scrubbing (H2O) and the extent of the water-rock interactions (HCl). The relationships among these processes can be quantified by the proportion 0.5 : 0.5 : 1. The second log-contrast (1/3 of the total subcompositional variability) models mainly changes in the H2O/SO2 ratio and, consequently, scrubbing phenomena compared with the magmatic contribution. If the behaviour of the log-ratios depending only on space, such as ln(H2/H2O), ln(CO/H2O), ln(CH4/H2O) and ln(C2H4/H2O), is analysed, one can check that the species involved are discriminated by considering high (CO, H2 and C2H4) and low (CH4) temperature gas compounds. As already mentioned, temperature is a discriminant parameter among fumaroles but significant time-dependent
trends were not observed. Moreover, the above species participate in chemical reactions highly affected by redox conditions of the system (for example, H2 + CO2 ⇌ CO + H2O, 4H2 + 2SO2 ⇌ S2 + 4H2O and CO2 + 4H2 ⇌ CH4 + 2H2O). In this framework the behaviour of the F27 fumarole is well discriminated from the others due to its higher temperature. If the future behaviour of this fumarole is investigated, changes in the redox conditions of the system can be clearly revealed, thus forecasting important changes in the volcanic system. The random behaviour of all the other chemical species analysed in this work appears to be related to the high variability of the log-ratios, thus giving a measure of their sensitivity to any change affecting the environment in time and/or space. This result indicates that their utility as monitoring parameters in surveillance programmes is limited to the present state of the volcanic system.
Conclusions
Many systematic and random processes control the chemistry of gas discharges from active and quiescent volcanic areas, some of which may be poorly defined or inadequately understood. As a gas mixture is released by a magmatic body and moves toward the surface, complex chemical-physical reactions occur, according to variations in the level of volcanic activity, resulting in variable quantities of the non-magmatic/magmatic chemical species (Giggenbach 1996). Despite the complexity of the volcanic system, fluids discharged at the surface may be modelled stochastically. Thus, experimental sampling designs are crucial since they determine the independence of observations, and hence inform the best approach to data analysis. Moreover, if data are compositional, the choice of sample space has to be taken into account in order to build sound statistical models. In this experimental design seven fumaroles and sixteen variables were used to investigate the fumarolic field of Vulcano Island. Investigation of geochemical parameters in log-ratio form has shown that gases from fumaroles F14, FNA and FZ have significant differences in ln(SO2/H2O) and ln(HCl/H2O) in time and space; however, for ln(HCl/H2O) some mechanism is able to homogenize the data in some periods of time independently of their spatial location. This indicates that Vulcano has periodically experienced periods of time in which an intense fluid circulation was able to affect large portions of the crater, leading to a spatial diffusion of water-rock interaction processes. The H2O-SO2-HCl subcomposition has been used to describe the relationships among the chemical species and
log-contrast principal component analysis permitted the search for equations similar to the Law of Mass Action. Following this path, chemical changes summarized by the log-contrasts have been analysed by considering their dependence on temperature, time and/or space. Log-contrast principal components are simple relationships able to take into account changes in important factors, such as the magmatic contribution, scrubbing and water-rock interactions, affecting the evolution of Vulcano. It is clear that in the investigated span of time, gas chemistry was highly variable. Causes of this variability are unclear but probably related to the high number of chemical-physical parameters affecting the chemical equilibria. Seismic activity 40 km offshore of Palermo, resulting in a M = 5.8 earthquake, abrupt outgassing off Panarea and volcanic eruptions at Stromboli in late 2002 through to summer 2003 suggest regional tectonic instability to which the changes in gas composition at Vulcano may be related (Biagi et al. 2004; Di Giovambattista & Tyupkin 2004; Caracausi et al. 2005; Calvari et al. 2005). The first and second log-contrast principal components account for such variation in gas composition of the fumaroles for Vulcano. Potentially, log-contrast equations can be used in the future to monitor further temporal changes. However, it may be that new data yield different log-contrasts that provide additional information to help model volcanic activity. In addition, the statistical investigation reveals that the log-ratios ln(H2/H2O), ln(CO/H2O), ln(CH4/H2O) and ln(C2H4/H2O) are related to the sample site but no time dependency was revealed. It is concluded that temperature and redox conditions are a function of the fumarole but temporal changes are essentially random.
Finally, for the other log-ratios (ln(CO2/H2O), ln(H2S/H2O), ln(S/H2O), ln(HF/H2O), ln(N2/H2O), ln(O2/H2O), ln(Ar/H2O), ln(Ne/H2O), ln(C2H6/H2O)) random variation over time and space is the fundamental pattern, thus compromising their use as monitoring tools in surveillance programmes. The results obtained in this work allow one to conclude that the combined approach of geochemistry and statistics is a powerful tool for investigating complex volcanic systems evolving in time and space and where random variation is an important feature. In these situations, compositional data analysis appears to isolate variables responsible for significant changes, thus yielding parameters to be used in surveillance programmes for volcanic activity on a sound statistical base. This research has been supported financially by Italian MIUR (Ministero dell'Istruzione, dell'Università e della Ricerca Scientifica e Tecnologica), PRIN 2004, through
the GEOBASI project (prot. 2004048813-002). GNV (National Group of Volcanology) and ASI (Italian Space Agency) are also acknowledged for partly supporting the field work at Vulcano Island. The authors thank R. Reyment and C. Thomas for helpful reviews.
References AITCHISON, J. 1982. The statistical analysis of compositional data (with discussion). Journal of the Royal Statistical Society Series B, 44 (2), 139-177.
AITCHISON, J. 1983. Principal component analysis of compositional data. Biometrika, 70, 57-65. AITCHISON, J. 1986. The Statistical Analysis of Compositional Data. Chapman & Hall, London. AITCHISON, J. 1999. Log-ratios and natural laws in compositional data analysis. Mathematical Geology, 24 (4), 365-379. BADALAMENTI, B., GUERRIERI, S., HAUSER, S., PARELLO, F. & VALENZA, M. 1988. Soil CO2 output in the island of Vulcano during the period 1984-1988: surveillance of gas hazard and volcanic activity. Rendiconti Società Italiana di Mineralogia e Petrologia, 43, 893-899. BARBERI, F., INNOCENTI, F., FERRARA, G., KELLER, J. & VILLARI, L. 1974. Evolution of Eolian arc volcanism (southern Tyrrhenian Sea). Earth and Planetary Science Letters, 21, 269-276. BAURBON, J., ALLARD, P. & TOUTAIN, J. 1990. Diffuse volcanic emissions of carbon dioxide from Vulcano Island, Italy. Nature, 344, 51-53. BECCALUVA, L., GABBIANELLI, G., LUCCHINI, R., ROSSI, P. & SAVANELLI, C. 1985. Petrology and K/Ar ages of volcanics dredged from the Eolian seamount: implications for geodynamic evolution of the southern Tyrrhenian basin. Earth and Planetary Science Letters, 74, 187-200. BIAGI, P. F., PICCOLO, R., CASTELLANA, L. ET AL. 2004. Variations in a LF radio signal on the occasion of the recent seismic and volcanic activity in Southern Italy. Tectonophysics, 384, 243-255. BOLOGNESI, L. & D'AMORE, F. 1993. Isotopic variation of the hydrothermal system on Vulcano Island, Italy. Geochimica et Cosmochimica Acta, 9, 2069-2082. BUCCIANTI, A. & ESPOSITO, P. 2004. Insights into late Quaternary calcareous nannoplankton assemblages under the theory of statistical analysis for compositional data. Palaeogeography, Palaeoclimatology, Palaeoecology, 202, 209-227. BUCCIANTI, A. & PAWLOWSKY-GLAHN, V. 2005. New perspectives on water chemistry and compositional data analysis. Mathematical Geology, 37 (7), 707-731. CALVARI, S., SPAMPINATO, L. & LODATO, L. 2005.
The 5 April 2003 vulcanian paroxysmal explosion at Stromboli volcano (Italy) from field observations and thermal data. Journal of Volcanology and Geothermal Research, 149, 160-175. CAPASSO, G., FAVARA, R., FRACOFONTE, S. & INGUAGGIATO, S. 1999. Chemical and isotopic variations in fumarolic discharge and thermal waters at Vulcano Island (Aeolian Islands, Italy) during 1996: evidence of resumed volcanic
activity. Journal of Volcanology and Geothermal Research, 88, 167-175. CAPASSO, G., D'ALESSANDRO, W., FAVARA, R., INGUAGGIATO, S. & PARELLO, F. 2001. Interaction between the deep fluids and the shallow groundwaters on Vulcano Island (Italy). Journal of Volcanology and Geothermal Research, 108, 187-198. CARACAUSI, A., DITTA, M., ITALIANO, F., LONGO, M., NUCCIO, P., PAONITA, A. & RIZZO, A. 2005. Changes in fluid geochemistry and physico-chemical conditions of geothermal systems caused by magmatic input: the recent abrupt outgassing off the island of Panarea (Aeolian Islands, Italy). Geochimica et Cosmochimica Acta, 69 (12), 3045-3059. CARAPEZZA, M., NUCCIO, P. M. & VALENZA, M. 1981. Genesis and evolution of the fumaroles of Vulcano (Aeolian Islands, Italy): a geochemical model. Bulletin of Volcanology, 44 (3), 547-563. CHIODINI, G., CIONI, R., FALSAPERLA, S., MONTALTO, A., GUIDI, A. & MARINI, L. 1992. Geochemical and seismological investigations at Vulcano (Aeolian Islands) during 1978-1989. Journal of Geophysical Research, 97, 11025-11032. CHIODINI, G., CIONI, R. & MARINI, L. 1993. Reactions governing the chemistry of crater fumaroles from Vulcano Island, Italy, and implications for volcanic surveillance. Applied Geochemistry, 8, 357-371. CHIODINI, G., CIONI, R., MARINI, L. & PANICHI, C. 1995. Origin of the fumaroles fluids of Vulcano Island, Italy and implications for volcanic surveillance. Bulletin of Volcanology, 57, 99-110. CIONI, R. & D'AMORE, F. 1984. A genetic model for the crater fumaroles of Vulcano Island (Sicily, Italy). Geothermics, 13 (4), 375-384. CORTECCI, G., FERRARA, G. & DINELLI, E. 1996. Isotopic time-variations and variety of sources for sulfur in fumaroles at Vulcano Island, Aeolian Archipelago, Italy. Acta Vulcanologica, 8, 147-160. CORTECCI, G., DINELLI, E., BOLOGNESI, L., BOSCHETTI, T. & FERRARA, G. 2001. Chemical and isotopic composition of water and dissolved sulfate from shallow wells on Vulcano Island, Aeolian Archipelago, Italy.
Geothermics, 30, 69-91. DE ASTIS, G., LA VOLPE, L., PECCERILLO, A. & CIVETTA, L. 1997. Volcanological and petrological evolution of Vulcano island (Aeolian arc, Southern Tyrrhenian Sea). Journal of Geophysical Research, 102, 8021-8051. DI GIOVAMBATTISTA, R. & TYUPKIN, Y. S. 2004. Seismicity patterns before the M = 5.8 2003, Palermo (Italy) earthquake: seismic quiescence and accelerating seismicity. Tectonophysics, 384, 243-255. DI LIBERTO, V., NUCCIO, P. M. & PAONITA, A. 2002. Genesis of chlorine and sulphur in fumarolic emissions at Vulcano Island (Italy): assessment of pH and redox conditions in the hydrothermal system. Journal of Volcanology and Geothermal Research, 116, 137-150. EGOZCUE, J., PAWLOWSKY-GLAHN, V., MATEU-FIGUERAS, G. & BARCELO-VIDAL, C. 2003. Isometric log-ratio transformations for compositional
data analysis. Mathematical Geology, 35 (3), 279-300. FOUQUÉ, F. 1856. Sur le phenomenes eruptifs de l'Italie meridionale. Comptes Rendus de la Societe Geologique de France, Paris, XLI, 565-567. FREPOLI, A., SELVAGGI, C., CHIARABBA, G. & AMATO, A. 1996. State of stress in the southern Tyrrhenian subduction zone from fault-plane solutions. Geophysical Journal International, 125, 879-891. GIGGENBACH, W. F. 1996. Chemical composition of volcanic gases. In: SCARPA, R. & TILLING, R. I. (eds) Monitoring and Mitigation of Volcano Hazards. Springer-Verlag, Berlin-Heidelberg, Germany, 221-256. LEEMAN, W., TONARINI, S., PENNISI, M. & FERRARA, G. 2005. Boron isotopic variations in fumarolic condensates and thermal waters from Vulcano Island, Italy: implications for evolution of volcanic fluids. Geochimica et Cosmochimica Acta, 69 (1), 143-163. MARTINI, M. 1980. Geochemical survey on the phreatic waters of Vulcano (Aeolian Islands, Italy). Bulletin of Volcanology, 43 (1), 265-274. MARTINI, M. 1983. Variations in surface manifestation at Vulcano (Aeolian Islands, Italy) as a possible evidence of deep processes. Bulletin of Volcanology, 46 (1), 83-86. MARTINI, M. 1993. Water and fire: Vulcano island from 1977 to 1991. Geochemical Journal, 27, 297-303. MARTINI, M. 1996. Chemical characters of the gaseous phase in different stages of volcanism: precursors and volcanic activity. In: SCARPA, R. & TILLING, R. I. (eds) Monitoring and Mitigation of Volcano Hazards. Springer-Verlag, Berlin-Heidelberg, Germany, 200-219. MARTINI, M. & TONANI, F. 1970. Rilevamento idrogeologico di Vulcano. Rapporti I e IA, CNR n. 6900544, Palermo, Italy. MAZOR, E., CIONI, R., CORAZZA, E. ET AL. 1988. Evolution of fumarolic gases - boundary conditions set by measured parameters: case study at Vulcano, Italy. Bulletin of Volcanology, 50, 71-85. MINISSALE, A. 1992. Isotopic composition of natural thermal discharges on Vulcano Island, southern Italy. Journal of Hydrology, 139, 15-25. MONTALTO, A. 1996.
Signs of potential renewal of eruptive activity at La Fossa (Vulcano, Aeolian Islands). Bulletin of Volcanology, 57, 483-492. MONTEGROSSI, G., TASSI, F., VASELLI, O., BUCCIANTI, A. & GAROFALO, K. 2001. Sulfur species in volcanic gases. Analytical Chemistry, 73 (15), 3709-3715. PANICHI, C. & NOTO, P. 1992. Isotopic and chemical composition of water, steam and gas samples of the natural manifestations of the island of Vulcano Aeolian Arc, Italy. Acta Vulcanologica, 2, 297-312. PAWLOWSKY-GLAHN, V. & BUCCIANTI, A. 2002. Visualization and modelling of sub-populations of compositional data: statistical methods illustrated by means of geochemical data from fumarolic
fluids. International Journal of Earth Sciences, 91, 357-368. PAWLOWSKY-GLAHN, V. & EGOZCUE, J. 2001. Geometric approach to statistical analysis on the simplex. Stochastic Environmental Research and Risk Assessment, 15 (5), 384-398. PIRILLO, L., VASELLI, O. & TASSI, F. 2002. Studio della distribuzione areale e delle variazioni stagionali dei composti gassosi nelle fumarole dell'Isola di Vulcano (Italia meridionale). 82 SIMP Congress, Cosenza, Italy, 242-243. REYMENT, R. A. & SAVAZZI, E. 1999. Aspects of Multivariate Statistical Analysis in Geology. Elsevier, Amsterdam. SAINTE-CLARE DEVILLE, C. 1856. Sur les phenomenes eruptifs du Vesuve et de l'Italie Meridionale. Comptes Rendus de la Societe Geologique de France, Paris, XLI, 606-610. SHINOHARA, H. & MATSUO, S. 1986. Results of analyses on fumarolic gases from F1 and F5 fumaroles of Vulcano, Italy. Geothermics, 15 (2), 211-215. SICARDI, L. 1940. Il recente ciclo dell'attività fumarola dell'isola di Vulcano. Bulletin of Volcanology, 7, 85-139.
SYMONDS, R., GERLACH, T. & REED, M. 2001. Magmatic gas scrubbing: implications for volcano monitoring. Journal of Volcanology and Geothermal Research, 108, 303-341. VASELLI, O., TASSI, F., CAPACCIONI, B., TEDESCO, D., BELLUCCI, F. & PIRILLO, L. 2003. Le manifestazioni fumaroliche e le acque di falda dell'Isola di Vulcano nel periodo Novembre 2002-Marzo 2003. GeoItalia 2003, IV Forum FIST, Italy, 440. VON EYNATTEN, H., PAWLOWSKY-GLAHN, V. & EGOZCUE, J. 2002. Understanding perturbation on the simplex: a simple method to better visualise and interpret compositional data in ternary diagrams. Mathematical Geology, 34 (3), 249-275. VON EYNATTEN, H., BARCELO-VIDAL, C. & PAWLOWSKY-GLAHN, V. 2003a. Composition and discrimination of sandstone: a statistical evaluation of different analytical methods. Journal of Sedimentary Research, 73 (1), 47-57. VON EYNATTEN, H., BARCELO-VIDAL, C. & PAWLOWSKY-GLAHN, V. 2003b. Modelling compositional changes: the example of chemical weathering of granitoid rocks. Mathematical Geology, 35 (3), 231-251.
Ternary sandstone composition and provenance: an evaluation of the 'Dickinson model'
G. J. WELTJE
Delft University of Technology, Faculty of Civil Engineering and Geosciences, Department of Geotechnology, Applied Geology Section, PO Box 5028, NL-2600GA Delft, The Netherlands (e-mail:
[email protected]) Abstract: A popular model proposed by W. R. Dickinson and co-workers in the early 1980s
relates the composition of sandstones to the plate-tectonic setting of the sedimentary basins in which they were deposited. The present study is devoted to revision and testing of the 'Dickinson model' based on the original data which comprise 11 000 thin sections point-counted by hundreds of different operators over a period of three decades. Statistical analyses based on Aitchison's additive log-ratio transformation are used to obtain an optimal partitioning of ternary compositional spaces into 'provenance fields' and combined with stochastic simulation to assess the success ratio of the optimized 'Dickinson model'. Results indicate that differences between the grand means of each of the three major provenance associations (continental block, magmatic arc and recycled orogen) are highly significant, whereas overall inferential success ratios range from 64% to 78% in the four ternary systems studied. Current methods of dealing with sands of mixed provenance are unsatisfactory. To improve provenance models, the use of ternary subcompositions should be replaced by analyses of the full six-part (Qm, Qp, P, K, Lv, Ls) composition, and their covariance structure could be employed to 'unmix' samples into end-member provenance types.
The idea that sand(stone) composition reflects the nature of the rocks exposed in a source area, as well as the climatic and physiographic regime in which the sand was generated from these rocks, forms the basic premise of sediment provenance studies (Haughton et al. 1991; Johnsson 1993; Basu 2003; Weltje & Von Eynatten 2004). The empirical relation between ternary compositions of sands and the plate-tectonic setting of the sedimentary basins in which they were deposited, first explored by Crook (1974) and Schwab (1975), was formally presented by Dickinson and co-workers (Dickinson & Suczek 1979; Dickinson 1982, 1985; Dickinson et al. 1983). The 'Dickinson model' (DM) is the first quantitative representation of this key concept in sand provenance studies, whose origins may be traced as far back as the late nineteenth century (Weltje & Von Eynatten 2004). The DM consists of four ternary diagrams subdivided into different provenance fields in which subcompositions of sands may be plotted to infer their most likely plate-tectonic environment (see Table 1 for definitions of compositional variables, ternary subcompositions and plate-tectonic settings used in the DM). The apparently straightforward DM enjoyed great popularity from its inception and has been regarded as a benchmark by at least two generations of sediment petrographers. In the light of this popularity, it is remarkable that little attention has been paid to tests of its predictive
power. The main reason for the lack of such tests is that problems associated with statistical analysis of 'closed-sum' data (i.e. compositions) have been recognized widely but no solution was available until the log-ratio transformation of Aitchison (1982, 1986) became known outside of the field of mathematical statistics. Although the log-ratio transformation is discussed in some detail in recent textbooks on geological data analysis (Rollinson 1993; Swan & Sandilands 1993; Davis 1997; Pawlowsky-Glahn & Olea 2004), one can hardly call it a standard technique, as witnessed by the majority of geological papers in which compositional data are statistically analysed without much regard for their inherent limitations. Molinaroli et al. (1991) attempted to test the DM by means of discriminant function analysis of the ternary QFL and QmFLt data of Dickinson et al. (1983) without applying a log-ratio transformation. They concluded that the DM correctly classifies 85% of the data at most. However, this conclusion is difficult to justify from a methodological point of view.
• The intrinsic limitations of compositional data caused by the constant-sum and non-negativity constraints ('closure effects'), which are known to affect the results of discriminant function analysis (e.g. Butler 1982), were not taken into account. This implies that the DM may
From: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 79-99. 0305-8719/06/$15.00 © The Geological Society of London 2006.
Table 1. Compositional variables, ternary systems and provenance associations referred to in this study

Grain categories
  Total quartzose grains: Q = Qm + Qp
    Qm = monocrystalline quartz
    Qp = polycrystalline quartz
  Total feldspar grains: F = P + K
    P = plagioclase grains
    K = alkali feldspar grains
  Total unstable lithic fragments: L = Lv + Ls
    Lv = (meta)volcanic lithic fragments
    Ls = (meta)sedimentary lithic fragments
  Total lithic fragments: Lt = L + Qp

Ternary systems
  QFL       Framework (emphasis on maturity)
  QmFLt     Framework (emphasis on parent rock)
  QmPK      Subcomposition: monomineralic grains
  QpLvLs    Subcomposition: lithic fragments

Provenance associations
  A    Continental block provenance
  B    Magmatic arc provenance
  C    Recycled orogen provenance
  M    Mixed provenance
actually be more powerful if examined in the light of an appropriate statistical model.
• The data used to calculate the success ratio of the empirical classification procedure were also used to establish the discriminant functions. This tends to flatter the results and overestimate the success ratio of the DM, because sampling variability is not taken into account. Tests with independent data are more likely to provide a reasonable assessment of the efficiency of empirical classification schemes, which quite often is disappointingly low (e.g. Armstrong-Altrin & Verma 2005).
In other words, the actual performance of the DM may be better or worse than suggested by the analysis of Molinaroli et al. (1991). In this study, Aitchison's log-ratio approach will be used to analyse the DM database and to obtain the optimal partitioning of ternary compositional space into 'provenance fields'. The final step in the analysis is quantification of the discriminatory power of the DM. The revised DM employs an alternative graphic representation of ternary data, which will be introduced under the term 'log-ratio diagram', but the results have also been transferred to the familiar ternary space to permit a direct comparison with the provenance fields of the original DM (Dickinson 1985).
Statistical analysis of ternary compositions
Many classification schemes developed for sediments employ ternary diagrams (Klein 1963; Okada 1971). The popularity of the ternary diagram, which appears to have been invented in the late nineteenth century (Becke 1897), is most likely attributable to its intuitive appeal. It allows the display of three-part compositions x = (x1, x2, x3) in a way that treats all components equally, even though one component is redundant because the x-values are non-negative and their sum equals unity (or 100%). The non-negativity and constant-sum constraints represent two fundamental properties of compositional data which are equally relevant to the study of compositions with more than three parts and have frustrated many attempts at statistical analysis, as illustrated by Chayes (1960), Butler (1979), Aitchison (1986), Rollinson (1993) and many others. The additive log-ratio transformation introduced by Aitchison (1982) is a powerful tool that removes the non-negativity and constant-sum constraints on compositional variables, and permits the use of standard multivariate statistical methods based on the assumption of multivariate normality. It is defined as follows. Let xi represent the relative abundances of components in a composition
made up of k constituents (1 ≤ i ≤ k). The kth component xk, whose value is fully specified by the sum of the other k − 1 values, is used as a common denominator (or numerator) to form a series of k − 1 ratios of component abundances. The logarithms of these ratios are defined as the set of additive log-ratios yi:

$$ y_i = \log\left(\frac{x_i}{x_k}\right) = \log x_i - \log x_k, \qquad \text{where } i = 1, 2, \ldots, k-1 \tag{1} $$

or, alternatively, y = alr(x). Log-ratios are amenable to rigorous statistical analysis, unlike the constrained compositional variables. They are unconstrained in the sense that they can take on any value, and their values can be modified without automatically forcing a response from the other log-ratios formed from the same composition. Moreover, the outcomes of log-ratio statistical analysis are permutation invariant, i.e. unaffected by the choice of common denominator or numerator. Compositional data follow an additive logistic normal distribution if their log-ratios are multivariate normally distributed. The requirement of additive logistic normality appears to be fulfilled by many types of compositional data. Statistical models for ternary compositions (x1, x2, x3) are thus preferentially constructed under the assumption of a bivariate normal distribution of the corresponding set of log-ratios (y1, y2). The results of log-ratio statistical analysis may be mapped back onto the compositional plane for display in a ternary diagram. Mapping is accomplished by the inverse log-ratio transformation, which comprises the following steps. The logistic transformation reimposes the non-negativity constraint:
$$ z_i = \begin{cases} e^{y_i} & \text{for } i = 1, 2, \ldots, k-1 \\ 1 & \text{for } i = k \end{cases} \tag{2} $$

after which the constant sum C is restored:

$$ x_i = \frac{C \, z_i}{\sum_{j=1}^{k} z_j} \tag{3} $$
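As an illustration, equations (1) to (3) can be sketched in a few lines of Python. This is a minimal sketch rather than the author's implementation: the function names `alr` and `alr_inv`, the use of the last part as common denominator and the example QFL percentages are all assumptions made here for illustration.

```python
import math

def alr(x):
    """Additive log-ratio transform, equation (1).

    Assumption: the last part of x is used as the common denominator.
    """
    if any(v <= 0 for v in x):
        raise ValueError("all parts must be strictly positive")
    return [math.log(v / x[-1]) for v in x[:-1]]

def alr_inv(y, total=1.0):
    """Inverse transform: equation (2) followed by equation (3)."""
    z = [math.exp(v) for v in y] + [1.0]   # equation (2): z_k = 1
    s = sum(z)
    return [total * v / s for v in z]      # equation (3): restore sum C

# Hypothetical three-part composition (Q, F, L) in percent.
qfl = [60.0, 25.0, 15.0]
y = alr(qfl)                      # two unconstrained log-ratios
back = alr_inv(y, total=100.0)    # recovers the original percentages
```

The round trip `alr_inv(alr(x), total=C)` returns the original composition whenever the parts of x sum to C, which is a convenient check when results are mapped back to a ternary diagram.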
It should be pointed out that the additive log-ratio transformation alr(.) is but one way of approaching the problem. Two drawbacks of alr(.) are its lack of symmetry and orthogonality, which are attributable to the use of a common numerator or denominator. Different forms of log-ratio transformation have been developed to alleviate these problems and to accommodate the ever-widening range of applications in compositional data analysis (Aitchison & Egozcue 2005). The centred log-ratio
transformation clr(.) provides a symmetrical treatment of all parts of a composition (Aitchison 1986), whereas the isometric log-ratio transformation ilr(.) was developed to enable statistical analyses on orthonormal coordinates (Egozcue et al. 2003). In the present study, alr(.) was used because it leads to a representation of compositional data that is more similar to conventional ratios used in sedimentary petrology than clr(.) and ilr(.), and therefore easier to understand. Weltje (2002) discussed the construction of confidence regions and predictive regions in ternary diagrams by means of the alr transformation in detail, and illustrated their superiority over the conventional hexagonal fields of variation employed in sedimentary petrology. The term confidence region is reserved for regions of ternary compositional space (or its binary alr-transformed equivalent) in which the fixed population mean is expected to be located with some probability, generally referred to as the confidence level. The term predictive region refers to the population as a whole, i.e. to the region of compositional space in which future observations are expected to be located. The probability associated with a predictive region is termed the content. Figure 1a shows a set of ternary compositions and the corresponding hexagonal fields of variation calculated from univariate summary statistics. The hexagonal fields are clearly inadequate, because they fail to capture the curvature of the dataset and extend beyond the boundaries of the diagram, which implies the prediction of negative percentage values of one or more of the components. The reason for this unrealistic result is the underlying statistical model which erroneously assumes independent normal distributions of ternary percentage values and does not incorporate unit-sum and non-negativity constraints. Figure 1b illustrates the calculation of true confidence and predictive regions from the alr-transformed data.
Such regions are, by definition, elliptical in log-ratio space if the data follow an additive logistic normal distribution. Figure 1c shows the same regions projected onto a ternary diagram by application of the inverse alr transform (equations (2) and (3)). Note that the curvature of the data points is adequately captured and the region is physically meaningful, because it does not extend beyond the boundaries of ternary compositional space.
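The construction of such regions can be sketched as follows. This is a hedged illustration, not the procedure of Weltje (2002): it assumes that the ellipses are the standard Hotelling-type regions for a bivariate normal sample, and the function names, the eight sample compositions and the choice of the third part as denominator are hypothetical. For two log-ratios the F quantile with (2, m) degrees of freedom has a closed form, so only the standard library is needed.

```python
import math

def f_quantile_2(q, m):
    """Quantile q of the F distribution with (2, m) degrees of freedom."""
    return 0.5 * m * ((1.0 - q) ** (-2.0 / m) - 1.0)

def alr(x):
    """alr of a three-part composition, third part as denominator."""
    return (math.log(x[0] / x[2]), math.log(x[1] / x[2]))

def alr_inv(y, total=100.0):
    """Inverse alr, equations (2) and (3)."""
    z = (math.exp(y[0]), math.exp(y[1]), 1.0)
    s = sum(z)
    return tuple(total * v / s for v in z)

def region(comps, level=0.95, kind="confidence", steps=72):
    """Boundary of an elliptical region, mapped back to ternary space.

    Assumption: Hotelling-type regions for a bivariate normal sample.
    """
    ys = [alr(c) for c in comps]
    n = len(ys)
    m1 = sum(y[0] for y in ys) / n
    m2 = sum(y[1] for y in ys) / n
    s11 = sum((y[0] - m1) ** 2 for y in ys) / (n - 1)
    s22 = sum((y[1] - m2) ** 2 for y in ys) / (n - 1)
    s12 = sum((y[0] - m1) * (y[1] - m2) for y in ys) / (n - 1)
    # Squared Mahalanobis radius of the region for the mean ...
    c = 2.0 * (n - 1) / (n * (n - 2)) * f_quantile_2(level, n - 2)
    if kind == "predictive":
        c *= n + 1            # ... and for a single future observation
    # Cholesky factor of the 2 x 2 sample covariance matrix.
    l11 = math.sqrt(s11)
    l21 = s12 / l11
    l22 = math.sqrt(s22 - l21 ** 2)
    r = math.sqrt(c)
    boundary = []
    for k in range(steps):
        t = 2.0 * math.pi * k / steps
        u, v = r * math.cos(t), r * math.sin(t)
        boundary.append(alr_inv((m1 + l11 * u, m2 + l21 * u + l22 * v)))
    return boundary

# Hypothetical (Q, F, L) percentages, for illustration only.
samples = [(70, 20, 10), (65, 20, 15), (55, 30, 15), (60, 15, 25),
           (75, 15, 10), (50, 25, 25), (68, 22, 10), (58, 27, 15)]
conf = region(samples, kind="confidence")
pred = region(samples, kind="predictive")
```

Because every boundary point passes through the inverse transform, all parts stay positive and sum to 100, so the region cannot leave the ternary diagram, in contrast to the hexagonal fields.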
The log-ratio diagram
The elliptical shape of confidence regions in log-ratio space offers an attractive alternative to the use of hexagons in ternary space. In many
Fig. 1. Hexagonal fields of variation versus alr-based regions. Solid lines: confidence regions of population mean. Dashed line: predictive regions of population. Confidence limits are 90%, 95% and 99%. (a) Hexagonal region constructed from intersections of univariate normal approximations; (b) alr-based regions in log-ratio space; (c) alr-based regions transformed to ternary compositional space (after Weltje 2002). [Panels (a) and (c): ternary diagrams with vertices Qt, F and R. Panel (b): one axis is ln(Qt/R).]
applications of compositional data analysis, the results of log-ratio statistical analysis are retransformed to percentage data and displayed in a ternary diagram (cf. Figs 1b, c). This inverse transformation has two opposite effects. On the one hand, transformation to the original (conventional) units of measurement makes the results easier to understand. On the other hand, the elegant elliptical shape of confidence regions is lost after transformation to ternary percentages and results may be more difficult to interpret if several partly overlapping regions are plotted in the same ternary diagram. An alternative approach, examined in this study, is to 'map' the log-ratio space that corresponds to the ternary diagram and display the results as ellipses in that space. If researchers become accustomed to this method of display, it could eventually replace the ternary diagram. An ilr-based log-ratio diagram could also be developed as an alternative representation of the alr-based diagram presented in this study. The ternary diagram of Figure 2a contains three lines from each of the vertices towards the middle of the opposite sides, i.e. lines along which the abundance of one component equals that of another. Because these lines represent constant (log-)ratios,
they are also straight lines in log-ratio space (Fig. 2b). This does not apply to fixed-percentage triangles, i.e. lines along which one of the components has a constant value (Fig. 2c). Such triangles are represented by convex, roughly hexagonal shapes in the log-ratio diagram (Fig. 2d). The transformation of fixed-percentage lines reveals another property of the log-ratio diagram: the distance between two lines with values 0.1% and 1% is the same as the distance between the 1% and 10% lines. This geometric scale is a natural result of the logarithmic transformation, and indicates that the log-ratio diagram is much more sensitive to compositional differences in the areas near the edges of the ternary diagram. The opposite holds for areas in the centre of the ternary diagram, as demonstrated by the distances between the 10%, 20% and 30% fixed-percentage lines (Figs 2c and d). Figure 2e shows a subdivision of the ternary QFL diagram into six equal fields. Each field has been labelled according to the most abundant component (uppercase) and the second-most abundant component (lowercase). This straightforward classification of sands comprises the following types (clockwise from Q vertex): Quartzolithic (Ql), Lithoquartzose (Lq), Lithofeldspathic (Lf),
Fig. 2. Exploration of log-ratio space corresponding to the ternary diagram. (a, b) Straight lines corresponding to fixed (log-)ratios; (c, d) minimum-percentage contours; (e, f) sixfold subdivision of compositional space into Quartzolithic (Ql), Lithoquartzose (Lq), Lithofeldspathic (Lf), Feldspatholithic (Fl), Feldspathoquartzose (Fq) and Quartzofeldspathic (Qf) sands. Dashed line is 0.07% contour (see text for discussion). [Panels (a), (c), (e): ternary diagrams with vertices Q, F and L. Panels (b), (d), (f): log-ratio diagrams with horizontal axis ln(Q/F), both axes ranging from -8 to 8.]
84
G.J. WELTJE
Feldspatholithic (F1), Feldspathoquartzose (Fq) and Quartzofeldspathic (Qf). These sand types also occupy similar-sized areas in the part of the logratio diagram enclosed by the dashed line (Fig. 2f). This line corresponds to the outer limit of log-ratio space one expects to be covered by composition estimates derived from point counting. The reason for this is that no zero component abundances are allowed in log-ratios (division by zero is not permitted and the logarithm of zero is undefined). If point counting results in zero abundance of one or two components, one therefore has to assume that these zeros reflect sampling error. In other words, some components are present in the population in trace amounts only and have not been observed during point counting. Such zeros must be replaced by statistically acceptable positive values before log-ratio transformation (Aitchison 1986; Weltje 2002; Mart/n-Fern~indez et al. 2003). The fixed-percentage line in Figure 2f was calculated by replacing the binary compositions located on the edges of the ternary diagram, i.e. compositions with one zero value, by a ternary composition in which the zero value was replaced by 6 = 0.07%. The non-zero components were multiplied by a factor ( 1 0 0 - 6)/100. The value of 6 was obtained from the binomial formula by solving for the case in which an analyst who counts 1000 points fails to record a rare component and assumes the probability of failure to be 50%. In other words, the analyst assumes that the unsampled component is so rare it will be recorded only in half of the point counts of this length, if the procedure was repeated many times. The number of points counted by this hypothetical analyst is much larger than customary in sedimentary petrology, which implies that the values of air-transformed point-count results are expected to fall within the interval [ - 8 ; +8]. 
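As a rough numerical check on the replacement value of 0.07% and the interval [-8; +8] quoted above, the binomial argument can be written out in a few lines. This is only a sketch: the function names and the example edge composition are illustrative, and the bound on the log-ratios assumes that no component falls below the replacement value after zero replacement.

```python
import math

def binomial_delta(points, p_miss):
    """Abundance (in %) of a component that goes unrecorded with
    probability p_miss in a count of `points` grains."""
    # Solve (1 - p)**points = p_miss for the proportion p.
    return 100.0 * (1.0 - p_miss ** (1.0 / points))

def replace_single_zero(comp, delta):
    """Replace the zero part by delta and rescale the others by (100 - delta)/100."""
    factor = (100.0 - delta) / 100.0
    return [delta if x == 0 else x * factor for x in comp]

def alr(q, f, l):
    """Log-ratios with Q as common numerator: ln(Q/F), ln(Q/L)."""
    return math.log(q / f), math.log(q / l)

delta = binomial_delta(points=1000, p_miss=0.5)
edge = replace_single_zero([60.0, 40.0, 0.0], delta)  # hypothetical Q-F edge composition
x, y = alr(*edge)
# Largest log-ratio attainable when no component falls below delta (assumption)
max_log_ratio = math.log((100.0 - 2.0 * delta) / delta)
print(round(delta, 4), round(x, 3), round(y, 3), round(max_log_ratio, 3))
```

Under these assumptions the replacement value rounds to 0.07% and the largest attainable log-ratio stays below 8 in absolute value, in line with the interval given in the text.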
In addition, many compositions are likely to plot along the main diagonal of the log-ratio diagram in view of the intrinsic correlation between the two log-ratios, which share one component (in this case Q has been used as a common numerator). The left-hand side of Figure 3 shows the ternary diagrams of the DM with the first-order provenance fields proposed by Dickinson et al. (1983). The QFL diagram (Fig. 3a) contains three fields in which sands of continental block (A), magmatic arc (B) and recycled orogen (C) provenance are expected to plot. The QmFLt diagram (Fig. 3c) contains an additional field reserved for sands of mixed provenance (M). The diagrams on the right-hand side (Figs 3b, d) show the same fields in log-ratio space. Dickinson (1985) referred to these fields as 'provisional' and 'nominal' and stated they correspond to 'actual reported distributions of mean
detrital modes'. The criteria used to establish these provenance fields were not explicitly stated and it is not clear why a mixed-provenance field is only present in the QmFLt diagram. The main purpose of this study is to define an optimal subdivision of compositional space according to statistical criteria, based on the same data that were used to establish the DM, and compare the resulting provenance fields to those proposed by Dickinson et al. (1983). In addition, the inferential success ratio of the optimized DM will be determined.
Material Acquisition
The database used in this study was assembled from three datasets compiled by Dickinson and coworkers (Dickinson & Suczek 1979; Dickinson 1982; Dickinson et al. 1983) that represent the foundation of the DM (Dickinson 1985, 1988). The paper copies were scanned and digitized by means of OCR software and carefully checked for digitization errors. This resulted in an initial (raw) database of 385 records, each of which comprised the following fields: (a) the sample code; (b) up to four mean compositions of ternary subsets of compositional variables (see Table 1); (c) the sample size n, i.e. the number of observations used to calculate the means (samples are often referred to as 'suites' by petrographers); (d) the inferred plate-tectonic setting of the sedimentary basin (see Table 1); (e) a short description of the lithostratigraphic unit, location and/or age of the deposit; (f) the data source (author, year of publication). A few typographic errors were detected in the ternary compositions, which were corrected by checking their internal consistency (using the interdependence of some ternary compositions, see Table 1) and by comparing the tabulated compositions with the corresponding ternary graphs.

Inspection of the raw database showed that not all of the records were unique. Several records appeared in more than one study, either as exact replicates or with modifications to calculated ternary compositions or inferred plate-tectonic setting. Such replicates were identified by comparing the data sources and the descriptions of the records from each of the three datasets. They were treated in various ways, depending on the nature of the redundancy between records.

• If records were identical, the oldest was retained.
• The most complete version of two fully overlapping records was retained.
• The latest version of a record was retained if corresponding ternary compositions appeared to have been recalculated or if the inferred provenance type had changed.
• The latest version of a set of records was retained in the case where different records referred to a common data source (for instance if a sample as well as its subsets were reported in different studies).
• The redundancy in records with a common data source, but containing compositions of different ternary subsets of variables, was removed by deleting overlapping subsets.

This operation reduced the total number of records by 54, leaving a database of 331 records in which the four ternary subcompositions are represented by 309 (QFL), 267 (QmFLt), 101 (QpLvLs) and 100 (QmPK) records, respectively. Together these records represent 11000 thin sections point-counted by hundreds of operators over a period of three decades. Because very few records are complete, statistical analyses are limited to studies of the variation within each of the four ternary systems separately.

[Figure 3: ternary diagrams (a, c) and corresponding log-ratio diagrams (b, d); horizontal axes ln(Q/F) (b) and ln(Qm/F) (d), axis range -8 to 8.]
Fig. 3. Subdivision of ternary spaces into provenance fields according to Dickinson et al. (1983). For legend to provenance associations and ternary systems see Table 1. (a, b) QFL system; (c, d) QmFLt system.

Pre-processing
Statistical analysis of the DM database requires some pre-processing to replace missing or truncated data by appropriate values. Where sample size n was missing from the records, its most likely value was estimated from the overall distribution of sample size (Fig. 4), which is approximately log-normal. The median value of sample size (n50 = 18) was substituted in 14 records.
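The substitution of missing sample sizes can be sketched as follows. The sample sizes listed here are hypothetical and chosen only so that their median equals 18; the helper name is illustrative. For a log-normal distribution the median coincides with the geometric mean, which is why the second quantity is printed for comparison.

```python
import math
import statistics

def impute_missing_n(sizes):
    """Replace missing sample sizes (None) by the median of the known ones."""
    known = [n for n in sizes if n is not None]
    median_n = statistics.median(known)
    return [median_n if n is None else n for n in sizes], median_n

# Hypothetical sample sizes; None marks a record without a reported n.
sizes = [1, 4, 9, 18, None, 36, 81, 324, None]
filled, median_n = impute_missing_n(sizes)

# Geometric mean of the known values, for comparison with the median.
known_logs = [math.log(n) for n in sizes if n is not None]
geo_mean = math.exp(sum(known_logs) / len(known_logs))
print(median_n, round(geo_mean, 3), filled)
```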
[Figure 4: distribution of sample size n in the database; vertical axis P from 0.0 to 1.0, horizontal axis n.]
Fig. 4. Distribution of sample size in the database is effectively log-normal and spans almost three orders of magnitude. Median sample size equals 18.

The problem of dealing with zeros in log-ratio analysis, discussed above, also applies to the DM database. However, since each record in the DM database is an average of a series of point counts of unknown length that has been rounded to the nearest integer, it is not immediately clear which zero-replacement value to choose. An upper limit on the replacement value may be derived from the notion that it should not exceed the smallest positive number actually recorded (the 'detection limit'). This upper limit, which was set at 0.9%, represents the replacement value that transforms the ternary composition (99, 1, 0) into the composition (98.11, 0.99, 0.90). It seemed appropriate that a zero in an average composition calculated from a large sample should be replaced by a different value than a zero that occurs in a single point count, which prompted the introduction of a weighting scheme for zero replacement based on n. In view of the several-orders-of-magnitude range of n (Fig. 4) the following definition of the replacement value δ (in %) was adopted:

$$\delta = \frac{9.9}{10 + \sqrt{n}} \qquad (4)$$

Equation (4) provides the desired upper limit of δ = 0.9% for the case n = 1 and smaller values for larger samples. Compositions with one or two zeros were recalculated to 100% by replacement of zero component(s) by δ and multiplication of non-zero component(s) by a factor (100 - Nδ)/100, where N equals the number of zeros in the original composition. This zero-replacement strategy is consistent with the multiplicative method advocated by Martín-Fernández et al. (2003).

Analysis of the DM database

Sources of uncertainty
Statistical analyses should take into account the level of noise in the data as well as possible effects of systematic deviations from 'true' values. Many sources of error can be distinguished in the multistage data-acquisition procedure involved in the construction of the DM database. Each record in the DM database is made up of one or more ternary subcompositions of sandstone calculated by arithmetic averaging of a series of n specimens collected by a single analyst. No information is provided on the spread of values about the means or their covariance, or on parameters of the data-acquisition scheme such as the spatial extent of the sampling programme, the volumes of the samples, pre-measurement laboratory treatments of samples, the point-counting conventions, the number of grains counted and the (spread in) grain size of the sands analysed. Although all of these factors influence the degree to which the mean composition may be considered representative of the lithosome studied (Weltje 2002, 2004), they cannot be taken into account without going back to the original data sources. Other sources of uncertainty play a role when it comes to assessing the integrity of the database as a whole. Potential sources of bias are the uneven spread of data in a geographical and/or stratigraphical sense and errors in assignment of inferred plate-tectonic setting. The lack of standardized data-acquisition methodology could introduce all sorts of bias into the results of statistical analyses of such heterogeneous data, but the net result of all these systematic deviations from an unknown 'truth' could equally well be indistinguishable from random error.
The following assumptions appear to be reasonable in the absence of any other information:

• geographical and stratigraphical coverage of the DM database is sufficiently representative to allow inferences about sand(stone) composition in relation to global tectonics;
• no significant bias is introduced by possible errors in assignment of plate-tectonic settings;
• no significant bias is introduced by failures to recognize post-depositional (diagenetic) modifications to detrital framework grains;
• possible systematic errors do not invalidate the results of the statistical analysis, because they are indistinguishable from random error;
• the uncertainty of all data-acquisition parameters being equal, the magnitude of random errors in composition estimates is inversely proportional to the square root of sample size n.
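Sample size n enters the analysis in two places: the last assumption above and the zero-replacement rule of Equation (4) used in pre-processing. The following minimal sketch reproduces the worked example given with Equation (4), in which (99, 1, 0) becomes (98.11, 0.99, 0.90) for n = 1. The function names are illustrative and not taken from the original study.

```python
import math

def delta_percent(n):
    """Replacement value of Equation (4), in percent, for a mean of n specimens."""
    return 9.9 / (10.0 + math.sqrt(n))

def replace_zeros(comp, n):
    """Multiplicative zero replacement: zeros become delta and the non-zero
    parts are multiplied by (100 - N*delta)/100, N being the number of zeros."""
    delta = delta_percent(n)
    n_zeros = sum(1 for x in comp if x == 0)
    factor = (100.0 - n_zeros * delta) / 100.0
    return [delta if x == 0 else x * factor for x in comp]

single = replace_zeros([99.0, 1.0, 0.0], n=1)    # worked example from the text
double = replace_zeros([100.0, 0.0, 0.0], n=18)  # two zeros, median sample size
print([round(x, 2) for x in single])
print([round(x, 2) for x in double], round(sum(double), 6))
```

Because the non-zero parts are rescaled by a common factor, their mutual ratios are unchanged by the replacement, and the total remains 100%.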
Methods

As discussed above, very few records are complete, which implies that the four ternary subcompositions had to be analysed separately because the full six-part compositions (Qm, Qp, P, K, Lv, Ls) could not be reconstructed. The following analysis was performed for each of the four alr-transformed ternary systems (each step will be discussed in more detail below).

1. Predictive regions of the population were constructed for each provenance association by a weighted version of the method outlined in Weltje (2002).
2. The compositional space was partitioned by constructing iso-density lines for each pair of predictive distributions in log-ratio space.
3. The grand mean of each provenance association and its 99% confidence region was estimated to provide reference compositions of sands with A, B and C provenance, as well as sand of mixed provenance (corresponding to the iso-density point of the three predictive distributions).
4. Stochastic simulation of compositions from each of the three predictive distributions was carried out to assess the efficiency of the iso-density partitioning.
The first part of the analysis closely follows the method outlined by Weltje (2002) for the construction of predictive regions based on the assumption of additive logistic normality. The main difference between the standard case of constructing a predictive distribution from a set of data points and the present application is that each ternary composition is itself a sample (average) of n observations. The mean vector and sample covariance matrix of each alr-transformed set belonging to a provenance association must therefore be calculated by a weighted method. If the original data had been available instead of a series of averages, each observation would have had equal weight (assuming that other data-acquisition parameters do not differ much between observations), indicating that the averaging effect should be modelled by assigning a weight of n to each sample. However, indiscriminate use of this linear weighting scheme may cause problems since values of n vary by more than two orders of magnitude, so that the estimated parameters of predictive distributions would be heavily influenced by the compositions of a few large samples, which is not desirable, given the possibility of systematic errors in the data. In addition, the smallest samples (n = 1) are all from the river mouths of major river systems whose sands have been thoroughly mixed in large drainage basins, indicating that their influence on the parameters of the predictive distribution should be larger than sample size suggests (cf. Ingersoll 1990). In view of these considerations, it was decided to employ a scheme in which the weights assigned to each record equal √n. The sum of the weights within each provenance group was used as an estimate of the number of degrees of freedom used in the calculation of the parameters of the predictive distribution. One can think of these degrees of freedom as an effective sample size that encompasses all the sources of uncertainty listed above. The predictive distributions constructed in this way are considered faithful representations of the heterogeneous dataset.

In the second stage of the analysis, the three predictive distributions (of the A, B and C associations) were plotted together and iso-density lines were constructed for each pair of distributions to provide an optimal partitioning of log-ratio space into provenance fields. The rationale behind this partitioning method is the notion that probability densities relative to each of the provenance associations A, B and C vary continuously in compositional space. In each of the three fields that correspond to provenance association A, B or C, the probability density relative to the parent distribution should always exceed the probability densities relative to the other two distributions. The boundary between two provenance fields is thus an iso-density line, i.e. a set of compositions at which the probability densities relative to both distributions are identical (but not constant).
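The weighted estimation of the mean vector and covariance matrix described above can be sketched as follows. The text does not spell out the estimator, so the form used here (weights equal to the square root of n, normalisation of the covariance by the sum of weights minus one) is an assumption, and the four records are hypothetical.

```python
import math

def alr(q, f, l):
    """Log-ratios with Q as common numerator: ln(Q/F), ln(Q/L)."""
    return (math.log(q / f), math.log(q / l))

def weighted_mean_cov(points, sizes):
    """Weighted mean vector and 2x2 covariance matrix of alr points.
    Weights are sqrt(n); their sum acts as the effective sample size."""
    weights = [math.sqrt(n) for n in sizes]
    dof = sum(weights)
    mean = [sum(w * p[k] for w, p in zip(weights, points)) / dof
            for k in range(2)]
    cov = [[0.0, 0.0], [0.0, 0.0]]
    for w, p in zip(weights, points):
        d = (p[0] - mean[0], p[1] - mean[1])
        for i in range(2):
            for j in range(2):
                # Assumed normalisation: sum of weights minus one
                cov[i][j] += w * d[i] * d[j] / (dof - 1.0)
    return mean, cov, dof

# Hypothetical records: (Q, F, L) in percent and sample size n.
records = [((88, 10, 2), 1), ((80, 15, 5), 16), ((92, 6, 2), 4), ((85, 10, 5), 9)]
points = [alr(*comp) for comp, _ in records]
sizes = [n for _, n in records]
mean, cov, dof = weighted_mean_cov(points, sizes)
print(mean, cov, dof)
```

With all sample sizes equal to one, this reduces to the ordinary sample mean and covariance, which is a convenient sanity check on the weighting.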
The three iso-density lines coincide at the point in compositional space where the probability densities relative to each of the three parent distributions are identical: an iso-density point. This partitioning maximizes the probability that a sample mean of a series of sandstones with unknown provenance is classified correctly.

The results of the analysis were summarized in terms of the vector means of the provenance associations and their associated 99% confidence regions in each of the ternary systems. The composition corresponding to the iso-density point relative to the A, B and C associations was also calculated and presented as a typical sand of mixed provenance.

The final step in the analysis was a stochastic simulation exercise designed to quantify the overall probabilities associated with the empirical classification. A series of 10000 pseudorandom numbers was generated from each predictive distribution with the Box-Muller algorithm (Press et al. 1994) and the probability densities of these data points with respect to each of the A, B and C distributions were calculated. Each data point was then classified in terms of its actual parent distribution and the distribution for which it has the highest probability density. Probability densities were calculated by a method derived from Barceló et al. (1996). The result of this stochastic simulation was a 3 × 3 frequency table for every ternary system, based on 30000 synthetic data points, from which the desired probabilities of (mis)classification were calculated.

Results
Figures 5-8 show the distributions of provenance associations A, B and C in the four ternary systems in the form of predictive regions of 50%, 90% and 99% content plotted together with the records of the DM database. The log-ratio diagrams are shown on the left-hand side of these figures, the corresponding ternary diagrams on the right-hand side. The log-ratio diagrams suggest that the additive logistic normal distribution fits most datasets reasonably well, especially the QFL and QmFLt compositions of provenance associations A and C (Figs 5a, e, 6a, e). The QmPK and QpLvLs compositions of these associations (Figs 7a, e, 8a, e) are not well constrained, owing to the limited number of data points. However, the additive logistic normal distribution shows a lack of fit to the data points of provenance association B (Figs 5c, 6c, 7c, 8c), which suggests the presence of multiple outliers and/or several distinct sub-populations within this association. Formal evaluation of the goodness-of-fit of the predictive distributions by means of appropriate normality tests (Aitchison 1986; Pawlowsky-Glahn & Buccianti 2002) was not attempted. The moderate lack of fit displayed by the data of provenance association B merits further investigation, but is not considered a matter of great concern in the present study, which is mainly devoted to the construction of provenance fields according to a reproducible method. Geologically sound explanations for the apparent deviation of association B from the simple additive logistic normal model require detailed examination of the original data, which is beyond the scope of this study. A purely statistical approach to the lack-of-fit problem involving operations such as outlier detection and removal, alternative data transformations (cf. Barceló et al. 1996) and/or the invocation of other classes of distributions would not improve the geological viability of the DM.
Furthermore, the choice of an alternative model for the distribution of data points belonging to provenance association B will affect the partitioning of
compositional space into provenance fields, but such shifts in the location of provenance fields are likely to be quite small. Figures 9-12 illustrate the construction of the three provenance fields in each of the four ternary systems. The upper rows of Figures 9-12 show the predictive regions of 50% and 90% content for each provenance association. The iso-density boundaries between each pair of partly overlapping distributions are displayed in the bottom rows. Triple junctions of iso-density boundaries correspond to compositions with equal probabilities of belonging to one of the provenance associations. Such compositions are regarded as typical examples of mixed provenance. Also plotted in the bottom rows of Figures 9-12 are the 99% confidence regions of the grand means of each of the provenance associations. The small size of these confidence regions indicates that the grand means are well constrained by the large amount of data. The fact that they do not overlap implies that compositional differences between grand means are highly significant. Table 2 summarizes the four characteristic compositions in each of the four ternary systems studied. The results of the stochastic simulation (Table 3) are a set of probabilities associated with the inference of provenance from a ternary (sub)composition of sand(stone). For instance, if a sample mean plots in field A of the QFL diagram, the probability that its actual provenance is A is 79%. The probability that its actual provenance is C is 20%, and the probability that it is B is only 1%. The overall probability of correct inference in a given ternary system (its success ratio) may be calculated as the average of the probabilities of correct identification of each of the three provenance associations. These numbers equal 76% for QFL, 74% for QmFLt, 64% for QmPK and 78% for QpLvLs.
They apply to the iso-density classification presented above; the original subdivision of ternary space in the DM as presented by Dickinson et al. (1983) would have given less favourable results.
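The simulation and the success ratio described above can be illustrated with a small, self-contained sketch. The means and covariance matrices below are hypothetical alr-space parameters, not the fitted values of the DM, and the function names are illustrative. Points are drawn with a Box-Muller transform, assigned to the distribution with the highest density (which is equivalent to the iso-density partitioning), and tallied in a 3 × 3 table.

```python
import math
import random

def box_muller(rng):
    """Two independent standard normal deviates from two uniform deviates."""
    u1 = 1.0 - rng.random()  # lies in (0, 1], so log(u1) is defined
    u2 = rng.random()
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)

def sample(mean, cov, rng):
    """One draw from a bivariate normal, using a 2x2 Cholesky factor."""
    l11 = math.sqrt(cov[0][0])
    l21 = cov[0][1] / l11
    l22 = math.sqrt(cov[1][1] - l21 * l21)
    z1, z2 = box_muller(rng)
    return (mean[0] + l11 * z1, mean[1] + l21 * z1 + l22 * z2)

def log_density(x, mean, cov):
    """Log of the bivariate normal density at x."""
    det = cov[0][0] * cov[1][1] - cov[0][1] ** 2
    d0, d1 = x[0] - mean[0], x[1] - mean[1]
    maha = (cov[1][1] * d0 * d0 - 2.0 * cov[0][1] * d0 * d1
            + cov[0][0] * d1 * d1) / det
    return -0.5 * maha - 0.5 * math.log(det) - math.log(2.0 * math.pi)

def classify(x, models):
    """Field of x = distribution with the highest probability density."""
    return max(models, key=lambda name: log_density(x, *models[name]))

def simulate(models, draws, seed=0):
    """Frequency table counts[actual][inferred]."""
    rng = random.Random(seed)
    names = list(models)
    counts = {a: {i: 0 for i in names} for a in names}
    for actual in names:
        mean, cov = models[actual]
        for _ in range(draws):
            counts[actual][classify(sample(mean, cov, rng), models)] += 1
    return counts

def success_ratio(counts):
    """Average, over fields, of P(actual = X | inferred = X)."""
    names = list(counts)
    correct = []
    for inferred in names:
        total = sum(counts[a][inferred] for a in names)
        correct.append(counts[inferred][inferred] / total)
    return sum(correct) / len(correct)

# Hypothetical (mean, covariance) pairs in alr space; NOT the fitted DM values.
models = {
    "A": ((2.0, 3.5), [[0.8, 0.3], [0.3, 1.0]]),
    "B": ((-0.5, -0.5), [[0.9, 0.4], [0.4, 0.9]]),
    "C": ((2.5, 1.5), [[1.0, 0.5], [0.5, 0.8]]),
}
counts = simulate(models, draws=2000)
ratio = success_ratio(counts)

# QFL block of Table 3, read as counts[actual][inferred] in percent.
table3_qfl = {"A": {"A": 79, "B": 1, "C": 16},
              "B": {"A": 1, "B": 87, "C": 21},
              "C": {"A": 20, "B": 12, "C": 63}}
print(counts, round(ratio, 3), round(100 * success_ratio(table3_qfl)))
```

Applied to the QFL block of Table 3, the same averaging of the three diagonal entries gives the 76% success ratio quoted in the text.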
Discussion and conclusions

Most of the conclusions and points of discussion that emerge from this study are methodological as well as geological. However, one aspect of the compositional data analysis presented in this study is of purely methodological interest. The iso-density partitioning is based on the notion that each point in compositional space may be associated with a vector of relative probability densities, which can itself be regarded as a composition. In the examples presented, both compositional spaces are of the same dimensionality, but this need not be the case. An efficient method to capture the relation between these compositional spaces would be extremely useful in further iso-density partitioning experiments.

[Figure 5: log-ratio diagrams (a, c, e), horizontal axis ln(Q/F), axis range -8 to 8; corresponding QFL ternary diagrams (b, d, f).]
Fig. 5. Predictive distributions for each provenance association in QFL space, represented by regions of 50%, 90% and 99% content. (a, b) Continental block provenance; (c, d) magmatic arc provenance; (e, f) recycled orogen provenance.

[Figure 6: log-ratio diagrams (a, c, e), horizontal axis ln(Qm/F), axis range -8 to 8; corresponding QmFLt ternary diagrams (b, d, f).]
Fig. 6. Predictive distributions for each provenance association in QmFLt space, represented by regions of 50%, 90% and 99% content. (a, b) Continental block provenance; (c, d) magmatic arc provenance; (e, f) recycled orogen provenance.

[Figure 7: log-ratio diagrams (a, c, e), horizontal axis ln(Qm/P), axis range -8 to 8; corresponding QmPK ternary diagrams (b, d, f).]
Fig. 7. Predictive distributions for each provenance association in QmPK space, represented by regions of 50%, 90% and 99% content. (a, b) Continental block provenance; (c, d) magmatic arc provenance; (e, f) recycled orogen provenance.

[Figure 8: log-ratio diagrams (a, c, e), horizontal axis ln(Qp/Lv), axis range -8 to 8; corresponding QpLvLs ternary diagrams (b, d, f).]
Fig. 8. Predictive distributions for each provenance association in QpLvLs space, represented by regions of 50%, 90% and 99% content. (a, b) Continental block provenance; (c, d) magmatic arc provenance; (e, f) recycled orogen provenance.

[Figure 9: log-ratio diagrams (a, c), horizontal axis ln(Q/F), axis range -8 to 8; corresponding QFL ternary diagrams (b, d).]
Fig. 9. Construction of provenance fields in QFL space. (a, b) Predictive distributions of the three provenance associations (regions of 50% and 90% content) plotted together; (c, d) iso-density partitioning of compositional space into provenance fields. Also shown are 99% confidence regions of population means of the three associations. See Table 1 for legend to provenance associations A, B and C, and Table 2 for ternary population means.
Overview of the optimized DM

Statistical analysis of the DM database has permitted an evaluation of the strengths and weaknesses of this popular plate-tectonic provenance model. The three fundamental provenance associations (continental block, magmatic arc and recycled orogen) are significantly different, as demonstrated by the 99% confidence regions of their grand means in each of the four ternary spaces studied. However, the predictive distributions of the populations display considerable overlap, which indicates that inference of the correct provenance from composition alone is not straightforward. The iso-density partitioning resulted in provenance fields that differ considerably from those proposed by Dickinson et al. (1983). The inferential success ratio associated with the optimized subdivision of compositional space into provenance fields is around 75%. The QpLvLs subcomposition has the highest overall success ratio (78%), followed closely by the QFL and QmFLt compositions, which are essentially equally powerful provenance tools with a success ratio of around 75%. The QmPK subcomposition, with its low overall success ratio of 64%, does not appear to hold much promise for provenance discrimination. The success ratios of the optimized DM suggest that Molinaroli et al. (1991) overestimated its discriminatory power by testing it with the same data used to establish their set of discriminant functions. Strategies to extend and further improve the DM are discussed below.

[Figure 10: log-ratio diagrams (a, c), horizontal axis ln(Qm/F), axis range -8 to 8; corresponding QmFLt ternary diagrams (b, d).]
Fig. 10. Construction of provenance fields in QmFLt space. (a, b) Predictive distributions of the three provenance associations (regions of 50% and 90% content) plotted together; (c, d) iso-density partitioning of compositional space into provenance fields. Also shown are 99% confidence regions of population means of the three associations. See Table 1 for legend to provenance associations A, B and C, and Table 2 for ternary population means.
[Figure 11: log-ratio diagrams (a, c), horizontal axis ln(Qm/P), axis range -8 to 8; corresponding QmPK ternary diagrams (b, d).]
Fig. 11. Construction of provenance fields in QmPK space. (a, b) Predictive distributions of the three provenance associations (regions of 50% and 90% content) plotted together; (c, d) iso-density partitioning of compositional space into provenance fields. Also shown are 99% confidence regions of population means of the three associations. See Table 1 for legend to provenance associations A, B and C, and Table 2 for ternary population means.

Increasing the dimensionality

One of the major shortcomings of the database underlying the DM is that very few records are complete, which limits statistical analysis to a separate description of compositional variation within the four ternary systems. The partial view on compositional variability obtained by analysing these amalgamations and subcompositions of the full six-part composition (Qm, Qp, P, K, Lv, Ls) may be insufficient to address the relevant geological problems at hand, as illustrated by the following example. In an early attempt to apply the log-ratio transformation to the DM database, Butler & Woronow (1986) analysed the dataset of Dickinson & Suczek (1979) for the presence of spurious correlations induced by the constant-sum constraint. Their results suggested that the compositional trend of decreasing L relative to Q + F within the magmatic-arc provenance field of Dickinson & Suczek (1979) can be produced by imposing the constant-sum constraint on a set of independent variables, which may indicate that it is the sole result of percentage formation ('closure') and has no geological meaning. This purely statistical interpretation is not very likely, however, because the decrease of L relative to Q + F, which defines the main trend of arc dissection by erosion, is usually accompanied by a decrease of P relative to K (W. R. Dickinson, pers. comm. 1994). Answers to questions of this kind require analyses of the relationships between log(P/K) on the one hand, and log(Q/L), log(F/L), or log{(Q + F)/L} on the other hand. This example indicates that an empirical provenance model built on a database of six-part compositions (Qm, Qp, P, K, Lv, Ls) is much more powerful than a series of models based on ternary (sub)compositions only. Full six-part compositions of individual specimens should be reported in future studies intended to contribute to a new database for a second-generation provenance model (and not only their mean six-part composition).
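The log-ratios named above can only be related to one another when the full six-part composition of each specimen is available, as the following sketch illustrates. The four specimens are hypothetical, constructed to mimic an arc-dissection trend, and the amalgamations Q = Qm + Qp, F = P + K and L = Lv + Ls are assumed here because Table 1 is not reproduced in this section. Natural logarithms are used throughout.

```python
import math

def log_ratios(c):
    """Log-ratios of a six-part composition given as a dict of positive parts."""
    q = c["Qm"] + c["Qp"]   # assumed amalgamation of quartzose grains
    f = c["P"] + c["K"]     # assumed amalgamation of feldspar grains
    l = c["Lv"] + c["Ls"]   # assumed amalgamation of lithic fragments
    return {
        "log(P/K)": math.log(c["P"] / c["K"]),
        "log(Q/L)": math.log(q / l),
        "log(F/L)": math.log(f / l),
        "log((Q+F)/L)": math.log((q + f) / l),
    }

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical specimens (percent), ordered from undissected to dissected arc.
specimens = [
    {"Qm": 10, "Qp": 2, "P": 30, "K": 3, "Lv": 50, "Ls": 5},
    {"Qm": 20, "Qp": 3, "P": 30, "K": 7, "Lv": 35, "Ls": 5},
    {"Qm": 30, "Qp": 3, "P": 27, "K": 12, "Lv": 23, "Ls": 5},
    {"Qm": 40, "Qp": 2, "P": 22, "K": 18, "Lv": 13, "Ls": 5},
]
ratios = [log_ratios(s) for s in specimens]
r = pearson([x["log(P/K)"] for x in ratios],
            [x["log((Q+F)/L)"] for x in ratios])
print(round(r, 3))
```

In this constructed example log(P/K) falls as log{(Q + F)/L} rises, so the correlation is strongly negative. Whether real arc-derived sands show the same relation is exactly the kind of question that requires six-part data.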
Sands of mixed provenance The DM was designed for classification of means of sandstone suites only. Erroneous interpretations
96
G.J. WELTJE 8
Qp
6 4 ~" 2
80 -~-e -4 -6
-g -8
-6
-4
-2
(a)
0
2
4
6
8
In(ap/Lv)
Lv
Ls
(b)
Qp 6 .
4
i
2 "~ 0 c-2
-4 -6 B
-8 -8 (C)
-6
-4
-2
0
2
4
6
8
In(Qp/Lv)
Lv
Ls
(d)
Fig. 12. Construction of provenance fields in QpLvLs space. (a, b) Predictive distributions of the three provenance associations (regions of 50% and 90% content) plotted together; (c, d) iso-density partitioning of compositional space into provenance fields. Also shown are 99% confidence regions of population means of the three associations. See Table 1 for legend to provenance associations A, B and C, and Table 2 for ternary population means.
may result if local provenance signals in the data have not been suppressed by spatial averaging of sandstone compositions (Ingersoll 1990; Ingersoll et al. 1993; Critelli et al. 1997). This averaging approach, which may be viewed as a way of artificially mixing local provenance signals, has the distinct advantage of robustness but implies a limited spatial and temporal resolution of the DM. It seems fair to state that many of the records in the DM database are not 'pure' sands of a single provenance, but contain varying admixtures of sands of different provenances. Analyses of modern deepsea sands (Valloni 1985) and reviews of global dispersal systems (Dickinson 1988) indicate that sands
of mixed provenance are extremely common. It is therefore not surprising that many sand suites plot in the mixed provenance field of the original QmFLt diagram (Dickinson 1985, 1988). Given these restrictions, success ratios of empirical provenance models based on averaging of ternary compositions are not likely to exceed those of the optimized DM presented in this study. Differences between the original and the revised DM are not limited to the locations of the boundaries between provenance fields. In the present analysis, each of the ternary systems was treated in the same way, in contrast with the method of Dickinson et al. (1983), who introduced a separate
Table 2. Grand means of provenance associations and typical compositions of mixed provenance (legend in Table 1)

              A     B     C     M
Q (%)        88    25    78    57
F (%)        10    37     6    29
L (%)         2    38    16    14

              A     B     C     M
Qm (%)       84    22    59    52
F (%)        10    37     6    25
Lt (%)        6    41    35    23

              A     B     C     M
Qm (%)       86    38    87    60
P (%)         5    53     9    27
K (%)         9     9     4    13

              A     B     C     M
Qp (%)       59     5    40    22
Lv (%)       30    77     5    28
Ls (%)       11    18    55    50
field for sands of mixed provenance in the QmFLt system only (Fig. 3c). The provisional solution adopted in the present study is to regard the compositions at the triple junction of the three iso-density lines in each of the ternary systems as typical examples of mixed provenance. It should be noted that any observation located in an area of compositional space where two or more distributions overlap one another is difficult to
Table 3. Probabilities of inferring provenance from ternary composition (legend in Table 1)

Actual provenance    Inferred A    Inferred B    Inferred C

QFL
A (%)                    79             1            16
B (%)                     1            87            21
C (%)                    20            12            63

QmFLt
A (%)                    79             2            13
B (%)                     4            81            24
C (%)                    17            17            63

QmPK
A (%)                    59            16            34
B (%)                    12            72             6
C (%)                    29            12            60

QpLvLs
A (%)                    75             7            17
B (%)                    13            81             4
C (%)                    12            12            79
interpret without additional information. On the one hand, such sand could have been derived from one of these distributions exclusively (the point of view adopted in this study), but on the other hand, it could represent a mixture of sands from two or more of these distributions. The overlap between distributions causes the range of potential scenarios to be infinite and impossible to constrain without taking into account additional information about the palaeogeography of the area from which the sands were derived. Additional complications may arise from variability of sediment composition due to past climate change or the presence of diagenetic gradients across basins, both of which are essentially unrelated to provenance sensu stricto, i.e. the composition and texture of parent rocks (Johnsson 1993; Weltje & Von Eynatten 2004). The very fact that such information appears to be required contradicts the basic premise of the DM, i.e. provenance of sands may be inferred from composition alone. The above considerations imply that sands of mixed provenance cannot be interpreted by 'averaging out' all variability. On the contrary, attention must be devoted to the development of methods to exploit the information contained within the covariance structure of compositional data. Ingersoll (1990), Ingersoll et al. (1993) and Critelli et al. (1997) noted a systematic decrease in the variance of mean sand composition with increasing spatial scales of dispersal systems, a phenomenon further explored by Weltje (2004). The covariance structure of compositional data is an essential tool of quantitative provenance analysis, which permits sands of mixed provenance to be statistically 'unmixed' into end-member provenance associations (Weltje 1997). The end-member mixing model allows one to address the issue of mixed provenance in a systematic and quantitative way - thereby providing insights that could never have been obtained by plotting arithmetic means in ternary diagrams. 
This approach also requires full six-part compositions of individual specimens rather than sets of disjointed three-part means. Weltje (1995) provides an example of end-member modelling of a suite of mixed-provenance sands from the Italian Alps and Apennines.

Conclusions

The DM in its present form - as a series of four separate ternary diagrams - should be recognized for what it is: an exploration tool designed to infer the large-scale tectonic setting of sediment-dispersal systems in the distant past and/or any remaining frontier areas of our planet. The DM deliberately bypasses all the details of the sediment-forming processes. This approach guarantees robustness
but does not permit a meaningful analysis of mixing, identified as a factor of overriding importance in sediment generation. The DM is a successful exploration tool, but it does not lend itself easily to other applications, such as regional studies of multi-sourced basin fills. Traditional provenance models which are aimed at inferring source-area characteristics from sediment properties could be improved greatly by incorporation of quantified knowledge about the processes that govern sediment generation. Prediction of sediment composition from properties of drainage basins (cf. Ibbeken & Schleyer 1991) through development of modelling tools to address sediment generation is an area of active research (Basu 2003; Weltje & Von Eynatten 2004). The capability to integrate modelling efforts, measurements of process rates under laboratory and field conditions, and analyses of comprehensive and well-documented compositional datasets will ultimately determine the rate of progress in provenance analysis. The author thanks J. J. Egozcue for his perceptive review of the manuscript.
References

AITCHISON, J. 1982. The statistical analysis of compositional data (with discussion). Journal of the Royal Statistical Society, Series B, 44, 139-177.
AITCHISON, J. 1986. The Statistical Analysis of Compositional Data. Chapman & Hall, London.
AITCHISON, J. & EGOZCUE, J. J. 2005. Compositional data analysis: where are we and where should we be heading? Mathematical Geology, 37, 829-850.
ARMSTRONG-ALTRIN, J. S. & VERMA, S. P. 2005. Critical evaluation of six tectonic setting discrimination diagrams using geochemical data of Neogene sediments from known tectonic settings. Sedimentary Geology, 177, 115-129.
BARCELO, C., PAWLOWSKY, V. & GRUNSKY, E. 1996. Some aspects of transformations of compositional data and the identification of outliers. Mathematical Geology, 28, 501-518.
BASU, A. 2003. A perspective on quantitative provenance analysis. In: VALLONI, R. & BASU, A. (eds) Quantitative Provenance Studies in Italy. Memorie Descrittive della Carta Geologica dell'Italia, 61, 11-22.
BECKE, F. 1897. Gesteine des Columbretes. Tschermak's Mineralogische und Petrographische Mitteilungen, 16, 308-336.
BUTLER, J. C. 1979. Trends in ternary petrologic variation diagrams - fact or fantasy? American Mineralogist, 64, 1115-1121.
BUTLER, J. C. 1982. The closure problem as reflected in discriminant function analysis. Chemical Geology, 37, 367-375.
BUTLER, J. C. & WORONOW, A. 1986. Extracting genetic information from coarse clastic modes. Computers & Geosciences, 12, 643-652.
CHAYES, F. 1960. On correlation between variables of constant sum. Journal of Geophysical Research, 65, 4185-4193.
CRITELLI, S., LE PERA, E. & INGERSOLL, R. V. 1997. The effects of source lithology, transport, deposition and sampling scale on the composition of southern California sand. Sedimentology, 44, 653-671.
CROOK, K. A. W. 1974. Lithogenesis and geotectonics: the significance of compositional variation in flysch arenites (graywackes). In: DOTT, R. H. JR. & SHAVER, R. H. (eds) Modern and Ancient Geosynclinal Sedimentation. Society of Economic Paleontologists and Mineralogists, Special Publications, 19, 304-310.
DAVIS, J. C. 1997. Statistics and Data Analysis in Geology, 3rd edn. Wiley & Sons, New York.
DICKINSON, W. R. 1982. Compositions of sandstones in Circum-Pacific subduction complexes and forearc basins. American Association of Petroleum Geologists Bulletin, 66, 121-137.
DICKINSON, W. R. 1985. Interpreting provenance relations from detrital modes of sandstones. In: ZUFFA, G. G. (ed.) Provenance of Arenites. North Atlantic Treaty Organization - Advanced Study Institutes (NATO-ASI), Series C, 148, 333-361.
DICKINSON, W. R. 1988. Provenance and sediment dispersal in relation to paleotectonics and paleogeography of sedimentary basins. In: KLEINSPEHN, K. L. & PAOLA, C. (eds) New Perspectives in Basin Analysis. Springer-Verlag, New York, 3-25.
DICKINSON, W. R. & SUCZEK, C. 1979. Plate tectonics and sandstone compositions. American Association of Petroleum Geologists Bulletin, 63, 2164-2182.
DICKINSON, W. R., BEARD, L. S., BRAKENRIDGE, G. R. ET AL. 1983. Provenance of North American Phanerozoic sandstones in relation to tectonic setting. Geological Society of America Bulletin, 94, 222-235, with Supplement (GSA data repository item # 8302).
EGOZCUE, J. J., PAWLOWSKY-GLAHN, V., MATEU-FIGUERAS, G. & BARCELO-VIDAL, C. 2003. Isometric logratio transformations for compositional data analysis. Mathematical Geology, 35, 279-300.
HAUGHTON, P. D. W., TODD, S. P. & MORTON, A. C. 1991. Sedimentary provenance studies. In: MORTON, A. C., TODD, S. P. & HAUGHTON, P. D. W. (eds) Developments in Sedimentary Provenance Studies. Geological Society of London, Special Publications, 57, 1-11.
IBBEKEN, H. & SCHLEYER, R. 1991. Source and Sediment: A Case Study of Provenance and Mass Balance at an Active Plate Margin (Calabria, Southern Italy). Springer-Verlag, Berlin.
INGERSOLL, R. V. 1990. Actualistic sandstone petrofacies: discriminating modern and ancient source rocks. Geology, 18, 733-736.
INGERSOLL, R. V., KRETCHMER, A. G. & VALLES, P. K. 1993. The effect of sampling scale on actualistic sandstone petrofacies. Sedimentology, 40, 937-953.
JOHNSSON, M. J. 1993. The system controlling the composition of clastic sediments. In: JOHNSSON, M. J. & BASU, A. (eds) Processes Controlling the Composition of Clastic Sediments. Geological Society of America, Special Paper, 284, 1-19.
KLEIN, G. D. 1963. Analysis and review of sandstone classifications in the North American geological literature, 1940-1960. Geological Society of America Bulletin, 74, 555-576.
MARTÍN-FERNÁNDEZ, J. A., BARCELO-VIDAL, C. & PAWLOWSKY-GLAHN, V. 2003. Dealing with zeros and missing values in compositional data sets using nonparametric imputation. Mathematical Geology, 35, 253-278.
MOLINAROLI, E., BLOM, M. & BASU, A. 1991. Methods of provenance determination tested with discriminant function analysis. Journal of Sedimentary Petrology, 61, 900-908.
OKADA, H. 1971. Classification of sandstone: analysis and proposal. Journal of Geology, 79, 509-525.
PAWLOWSKY-GLAHN, V. & BUCCIANTI, A. 2002. Visualization and modeling of sub-populations of compositional data: statistical methods illustrated by means of geochemical data from fumarolic fluids. International Journal of Earth Sciences (Geologische Rundschau), 91, 357-368.
PAWLOWSKY-GLAHN, V. & OLEA, R. A. 2004. Geostatistical Analysis of Compositional Data. Oxford University Press, New York.
PRESS, W. H., TEUKOLSKY, S. A., VETTERLING, W. T. & FLANNERY, B. P. 1994. Numerical Recipes in FORTRAN: The Art of Scientific Computing, 2nd edn. Cambridge University Press, Cambridge.
ROLLINSON, H. R. 1993. Using Geochemical Data: Evaluation, Presentation, Interpretation. Longman, Harlow.
SCHWAB, F. L. 1975. Framework mineralogy and chemical composition of continental margin-type sandstones. Geology, 3, 487-490.
SWAN, A. R. H. & SANDILANDS, M. 1993. Introduction to Geological Data Analysis. Blackwell Science, Oxford.
VALLONI, R. 1985. Reading provenance from modern marine sands. In: ZUFFA, G. G. (ed.) Provenance of Arenites. North Atlantic Treaty Organization - Advanced Study Institutes (NATO-ASI), Series C, 148, 309-332.
WELTJE, G. J. 1995. Unravelling mixed provenance of coastal sands: the Po Delta and adjacent beaches of the northern Adriatic Sea as a test case. In: OTI, M. A. & POSTMA, G. (eds) Geology of Deltas. Balkema, Rotterdam, 181-202.
WELTJE, G. J. 1997. End-member modeling of compositional data: numerical-statistical algorithms for solving the explicit mixing problem. Mathematical Geology, 29, 503-549.
WELTJE, G. J. 2002. Quantitative analysis of detrital modes: statistically rigorous confidence regions in ternary diagrams and their use in sedimentary petrology. Earth-Science Reviews, 57, 211-253.
WELTJE, G. J. 2004. A quantitative approach to capturing the compositional variability of modern sands. Sedimentary Geology, 171, 59-77.
WELTJE, G. J. & VON EYNATTEN, H. 2004. Quantitative provenance analysis of sediments: review and outlook. Sedimentary Geology, 171, 1-11.
Detailed guide to CoDaPack: a freeware compositional software

S. THIÓ-HENESTROSA & J. A. MARTÍN-FERNÁNDEZ

Departament d'Informàtica i Matemàtica Aplicada, Universitat de Girona, Campus Montilivi, Edifici P4, E-17071 Girona, Spain (e-mail: [email protected])
Abstract: Suitable statistical methods for compositional data based on log-ratio methodology are not included in standard statistical packages. This work presents a detailed guide to CoDaPack, a freeware package based on Excel for Windows, which implements most of the basic statistical methods of this methodology. The package is available from the web site http://ima.udg.es/~thio/#Compositional Data Package; to install it, users need only have Excel installed on their computers. The guide starts with an example that shows how to work with the software and ends with a description of all features of CoDaPack.
CoDaPack (Thió-Henestrosa & Martín-Fernández 2005), available on the web site http://ima.udg.es/~thio/#Compositional Data Package, is freeware that implements the basic methods of analysis of compositional data based on log-ratios, following the methodology introduced in Aitchison (1986). Because there is still a lot of work to be done, the co-operation of the users will be essential in making it a really useful tool. Therefore, suggestions about the philosophy as well as about new features and options in the current functions are welcome. The application runs under a Windows operating system and requires Excel® to be installed on the computer. The file CoDaPack.xls must be opened with its associated macros enabled; it is operated through menus that appear in every sheet. The dataset must be located in a sheet of the file CoDaPack.xls, with observations in rows and parts in columns; the first row of each column should either contain the label of the part or be left blank. Each menu entry executes a macro that starts a Visual Basic routine. Numerical results appear in the active sheet, and graphical output appears in a separate window opened inside Excel.
Dealing with CoDaPack

In order to show how to work with CoDaPack, an example with real data is presented in this section. The data belong to the database of Cenozoic volcanic rocks of Hungary, kindly provided by Dr Lajos Ó.Kovács and Dr Gábor Kovács from the Hungarian Geological Survey (Ó.Kovács & Kovács 2001). This dataset consists of 959 unaltered rock samples and nine major oxides: SiO2, TiO2, Al2O3, Fe2O3total, MgO, CaO, Na2O, K2O and P2O5. From previous geological classification each sample belongs to one of two groups: alkaline basalts and calc-alkaline series. For a more detailed description of these groups see Kovács et al. (2006). As stated in that paper, the dataset contains some observations with zeros in one or more parts. Therefore the Rounded Zero Replacement routine should be performed first. It is selected with a single mouse click from the Operations menu (Fig. 1). A new window appears, standard for all CoDaPack routines, asking which columns of the active sheet to select and where to put the results (Fig. 2a). Its left side contains the Select Columns structure; the middle contains the Inputs structure and the Store In (Initial Column) box; and on the right there are the buttons Help, OK, Cancel, Clear and, in some routines, Options. Between the left and the middle part there are two arrows to pass information between them. When this window is opened the Select Columns list contains the first row of each column of the Excel sheet or, for a column without a label, the standard letter that identifies the columns of any Excel sheet. Also, if the routine has been executed before, the Inputs, Store In (Initial Column) and Options fields contain the values used in the last execution. First, the user should select the parts to be used in the routine, in this case the parts involved in the Rounded Zero Replacement. To do that the user has to mark a row of the Select Columns list with a single click and then click on the arrow. The name of the selected column then appears in the middle, inside the Inputs structure. The user should repeat this operation in order to select all the parts involved in the routine. To unselect a part, the user must mark it inside the Inputs structure and press the arrow, which now points in the opposite direction.
From: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 101-118. 0305-8719/06/$15.00 © The Geological Society of London 2006.
Fig. 1. Operations menu and its routines.
Finally the user should select, following the same procedure as before, where to put the results. The Rounded Zero Replacement output has the same number of parts as the initial data. The replaced values are obtained following the replacement methodology of Martín-Fernández et al. (2003), also described in this monograph (see Martín-Fernández & Thió-Henestrosa 2006). In order to avoid unnecessary selections, the user needs to select only the first output column; CoDaPack then uses as many columns as it needs, starting with the selected one. In the example all nine columns of the database have been selected and the results will start in column O (Fig. 2a). If the user now presses the OK button, Excel executes the routine and the results appear on the same sheet, where indicated. In this case the Rounded Zero Replacement has been done with the default δ = 0.005, but if the user wants another substitution constant, or needs a different substitution constant for each part, this can be set using the Options button (Fig. 2b). The Options button is available only for routines whose default settings can be changed; if it is not used, the routine is executed with the default values. Once zero replacement has been done the dataset contains positive values in all parts, and all the other routines of CoDaPack can be used. To explore the dataset, the example continues with a complete descriptive statistical summary using the Summary routine from the Descriptive Statistics menu. In the example all
Fig. 2. Forms of the Rounded Zero Replacement routine from the Operations menu: (a) main form; (b) Options form.
Fig. 3. Output of the Summary routine from the Descriptive Statistics menu.
nine columns - O to W - have been selected and the results are put starting in column Y (Fig. 3). As no options were selected the routine has calculated the 'Variation Array', the 'Centred Log-ratio (CLR) Variance' of each component, the 'Total Variance' and a summary of compositional descriptive statistics: 'Centre', 'Minimum', 'Maximum' and three quartiles. The upper diagonal of the 'Variation Array' and the 'CLR Variance' show that the compositional variability associated with the parts MgO, P2O5, TiO2, K2O and SiO2 is the largest. These components would be optimal candidates to obtain three-part subcompositions which retain the largest possible variability of the complete dataset. The exploration continues with the compositional biplot (Aitchison & Greenacre 2002) of the nine components, that is, a standard biplot computed on the centred log-ratio (clr) transformed parts of a composition. With the Options button, CoDaPack (Fig. 4) allows one to draw the biplot displaying either only the parts or also the observations. Another feature of all graphs in CoDaPack is that observations can be displayed in different colours or shapes according to a classification. Figure 5 shows two different displays of the biplot: the first one (Fig. 5a) with only parts and labelling the rays of the biplot, the second one (Fig. 5b) with observations in different colours, where black represents the calc-alkaline series and green the alkaline basalts. The proportion of total variability explained by the biplot (Fig. 5) is 77%, which is reasonably high. As stated in the previous descriptive analysis, the rays corresponding to MgO, P2O5, TiO2, K2O and SiO2 are the longest. Also the rays associated with SiO2, K2O, Na2O and Al2O3
Fig. 4. From the Graphs menu: main form and Options form of Biplot routine.
Fig. 5. Output of the Biplot routine from the Graphs menu: (a) parts only; (b) parts and the two groups.
are the closest rays to the first axis, the direction along which the projections of the alkaline basalts and the calc-alkaline series are best separated. The three-part subcomposition among the nine parts which retains the largest variability is [MgO, P2O5, K2O], which explains 49% of the total variability. CoDaPack can plot three-part subcompositions in a ternary diagram. Figure 6 shows four different ways to display a ternary diagram of the same dataset: the data in a basic diagram (Fig. 6a), with a grid (Fig. 6b), after centring the data (Fig. 6c) and with the centred data showing the grid perturbed by the centre of the data (Fig. 6d). Note that in the non-centred representations (Figs 6a, b) the data appear to follow a linear pattern. On the contrary, after centring (Figs 6c, d), a large dispersion in the data is observed with no clear trend. Moreover, in these figures the two groups do not appear well separated. The exploration continues by looking simultaneously for a subcompositional linear pattern and for a reasonable separation of the two groups. The ternary diagrams that include SiO2 and K2O, or SiO2 and Na2O, whose vertices in the biplot are quite close (Fig. 5), together with a third component such as MgO, whose vertex lies further apart in the biplot (Fig. 5), show a compositional linear pattern. As an example of this linear pattern a Compositional Principal Components Analysis (CPCA) of the subcomposition [MgO, SiO2, K2O] is performed. For
Fig. 6. Output of the Ternary Diagram routine from the Graphs menu: (a) default output with groups; (b) with grid; (c) after centring the data; (d) with the centred data showing the grid perturbed by the centre of the data.
better interpretation the subcomposition has been centred with the Centering routine of the Operations menu. Figure 7a shows this compositional linear pattern. Figures 7b and 7c show the linear pattern for each of the groups. The good separation of the alkaline basalts and the calc-alkaline series in the biplot could also be numerically confirmed by a linear discriminant analysis of the two groups applied to the alr-, clr- or ilr-transformed dataset (von Eynatten et al. 2003). The present version of CoDaPack does not perform linear discriminant analysis. The strategy is to transform the data with one of the three available log-ratio transformations, Additive Log-ratio (ALR), Centred Log-ratio (CLR) or Isometric Log-ratio (ILR), then to export the transformed data into any standard statistical package and to perform the classical linear discriminant analysis there. Following this procedure only 3.96% of the observations are incorrectly classified (misclassification rate) by the linear discriminant function.
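The transform-then-export strategy can be illustrated outside Excel. The following is a minimal Python sketch, not part of CoDaPack: the function names `alr` and `export_alr_csv`, the CSV layout and the three sample compositions are all hypothetical and chosen only for illustration. It applies the additive log-ratio transformation (last part as divisor) and writes the result as CSV text that a statistical package could read for a discriminant analysis.

```python
import csv
import io
import math


def alr(x):
    """Additive log-ratio: log of each part divided by the last part."""
    return [math.log(xi / x[-1]) for xi in x[:-1]]


def export_alr_csv(rows, part_names, groups):
    """Return CSV text with one alr-transformed observation per line.

    Hypothetical helper: the header and column order are assumptions,
    not a format defined by CoDaPack.
    """
    header = [f"ln({p}/{part_names[-1]})" for p in part_names[:-1]] + ["group"]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    for x, g in zip(rows, groups):
        writer.writerow([f"{v:.6f}" for v in alr(x)] + [g])
    return buffer.getvalue()


# Made-up three-part compositions, for illustration only (not the Hungarian data).
rows = [[0.10, 0.60, 0.30], [0.05, 0.70, 0.25], [0.30, 0.50, 0.20]]
groups = ["alkaline basalt", "alkaline basalt", "calc-alkaline"]
csv_text = export_alr_csv(rows, ["MgO", "SiO2", "K2O"], groups)
print(csv_text)
```

The group label travels with each transformed observation so that the receiving package can fit and evaluate the discriminant function directly.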
Fig. 7. Output of the Principal Components routine from the Graphs menu: (a) centred dataset; (b) calc-alkaline series; (c) alkaline basalts.
List of features of CoDaPack

The present version comprises six menus, with a total of 26 macros. The menus are:

• Transformations - transforms the data from the simplex to the real space or vice versa (Fig. 8).
• Operations - performs some operations on the data in the simplex (Fig. 1).
• Graphs - makes some graphical representations in the simplex or in real space (Fig. 9).
• Descriptive Statistics - performs some descriptive statistics (Fig. 10).
• Analysis - in the present version this menu performs only the Logistic Normality test (Fig. 11).
• Preferences - customizes the size of the graphs depending on the screen (Fig. 12).
Transformations menu
This menu performs transformations of the data from the simplex to the real space or vice versa.

Unconstrain/Basis. This routine returns the dataset unconstrained, that is, for each constrained observation $x$ it returns its unconstrained counterpart $y = [x_1 w, \ldots, x_D w]$, where $w$ is the size or weight of the observation. With this feature the data are transformed from the simplex to the real space. The user has to select the columns to unconstrain,
Fig. 8. Transformations menu and its routines.
Fig. 9. Graphs menu and its routines.
Fig. 10. Descriptive Statistics menu and its routines.
Fig. 11. Analysis menu and its routines.
Fig. 12. Preferences menu and its routines.
where to put the results and also to indicate the column of sizes or weights (Fig. 13).

Fig. 13. Form of the Unconstrain/Basis routine from the Transformations menu.
Raw-ALR. With this feature the data are transformed from simplex to real space according to the additive log-ratio transformation (alr) or its inverse transformation, that is, from real space to simplex, with the generalized additive logistic transformation (agl).

$$y = \mathrm{alr}(x) = \left[\ln\frac{x_1}{x_D}, \ldots, \ln\frac{x_{D-1}}{x_D}\right], \tag{1}$$

where $y \in \mathbb{R}^{D-1}$, the real space with $D-1$ dimensions, and

$$x = \mathrm{agl}(y) = \left[\frac{\exp(y_1)}{1 + \sum_{i=1}^{D-1}\exp(y_i)}, \ldots, \frac{\exp(y_{D-1})}{1 + \sum_{i=1}^{D-1}\exp(y_i)}, \frac{1}{1 + \sum_{i=1}^{D-1}\exp(y_i)}\right]. \tag{2}$$

Division in the alr transformation is performed with the last component according to the sequence selected by the user. The user has to select the columns to transform, where to put the results and also to select the specific transformation (Fig. 14).

Raw-CLR. With this feature the data are transformed from simplex to real space according to the centred log-ratio transformation (clr) or its inverse transformation, that is, from real space to simplex, with its inverse (clr$^{-1}$).

$$y = \mathrm{clr}(x) = \left[\ln\frac{x_1}{g(x)}, \ldots, \ln\frac{x_D}{g(x)}\right], \tag{3}$$

where $y \in \mathbb{R}^{D}$ and $g(x)$ is the geometric mean $\left(\prod_{i=1}^{D} x_i\right)^{1/D}$ of $x$, and the inverse transformation is

$$x = \mathrm{clr}^{-1}(y) = \left[\frac{\exp(y_1)}{\sum_{i=1}^{D}\exp(y_i)}, \ldots, \frac{\exp(y_D)}{\sum_{i=1}^{D}\exp(y_i)}\right]. \tag{4}$$

The user has to select the columns to transform, where to put the results and to indicate the specific transformation.
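Equations (1) to (4) can be checked numerically with a few lines of Python. This is a minimal sketch using only the standard library; the function names `alr`, `agl`, `clr` and `clr_inv` are chosen here for illustration and are not CoDaPack routine names. The composition is assumed to be strictly positive and the divisor of the alr is the last part, as stated above.

```python
import math


def alr(x):
    """Equation (1): log-ratios of the first D-1 parts over the last part."""
    return [math.log(xi / x[-1]) for xi in x[:-1]]


def agl(y):
    """Equation (2): inverse of alr, returning a D-part composition summing to 1."""
    denom = 1.0 + sum(math.exp(yi) for yi in y)
    return [math.exp(yi) / denom for yi in y] + [1.0 / denom]


def clr(x):
    """Equation (3): log of each part over the geometric mean of all parts."""
    log_gm = sum(math.log(xi) for xi in x) / len(x)
    return [math.log(xi) - log_gm for xi in x]


def clr_inv(y):
    """Equation (4): inverse of clr, returning a composition summing to 1."""
    total = sum(math.exp(yi) for yi in y)
    return [math.exp(yi) / total for yi in y]


x = [0.2, 0.3, 0.5]          # illustrative three-part composition
x_from_alr = agl(alr(x))     # round trip through equations (1) and (2)
x_from_clr = clr_inv(clr(x)) # round trip through equations (3) and (4)
print(alr(x), clr(x))
print(x_from_alr, x_from_clr)
```

Both round trips recover the original composition up to floating-point error, and the clr coordinates sum to zero, which is why the clr vector has D components but only D - 1 degrees of freedom.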
Fig. 14. Form of the Raw-ALR routine from the Transformations menu.
Raw-ILR. With this feature the data are transformed from simplex to real space according to the isometric log-ratio transformation (ilr) or its inverse transformation, that is, from real space to simplex, with its inverse (ilr$^{-1}$).

$$y = \mathrm{ilr}(x) = (y_1, \ldots, y_{D-1}) \in \mathbb{R}^{D-1}, \tag{5}$$

where

$$y_i = \frac{1}{\sqrt{i(i+1)}} \ln\left(\frac{\prod_{j=1}^{i} x_j}{(x_{i+1})^{i}}\right), \tag{6}$$

and

$$x = \mathrm{ilr}^{-1}(y) = (x_1, \ldots, x_D) \in \mathbb{R}^{D},$$

where

$$x_i = \frac{f(i-1)}{\sum_{j=0}^{D-1} f(j)} \tag{7}$$

and

$$f(i) = \left(\prod_{j=0}^{i-1} f(j)\right)^{1/i} \exp\left(-\sqrt{\frac{i+1}{i}}\; y_i\right), \tag{8}$$

with $f(0) = 1$.

The user has to select the columns to transform, where to put the results and indicates which transformation must be applied.

Operations menu

This menu performs operations inside the simplex, that is, operations where both the input data and the output results belong to the simplex.

Perturbation. With this feature a vector perturbs the data. The routine returns a D-composition $y = p \oplus x = C[p_1 x_1, \ldots, p_D x_D]$, where $C$ stands for the closure operation

$$C[x_1, x_2, \ldots, x_D] = \left[\frac{x_1}{\sum_{i=1}^{D} x_i}, \frac{x_2}{\sum_{i=1}^{D} x_i}, \ldots, \frac{x_D}{\sum_{i=1}^{D} x_i}\right]$$

and $p$ is a given D-composition. The user has to select the columns to be perturbed, where to put the results and indicates the column that contains the vector $p$ (Fig. 15).
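Two of the definitions in this part of the guide lend themselves to a quick numerical check: the ilr transformation of equations (5) and (6) with its inverse, and the perturbation operation with closure. The Python sketch below is illustrative only. The inverse is written as a recursion on the ratios f(i) = x_{i+1}/x_1 with f(0) = 1, which is one way of inverting equation (6) and is an assumption about the form, not necessarily the expression implemented in CoDaPack; all function names are hypothetical.

```python
import math


def closure(x):
    """Rescale positive values so that they sum to 1."""
    total = sum(x)
    return [xi / total for xi in x]


def perturb(p, x):
    """Perturbation: part-wise product followed by closure."""
    return closure([pi * xi for pi, xi in zip(p, x)])


def ilr(x):
    """Equation (6): y_i = ln(prod_{j<=i} x_j / x_{i+1}^i) / sqrt(i(i+1))."""
    y = []
    log_sum = 0.0  # running sum ln x_1 + ... + ln x_i
    for i in range(1, len(x)):
        log_sum += math.log(x[i - 1])
        y.append((log_sum - i * math.log(x[i])) / math.sqrt(i * (i + 1)))
    return y


def ilr_inv(y):
    """Invert ilr through the ratios f(i) = x_{i+1}/x_1, with f(0) = 1."""
    f = [1.0]
    log_prod = 0.0  # ln of f(0) * ... * f(i-1)
    for i, yi in enumerate(y, start=1):
        log_fi = log_prod / i - math.sqrt((i + 1) / i) * yi
        f.append(math.exp(log_fi))
        log_prod += log_fi
    return closure(f)


x = [0.1, 0.2, 0.3, 0.4]   # illustrative four-part composition
p = [0.4, 0.3, 0.2, 0.1]   # illustrative perturbing composition
y = ilr(x)
x_back = ilr_inv(y)
shifted = ilr(perturb(p, x))
print(y, x_back, shifted)
```

In ilr coordinates a perturbation acts as an ordinary translation, so `shifted` equals the element-wise sum of `ilr(p)` and `ilr(x)`; this is what makes centring and similar operations simple shifts in real space.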
Power transformation. This feature applies a power transformation to the data. For $a \in \mathbb{R}$, the power transformation returns $a \otimes x = C[x_1^a, \ldots, x_D^a]$. The user has to select the columns to be power-transformed, where to put the results and to enter the exponent of the operation (decimals separated by a comma) (Fig. 16).
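As a small illustration of the power transformation, the following Python sketch raises each part to the exponent and closes the result. The helper `parse_exponent`, which accepts a decimal comma as in the input form described above, is hypothetical and is not a CoDaPack function.

```python
def closure(x):
    """Rescale positive values so that they sum to 1."""
    total = sum(x)
    return [xi / total for xi in x]


def power(a, x):
    """Power transformation: raise every part to the exponent a, then close."""
    return closure([xi ** a for xi in x])


def parse_exponent(text):
    """Hypothetical helper: read an exponent typed with a decimal comma."""
    return float(text.replace(",", "."))


x = [0.2, 0.3, 0.5]                    # illustrative composition
squared = power(2, x)                  # closes [0.04, 0.09, 0.25]
softened = power(parse_exponent("0,5"), x)
print(squared, softened)
```

An exponent of 1 leaves the composition unchanged and an exponent of 0 maps every composition to the barycentre, which mirrors multiplication by a scalar in real space.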
Fig. 15. Form of the Perturbation routine from the Operations menu.
Centring. With this feature the data are centred, that is, the data are perturbed by the inverse of their geometric mean. The routine returns the dataset $Y$ formed by the D-compositions $y = g_N(X)^{-1} \oplus x$, where

$$g_N(X) = \left[\left(\prod_{n=1}^{N} x_{n1}\right)^{1/N}, \ldots, \left(\prod_{n=1}^{N} x_{nD}\right)^{1/N}\right] \tag{9}$$

is the vector of geometric means of the components of the $N$ vectors in the dataset $X$. Thus, the centre of the set $Y$ is $e$, the barycentre of the simplex, e.g. for $D = 3$ the geometric centre of a ternary diagram is $e = [0.333, 0.333, 0.333]$. The user has to select the columns to be centred, where to put the results and optionally the column where to put the geometric mean (Fig. 17).
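The centring operation can be sketched in Python as follows. This is an illustration under the assumption that all parts are strictly positive (that is, after zero replacement); the function names and the small dataset are hypothetical.

```python
import math


def closure(x):
    """Rescale positive values so that they sum to 1."""
    total = sum(x)
    return [xi / total for xi in x]


def geometric_means(data):
    """Equation (9): geometric mean of each part over the N observations."""
    n = len(data)
    d = len(data[0])
    return [math.exp(sum(math.log(row[j]) for row in data) / n) for j in range(d)]


def centre_data(data):
    """Perturb every observation by the inverse of the geometric-mean vector."""
    g = geometric_means(data)
    return [closure([xj / gj for xj, gj in zip(row, g)]) for row in data]


# Made-up three-part dataset, for illustration only.
data = [[0.70, 0.20, 0.10], [0.80, 0.15, 0.05], [0.60, 0.30, 0.10]]
centred = centre_data(data)
new_centre = closure(geometric_means(centred))
print(new_centre)
```

The closed geometric mean of the centred data is the barycentre [1/3, 1/3, 1/3], which is why centred data spread out around the middle of a ternary diagram.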
Standardization. This feature standardizes the data, that is, centres the data and makes the total variance equal to one. This routine returns a
sample of D-compositions Y, centred at e and with unit total variance. The user has to select the columns to standardize and where to put the results.
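One way to obtain a centred sample with unit total variance is sketched below in Python. Two assumptions are made that the text does not state: the total variance is taken as the sum of the sample variances (denominator N - 1) of the clr components, and standardization is carried out as centring followed by a power transformation with exponent 1/sqrt(total variance). In clr coordinates these two steps are a subtraction of the column means and a multiplication by a scalar. The function names are hypothetical.

```python
import math


def clr(x):
    """Centred log-ratio coordinates of a positive composition."""
    log_gm = sum(math.log(v) for v in x) / len(x)
    return [math.log(v) - log_gm for v in x]


def clr_inv(y):
    """Back-transform clr coordinates to a composition summing to 1."""
    total = sum(math.exp(v) for v in y)
    return [math.exp(v) / total for v in y]


def clr_column_means(data):
    """Mean of each clr component over all observations."""
    rows = [clr(r) for r in data]
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def total_variance(data):
    """Sum of the clr component variances (N - 1 denominator: an assumption)."""
    rows = [clr(r) for r in data]
    means = clr_column_means(data)
    squares = sum((r[j] - means[j]) ** 2 for r in rows for j in range(len(means)))
    return squares / (len(rows) - 1)


def standardize(data):
    """Centre the data and rescale them to unit total variance."""
    means = clr_column_means(data)
    scale = 1.0 / math.sqrt(total_variance(data))
    return [clr_inv([(v - m) * scale for v, m in zip(clr(r), means)]) for r in data]


# Made-up dataset, for illustration only.
data = [[0.2, 0.3, 0.5], [0.1, 0.6, 0.3], [0.4, 0.4, 0.2], [0.25, 0.25, 0.5]]
standardized = standardize(data)
print(total_variance(data), total_variance(standardized))
```

Whatever variance convention is used, applying the same convention before and after the rescaling gives a total variance of one for the standardized sample.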
Amalgamation. This feature amalgamates some columns of the data. The result of the amalgamation of some of the parts of a D-composition selected by the user is the sum of those parts. The user has to select the columns to amalgamate and the column where to put the results.
Subcomposition/Closure. With this feature the data are closed. This routine closes, i.e. reproportions, the data, that is, returns Y = C(X). If S parts (S < D) are selected, a subcomposition with S-parts is obtained. The user has to select the columns to close and where to put the results.
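Amalgamation and subcomposition are easily confused, so a short Python sketch may help. Amalgamation adds selected parts into a single value, whereas a subcomposition keeps the selected parts separate and closes them. The function names, zero-based indices and values are illustrative only.

```python
def closure(x):
    """Rescale positive values so that they sum to 1."""
    total = sum(x)
    return [xi / total for xi in x]


def amalgamate(x, indices):
    """Amalgamation: the sum of the selected parts."""
    return sum(x[i] for i in indices)


def subcomposition(x, indices):
    """Subcomposition: the selected parts, closed to sum to 1."""
    return closure([x[i] for i in indices])


x = [0.5, 0.2, 0.2, 0.1]                 # illustrative four-part composition
merged = amalgamate(x, [1, 2])           # parts 2 and 3 added together
sub = subcomposition(x, [1, 2, 3])       # last three parts, reproportioned
print(merged, sub)
```

The ratios between the retained parts are unchanged in the subcomposition, which is the property that log-ratio methods rely on.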
Rounded Zero Replacement. This feature applies a transformation of the data to avoid the rounded zeros. 'Rounded Zero Replacement' is a substitution of an observation $x$, with zeros in some parts, by an observation $y$ using the expression:

$$y_i = \begin{cases} \delta_i & \text{if } x_i = 0, \\ x_i \left(1 - \sum_{j:\, x_j = 0} \delta_j\right) & \text{if } x_i > 0, \end{cases} \tag{10}$$

where $\delta_i$ is the replacement value for the i-th part and is defined by the user. The default constant $\delta_i$ is 0.005 but the user can define another constant or a column of constants that contains as many constants as there are parts in the composition. The user has to select the input columns and where to put the results (Fig. 2).

Fig. 16. Form of the Power Transformation routine from the Operations menu.

Graphs menu

This menu enables the user to create two-dimensional graphs. The user can customize the appearance of each graph and, in some cases, plot the observations in the graph according to a previous classification.

Ternary diagram. This feature displays a ternary diagram of three selected parts. There are four options which modify the appearance of the graph (Fig. 18):

(1) differentiate, by colour or by shape, each point depending on a previous classification;
(2) label the vertices of the triangle (the default labels are the part names);
(3) perturb the data with the inverse of the centre (centring) or with a given vector;
(4) display a reference grid of values. The default values of the grid are 1, 10, 33, 66, 90 and 99 but the user can define other values in a column.
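Returning to the Rounded Zero Replacement routine of the Operations menu, the replacement rule of equation (10) can be sketched in a few lines of Python. The sketch assumes that each observation is closed to a sum of 1, so that the default value 0.005 is expressed on the same scale as the data; the function name and the example values are hypothetical.

```python
def replace_rounded_zeros(x, delta=0.005):
    """Replace zeros by delta and shrink the non-zero parts multiplicatively.

    `delta` is either one constant for all parts or a list with one
    constant per part. The input is assumed to sum to 1.
    """
    if isinstance(delta, (int, float)):
        deltas = [float(delta)] * len(x)
    else:
        deltas = list(delta)
    # Total mass assigned to the parts that were recorded as zero.
    assigned = sum(dj for xj, dj in zip(x, deltas) if xj == 0)
    return [dj if xj == 0 else xj * (1.0 - assigned)
            for xj, dj in zip(x, deltas)]


y = replace_rounded_zeros([0.0, 0.4, 0.6])
z = replace_rounded_zeros([0.0, 0.0, 0.5, 0.5], [0.01, 0.02, 0.03, 0.04])
print(y, z)
```

Because the non-zero parts are all multiplied by the same factor, their ratios are preserved and the result still sums to 1; only the constants belonging to zero parts are used.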
ALR plot. This feature displays a plot according to the ALR transformation of the three columns selected. There are two options which modify the appearance of the graph (Fig. 19):

(1) differentiate, by colour or by shape, each point depending on a previous classification;
(2) label the axes (default labels are log(x1/x3) and log(x2/x3)).
CLR plot. This feature displays a plot according to the centred log-ratio transformation (clr) of three selected parts. There are two options to modify the appearance of the graph:

(1) differentiate, by colour or by shape, each point depending on a previous classification;
(2) label the axes (default labels are ILR1 and ILR2).

Fig. 17. Form of the Centering routine from the Operations menu.

Fig. 18. From the Graphs menu: main form and Options form of the Ternary Diagram routine.

Fig. 19. From the Graphs menu: main form and Options form of the ALR Plot routine.
Biplot. This feature performs a 'Compositional Biplot' of selected parts. There are six options which modify the appearance of the graph (Fig. 4):
(1) indicate a column with the labels of the axes;
(2) differentiate, by colour or by shape, each point depending on a previous classification;
(3) choose the factor plane indicating which parts to display;
(4) label the observations (default is no label);
(5) display or not the observations (default is yes);
(6) display with a different mark the observations that are outliers (default is no mark).

Principal components. This feature calculates the two compositional Principal Components for a three-part composition formed by three selected parts and displays the result in a ternary diagram. There are two options which modify the appearance of the graph (Fig. 20):
(1) differentiate, by colour or by shape, each point depending on a previous classification;
(2) label the vertices of the triangle (default labels are the part names).

ALN confidence region. This feature calculates the 'Additive Logistic Normal Confidence Region' for the ALR-mean vector of the selected parts and displays the result in a ternary diagram. There are three options which modify the appearance of the graph (Fig. 22):
(1) perform an ALN Confidence Region for each group defined by a column;
(2) label the vertices of the triangle (the default labels are the part names);
(3) define the confidence level (the default is 0.95).
ALN predictive region. This feature calculates the 'Additive Logistic Normal Predictive Region' of the selected parts and displays the result in a ternary diagram. There are two options which modify the appearance of the graph (Fig. 21):
(1) label the vertices of the triangle (the default labels are the part names);
(2) choose the predictive levels (the default levels are 0.90, 0.95 and 0.99).

Fig. 20. From the Graphs menu: main form and Options form of the Principal Components routine.
Fig. 21. From the Graphs menu: main form and Options form of the ALN Predictive Region routine.
Fig. 22. From the Graphs menu: main form and Options form of the ALN Confidence Region routine.

Descriptive statistics menu
This menu returns characteristic values for a dataset.

Summary. Performs six descriptive statistics: three based on log-ratios ('Variation Array', 'CLR Variance' and 'Total Variance') and three compositional descriptive statistics ('Centre', 'Minimum' and 'Maximum', and quartiles).
(1) 'Variation Array'. Returns a matrix where the upper triangle contains the log-ratio variances and the lower triangle contains the log-ratio means. That is, the ij-th component of the upper triangle is $\mathrm{var}[\ln(X_i/X_j)]$, and the ij-th component of the lower triangle is $E[\ln(X_i/X_j)]$, where $i, j = 1, 2, \ldots, D$.
(2) 'CLR Variance'. Returns the sum of log-ratio variances that involve each part. The sum of all 'CLR Variances' is the 'Total Variance'. So

\[
\text{CLR-Variance}_i = \frac{1}{2D} \sum_{j=1,\, j \neq i}^{D} \mathrm{var}[\ln(X_i/X_j)] \qquad (11)
\]

(3) 'Total Variance'. Returns the sum of all 'CLR Variances'.
(4) 'Centre'. Returns the centre of the dataset, that is, $\hat{g} = C[g_1, g_2, \ldots, g_D]$, where $g_i = \left( \prod_{k=1}^{N} x_{ki} \right)^{1/N}$ is the geometric mean of part $X_i$ in dataset X. The dataset X has been previously closed.
(5) 'Minimum' and 'Maximum'. For each part of the dataset X, returns the minimum and the maximum of the closed dataset C(X).
(6) 'Quartiles'. For each part of the dataset X, returns Q1, the median and Q3 of the closed dataset C(X).
The user has to select the columns to be closed and where to put the results. There are two options in this routine (Fig. 23):
(1) perform the statistics for each group defined by a column;
(2) choose which descriptive statistics are computed (at least one must be chosen).
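The log-ratio statistics of the Summary routine can be reproduced by hand. The following base R sketch is an illustration on simulated data, not CoDaPack code; it assumes the sample variance with divisor N - 1 (the divisor used by CoDaPack is not stated here), and all object names are hypothetical.

```r
# Simulated four-part data set, closed to constant sum 1 (illustration only).
set.seed(1)
x <- matrix(exp(rnorm(40)), ncol = 4,
            dimnames = list(NULL, c("A", "B", "C", "D")))
x <- x / rowSums(x)
D <- ncol(x)
lx <- log(x)

# tau: symmetric matrix of log-ratio variances var[ln(Xi/Xj)].
# va:  variation array, variances above the diagonal and means below it.
tau <- matrix(0, D, D, dimnames = list(colnames(x), colnames(x)))
va <- tau
for (i in 1:D) for (j in 1:D) {
  lr <- lx[, i] - lx[, j]
  tau[i, j] <- var(lr)
  va[i, j] <- if (i < j) var(lr) else mean(lr)
}

clr_variance <- rowSums(tau) / (2 * D)   # equation (11), one value per part
total_variance <- sum(clr_variance)      # 'Total Variance'

# Centre: closed vector of the geometric means of the parts.
g <- exp(colMeans(lx))
centre <- g / sum(g)
```

The total variance obtained this way equals the sum of the variances of the clr-transformed columns, which gives a quick consistency check on the variation array.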
Centre. With this feature the user obtains the centre of the dataset as described in the Summary routine. The user has to select the columns to calculate the centre and where to put the result.

Total variance. With this feature the user obtains the total variance of the selected columns of the dataset as described in the Summary routine. The user has to select the columns to calculate the total variance and where to put the result.

Variation array. With this feature the user obtains the variation array of the selected columns. It returns a matrix where the upper triangle contains the log-ratio variances and the lower triangle contains the log-ratio means of the dataset, as described in Summary. The user has to select the columns to calculate the variation array and where to put the results.

Fig. 23. From the Descriptive Statistics menu: main form and Options form of the Summary routine.
Fig. 24. Form of Atypicality Indices routine from the Descriptive Statistics menu.
Atypicality indices. With this feature the user obtains the atypical observations and their atypicality indices under the assumption of an 'Additive Logistic Normal' distribution of the selected parts. The user has to select the columns for which atypical observations are calculated and where to put the results. The user also has to indicate the threshold of atypicality (usually 0.95) (Fig. 24).

Analysis menu
In the present version this menu performs only the 'Logistic Normality Test'.

Logistic Normality Test. This feature performs a test for:
(1) all marginal, univariate distributions (with a total of D tests);
(2) all bivariate angle distributions (with a total of D(D - 1)/2 tests);
(3) the D-dimensional radius distribution.
For each kind of test, the Anderson-Darling, Cramér-von Mises and Watson tests are performed.

Preferences menu
Screen size. With this feature the user indicates the resolution of the screen in order to obtain complete pictures of graphs on the screen. The default value is 1152 × 864 pixels.

Fig. 25. Form of the Sum Constraint routine from the Preferences menu.
Sum-constraint. With this feature the user indicates the constant used to close the data. The default value is 1 (Fig. 25).

This work has received financial support from the Dirección General de Investigación of the Spanish Ministry for Science and Technology through the project BFM2003-05640/MATE. The dataset from the database of Cenozoic volcanic rocks of Hungary has been kindly provided by L. Ó.Kovács and G. P. Kovács from the Hungarian Geological Survey.
References
AITCHISON, J. 1986. The statistical analysis of compositional data. Chapman & Hall, London. Reprinted (2003) by The Blackburn Press, Caldwell, NJ.
AITCHISON, J. & GREENACRE, M. 2002. Biplots of compositional data. Applied Statistics, 51, 375-392.
MARTÍN-FERNÁNDEZ, J. A. & THIÓ-HENESTROSA, S. 2006. Rounded zeros: some practical aspects for compositional data. In: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 191-201.
MARTÍN-FERNÁNDEZ, J. A., BARCELÓ-VIDAL, C. & PAWLOWSKY-GLAHN, V. 2003. Dealing with zeros and missing values in compositional data sets. Mathematical Geology, 35 (3), 253-278.
Ó.KOVÁCS, L. & KOVÁCS, G. P. 2001. Petrochemical database of the Cenozoic volcanites in Hungary: structure and statistics. Acta Geologica Hungarica, 44 (4), 381-417.
Ó.KOVÁCS, L., KOVÁCS, G. P., MARTÍN-FERNÁNDEZ, J. A. & BARCELÓ-VIDAL, C. 2006. Major-oxide compositional discrimination in Cenozoic vulcanites of Hungary. In: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 11-24.
THIÓ-HENESTROSA, S. & MARTÍN-FERNÁNDEZ, J. A. 2005. Dealing with compositional data: the freeware CoDaPack. Mathematical Geology, 37 (7), 777-797.
VON EYNATTEN, H., BARCELÓ-VIDAL, C. & PAWLOWSKY-GLAHN, V. 2003. Composition and discrimination of sandstones: a statistical evaluation of different analytical methods. Journal of Sedimentary Research, 73 (1), 47-57.
Compositional data analysis with 'R' and the package 'compositions'
K. G. VAN DER BOOGAART¹ & R. TOLOSANA-DELGADO²
¹Institut für Mathematik und Informatik, Ernst-Moritz-Arndt-Universität Greifswald, Greifswald D-17487, Germany (e-mail: boogaart@uni-greifswald.de)
²Departament Informàtica i Matemàtica Aplicada, Universitat de Girona, Girona E-17071, Spain

Abstract: This paper is a hands-on introduction and shows how to perform basic tasks in the analysis of compositional data following Aitchison's philosophy, within the statistical package 'R' and using a contributed package (called 'compositions'), which is devoted specially to compositional data analysis. The studied tasks are: descriptive statistics and plots (ternary diagrams, boxplots), principal component analysis (using biplots), cluster analysis with Aitchison distance, analysis of variance (ANOVA) of a dependent composition, some transformations and operations between compositions in the simplex.
From: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 119-127. 0305-8719/06/$15.00 © The Geological Society of London 2006.

This paper will show how the basic tasks of compositional data analysis (Aitchison et al. 2002) can be performed with the package 'compositions' in the free statistical environment 'R' (R Development Core Team 2003). The paper aims to be useful for a wide spectrum of 'R' users: for this reason, it is suggested that experienced users skip these first steps, whereas those who have never heard about 'R' should begin with Appendix A before continuing with the text. It is strongly recommended that the reader be in front of the computer, typing the examples outlined here: thus, text output of these instructions is kept to a minimum, and most figures are not included, although they are described briefly (with a few exceptions).

First steps
'R' is a powerful computer environment for multipurpose statistics and data analysis. It is available for all computer platforms and can be downloaded from 'http://www.cran.R-project.org'. 'Compositions' is a contributed package for 'R', devoted specially to the analysis of compositional data; it can be downloaded from 'http://www.stat.boogaart.de/compositions'. 'R' and 'compositions' are both distributed and developed under the GNU public license, hence they are available free of charge. Further instructions on downloading, installation and getting started with the software can be found in Appendix A. 'R' is classically based on a command line interface, but various graphical user interfaces are available from 'http://www.cran.R-project.org'. When compared with other compositional software, the 'R' package provides a maximum of flexibility. However, being based on a computer language, it requires users who are not afraid of reading manuals or of typing at a command line any command found there. It should also be remembered that 'R' and its packages are a living project permanently adapted to the development of the field. More instructions can be found at 'http://www.stat.boogaart.de/compositions/'.

After starting 'R' (either by clicking on the appropriate icon, selecting the entry 'R' in the start menu or by typing the command 'R' in a console or command window, after installing the software) a command window appears where commands can be given to 'R'. The following appears:

R : Copyright 2004, The R Foundation for Statistical Computing
Version 2.0.1 (2004-11-15), ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for a HTML browser interface to help.
Type 'q()' to quit R.
The version number should be checked, since at least version 2.0.0 is required for running compositions. The '>' mark shows that 'R' is willing to accept commands. This character should not be typed with the commands. To see how 'R' works, type '3*7', and hit the ENTER key to make 'R' execute this command:

> 3*7
[1] 21
>

'R' executes the command by multiplying 3 and 7 and then prints the result 21. For the moment, ignore the '[1]'. 'R' can in this way be used as an (extremely powerful) calculator. To prepare 'R' for compositional data analysis the library compositions must be loaded with the library command:

> library(compositions)
Attaching package 'compositions':
The following object(s) are masked from package:stats:
cor cov dist var
The following object(s) are masked from package:base:
%*%
>

Either such output, or no output at all, informs the user that the package has loaded properly. When an error such as this appears:

> library(compositions)
Error in library(compositions): There is no package called 'compositions'

this means that the package is not properly downloaded or installed. Instructions for downloading and installation of the package can be found in Appendix A. After loading the package, some example data from the package should be loaded with the 'data' command:

> data(SimulatedAmounts) # Load example data (no output)
> ? SimulatedAmounts # Show help about example data

Note that a hash mark '#' denotes the beginning of a comment: after it, the rest of the line is ignored by 'R'. Therefore, it is not necessary to type the comments.

When working in a terminal, the help can be closed by typing 'q' for Quit. In a windows-based environment the help window can simply be closed.

> ls() # Show names of all variables/datasets
[1] "sa.dirichlet" "sa.dirichlet.dil" "sa.dirichlet.mix"
[4] "sa.dirichlet5" "sa.dirichlet5.dil" "sa.dirichlet5.mix"
... (lines omitted)

The other commands show a typical usage of 'R': use '?' to get help information, or 'ls()' to show all variables/datasets defined previously. Just type the name of a dataset to show its content, which in this case is a set of simulated amounts of three different chemical elements in ppm:

> sa.lognormals # Show one of the datasets
Cu Zn Pb
[1,] 8.8043262 35.1671810 45.895025
[2,] 0.8115227 2.6547329 47.804310
[3,] 1.2836130 12.4472047 40.553628
... (lines omitted)
[60,] 3.9854998 6.1301909 40.579417

To edit or just to inspect the dataset in a spreadsheet-like environment the command 'fix(sa.lognormals)' may be used. Appendix A contains instructions on how to load datasets.

Basic compositional data analysis
Step zero when using the package is to mark your data explicitly as a set of elements from a simplex under Aitchison geometry (Aitchison et al. 2002). This is done by converting the dataset to an Aitchison compositional set through the function 'acomp', and storing it into a new variable by using the assignment sign '<-'.

> data(SimulatedAmounts) # just in case you start here
> cdata <- acomp(sa.lognormals)
> cdata
Cu Zn Pb
[1,] 0.097971136 0.391326782 0.51070208
[2,] 0.015828238 0.051778890 0.93239287
[3,] 0.023646054 0.229295970 0.74705798
... (lines omitted)
[60,] 0.078617049 0.120922730 0.80046022
attr(,"class")
[1] "acomp"
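The closure that 'acomp' applies can be checked by hand. The following base R sketch, which does not need the 'compositions' package, recomputes the first three rows of 'cdata' from the amounts printed above; the object names `amounts` and `closed` are illustrative.

```r
# First three rows of 'sa.lognormals' as printed above (amounts in ppm).
amounts <- rbind(c(8.8043262, 35.1671810, 45.895025),
                 c(0.8115227,  2.6547329, 47.804310),
                 c(1.2836130, 12.4472047, 40.553628))
colnames(amounts) <- c("Cu", "Zn", "Pb")

# Closure: divide every row by its total so that the parts sum to one.
closed <- amounts / rowSums(amounts)
round(closed, 6)
```

The rounded values agree with the first three rows of 'cdata', which suggests that 'acomp' here does no more than close each row to the constant one and attach the class attribute.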
The dataset is now closed. Note that the closure constant is automatically taken to be one. In this case, the resulting object is stored in 'cdata' and marked as an Aitchison composition (having the attribute class 'acomp') such that it is automatically treated in an adequate way in further commands. For example, the 'plot' command will automatically draw a ternary diagram:

> plot(cdata) # Ternary diagram
> ? plot.acomp # Help on the acomp specific plot function

One quits the help by closing the help window or by typing q for 'quit'.¹ There is always an example of the command at the end of each help page. Try this out to see what happens. Another graphical display of compositional data, related closely to the Aitchison geometry of the simplex and displaying the Aitchison distance in a visual way, is the boxplot of log-ratios:

> boxplot(cdata) # Box-plots of pairwise ratios in log scale
> ? boxplot.acomp # Help on compositional box-plots
> boxplot(cdata,log=FALSE) # use normal scale

As a result a square table of boxplots appears, displaying the ratios of the row and the column parts in log scale. Various types of descriptive statistics can be computed by intuitive commands:

> mean(cdata)
Cu Zn Pb
0.08918175 0.23949922 0.67131903
attr(,"class")
[1] "acomp"

The result is the mean in Aitchison geometry (i.e. closed geometric mean), which is again a composition. This single composition can be displayed by a pie-chart or a barplot:

> pie(mean(cdata))
> barplot(mean(cdata))

A barplot can also be used to display the whole dataset:

> barplot(cdata) # Display the whole data set

The variation of compositions can be summarized in several ways (Aitchison et al. 2002; Pawlowsky-Glahn & Egozcue 2001):

> variation(cdata) # Variation matrix
Cu Zn Pb
Cu 0.0000000 0.4046994 2.938182
Zn 0.4046994 0.0000000 2.910539
Pb 2.9381816 2.9105389 0.000000
> var(cdata) # Variance matrix of the clr-transform
Cu Zn Pb
Cu 0.4194692 0.2125124 -0.6319817
Zn 0.2125124 0.4102550 -0.6227674
Pb -0.6319817 -0.6227674 1.2547491
> mvar(cdata) # metric variance
[1] 2.084473
> msd(cdata) # metric standard deviation
    = sqrt(mvar/(D-1))
[1] 1.0209
> summary(cdata) # multiple information about pairwise ratios
...

Indented lines are continuations of the previous line. A graphical way to display the variability of a composition is the biplot, based on principal component analysis, which uses the clr transforms (Aitchison 2002):

> pca <- princomp(cdata) # perform PCA and store the result in pca
> pca # display results as text
Call:
princomp.acomp(x = cdata)
Standard deviations:
Comp.1 Comp.2
1.3604382 0.4460269
3 variables and 60 observations.
Mean (compositional):
Cu Zn Pb
0.08918175 0.23949922 0.67131903
attr(,"class")
[1] "acomp"
+Loadings (compositional):
Cu Zn Pb
Comp.1 0.5533583 0.5570883 1.8895534
Comp.2 0.4207858 1.7307697 0.8484445
attr(,"class")
[1] "acomp"
-Loadings (compositional):
Cu Zn Pb
Comp.1 1.312246 1.3034604 0.3842932
Comp.2 1.725060 0.4193976 0.8555428
attr(,"class")
[1] "acomp"
> screeplot(pca) # display importance of components
> biplot(pca) # display direction of components

Fig. 1. Biplot of a three-part composition (Cu, Zn, Pb).

¹ For some commands (barplot, boxplot, cdt, cor, cov, idt, mean, names, perturbe, plot, power, princomp, qqnorm, rnorm, runif, scale, segments, split, summary, var, +, -, *, /, %*%) you need to add '.acomp' to see the Aitchison compositional specific help.
The last component never has any importance. The first component, giving the highest variation, corresponds here to the balance of Pb against Zn and Cu (as can be seen in Fig. 1). It explains 90% of the variability, as can be obtained by dividing the variance of the first component by the metric variance of 'cdata'.
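The relations between these summaries can be checked with a few lines of base R, using only the numbers printed above. This is an illustrative sketch, not code from the 'compositions' package; the object names are hypothetical.

```r
# clr covariance matrix of 'cdata' as printed by var(cdata) above.
parts <- c("Cu", "Zn", "Pb")
S <- matrix(c( 0.4194692,  0.2125124, -0.6319817,
               0.2125124,  0.4102550, -0.6227674,
              -0.6319817, -0.6227674,  1.2547491),
            nrow = 3, byrow = TRUE, dimnames = list(parts, parts))

# Variation matrix: var(log(x_i/x_j)) = S_ii + S_jj - 2 * S_ij
tau <- outer(diag(S), diag(S), "+") - 2 * S

mvar <- sum(diag(S))                  # metric variance = trace of S
msd  <- sqrt(mvar / (nrow(S) - 1))    # metric standard deviation

# Share of the first principal component, from the standard deviations
# printed by princomp above.
sdev <- c(1.3604382, 0.4460269)
share <- sdev[1]^2 / sum(sdev^2)
round(c(mvar = mvar, msd = msd, share = share), 4)
```

The recomputed variation matrix, metric variance and metric standard deviation agree with the printed output, and the share of the first component comes out at about 0.90. The squared standard deviations sum to slightly less than the metric variance, which would be consistent with 'princomp' using the divisor N rather than N - 1, as base R's 'princomp' does.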
Working with compositions of four or more parts
To analyse a different dataset or a subcomposition you might assign something different to 'cdata' or any other variable representing your compositional dataset, e.g. by

> data(SimulatedAmounts) # Load the example datasets
> sa.groups5 # One of these
> cdata <- acomp(sa.groups5,parts=c("Cd","Pb","Co","Cu"))
> cdata

The optional parameter 'parts=' allows you to select the parts to be used in the subcomposition. Optional parameters are a typical way for 'R' to provide additional functionality beyond the default behaviour of a command. The possible optional parameters and their effects are documented in the help to each command, which can be invoked by '? nameoffunction'. The 'c()' function is used here just to Concatenate the variable names. In principle, every one of the aforementioned commands can now be applied to the new dataset. Try:

> plot(cdata)

Since a ternary diagram can display only three parts at the same time, a table of multiple ternary diagrams, containing subcompositions or marginal compositions (Fig. 2), must be displayed. By default, two parts are determined by the row and the column occupied by each plot, and the geometric mean of the remaining components is taken as the third component. Alternatively one can specify a component, by using the optional parameter 'margin':

> plot(cdata,margin="Cd")

Performing a cluster analysis with Aitchison distance
A hierarchical cluster analysis can be performed with the following instructions. First the clustering must be computed and the result stored in a variable:

> Clusters <- hclust(dist(cdata),method="complete") # compute clustering

Since 'cdata' is marked as an Aitchison composition, 'dist' automatically computes the Aitchison distance. The linkage method of clustering (here 'complete') can be replaced by any other method (e.g. 'single') as described in the help '? hclust'. Now the results should be displayed (Fig. 3):

> plot(Clusters) # shows the dendrogram
When the user has decided on the number of groups to interpret, maybe four in this case, a new variable containing the groups assigned to each case can be generated and the group membership can be displayed in ternary diagrams, boxplots and biplots.

> group <- cutree(Clusters,4)
> group
[1] 1 1 1 2 2 2 2 2 2 2 1 2 2 2 2 2 1 1 2 2 1 3 2 3 2 4 3 3 2 2 4 3 4 2 2 3 2 2
[39] 3 1 3 3 2 4 4 3 4 4 4 3 4 4 3 4 4 4 3 3 4 4
> plot(cdata,col=group)
> plot(cdata,pch=group)
> plot(cdata,pch=as.character(group))
> plot(cdata,col=group,center=T) # display centered data
> plot(Clusters,labels=group)
> biplot(princomp(cdata),xlabs=group)
> boxplot(cdata,factor(group))

Fig. 2. Matrix of ternary diagrams of a four-part composition (Cd, Pb, Co, Cu).
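What 'dist' computes for an 'acomp' object can be imitated in base R: the Aitchison distance is the Euclidean distance between clr-transformed compositions. The sketch below is an illustration on three made-up compositions and does not use the 'compositions' package; the helper `clr` is a hypothetical stand-in for the package function of the same name.

```r
# clr transform of one composition or of a matrix with one composition per row.
clr <- function(x) {
  x <- rbind(x)            # a single vector becomes a one-row matrix
  lx <- log(x)
  lx - rowMeans(lx)
}

a <- c(1, 2, 2)
b <- c(8, 1, 1)
X <- rbind(a, b, 10 * a)   # third row: the same composition as 'a', rescaled

# Euclidean distance between clr rows, i.e. the Aitchison distance.
d <- dist(clr(X))
as.matrix(d)

# Complete-linkage clustering on that distance, cut into two groups.
hc <- hclust(d, method = "complete")
cutree(hc, k = 2)
```

The rescaled copy of 'a' lies at distance zero from 'a', which illustrates that the Aitchison distance depends only on the ratios between parts and not on the total.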
Compositional computation
Various mathematical transformations and operations are defined for the Aitchison simplex (Aitchison et al. 2002). These are interesting mainly for developers of new statistical methods. The perturbation and the power transform are considered as addition and scalar multiplication in a vector space structure of the simplex:

> a <- acomp(c(1,2,2)) # a single composition
> a
[1] 0.2 0.4 0.4
attr(,"class")
[1] "acomp"
> b <- acomp(c(8,1,1)) # a second composition
> b
[1] 0.8 0.1 0.1
attr(,"class")
[1] "acomp"
> a+b # adding is perturbation
[1] 0.6666667 0.1666667 0.1666667
attr(,"class")
[1] "acomp"
> 2*a # multiplication is power transform
[1] 0.1111111 0.4444444 0.4444444
attr(,"class")
[1] "acomp"
> (a+a)/2-a # inverse operations
[1] 0.3333333 0.3333333 0.3333333
attr(,"class")
[1] "acomp"
> xx <- (cdata-mean(cdata))/msd(cdata) # parallel on the whole data set
> xx
...
> mean(xx) # xx is centered
Cd Pb Co Cu
0.25 0.25 0.25 0.25
attr(,"class")
[1] "acomp"
> msd(xx) # and normalized
[1] 1
> a %*% a # scalar product
[1] 0.320302
> norm(a) # norm
[1] 0.5659523
> cdata %*% acomp(c(1,2,3,4)) # scalar products
[1] 0.43526714 -1.53700304 2.40682769 1.61307672 2.64960822 2.38350333
... (lines omitted)
> yy <- chol(solve(clrvar2ilr(var(cdata)))) %*% cdata # matrix operation
> var(yy) # var^-0.5 * cdata
[,1] [,2] [,3] [,4]
[1,] 0.75 -0.25 -0.25 -0.25
[2,] -0.25 0.75 -0.25 -0.25
[3,] -0.25 -0.25 0.75 -0.25
[4,] -0.25 -0.25 -0.25 0.75

Fig. 3. Dendrogram of groups in a four-part composition (Cd, Pb, Co, Cu), defined by the Aitchison distance.
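The operations on 'a' and 'b' shown above can be spelled out in base R, which makes the definitions explicit. This is a minimal sketch with hypothetical helper names (`closure`, `perturb`, `power`, `clr`); it is not how the 'compositions' package implements its operators.

```r
closure <- function(x) x / sum(x)
perturb <- function(x, y) closure(x * y)        # 'a + b' for acomp objects
power   <- function(alpha, x) closure(x^alpha)  # 'alpha * a' for acomp objects
clr     <- function(x) log(x) - mean(log(x))    # centred log-ratio of one composition

a <- closure(c(1, 2, 2))
b <- closure(c(8, 1, 1))

perturb(a, b)               # 0.6666667 0.1666667 0.1666667, as printed for a+b
power(2, a)                 # 0.1111111 0.4444444 0.4444444, as printed for 2*a
sum(clr(a) * clr(a))        # 0.320302, the scalar product a %*% a
sqrt(sum(clr(a)^2))         # 0.5659523, the norm of a
```

Perturbation is thus a closed componentwise product and the power transform a closed componentwise power, while the scalar product and norm are the ordinary Euclidean ones applied to clr coefficients.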
The standard transforms can be computed by:

> CenteredLogRatio <- clr(cdata)
> IsometricLogRatio <- ilr(cdata)
> AdditiveLogRatio <- alr(cdata)
> OriginalData <- clr.inv(CenteredLogRatio)
> OriginalData <- ilr.inv(IsometricLogRatio)
> OriginalData <- alr.inv(AdditiveLogRatio)
> CenteredLogRatio
Cd Pb Co Cu
[1,] -1.62815906 2.82253578 -0.46785026 -0.72652645
[2,] -0.28013715 2.27945644 1.20048072 -3.19980002
... (lines omitted)
[60,] 1.72468264 -2.56743127 -0.16183132 1.00457995
attr(,"class")
[1] "rmult"
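For a single composition the three transforms and their inverses can be written out in base R. The sketch below is illustrative only: the composition is made up, and the orthonormal basis built from normalised Helmert contrasts is one possible choice for the ilr, which may differ in sign or order from the basis used by the package.

```r
x <- c(Cd = 0.1, Pb = 0.6, Co = 0.1, Cu = 0.2)   # made-up composition
D <- length(x)

clr_x <- log(x) - mean(log(x))    # centred log-ratio
alr_x <- log(x[-D] / x[D])        # additive log-ratio, last part as denominator

# One possible orthonormal basis of the clr plane (normalised Helmert contrasts).
V <- apply(contr.helmert(D), 2, function(v) v / sqrt(sum(v^2)))
ilr_x <- drop(clr_x %*% V)        # isometric log-ratio coordinates

# Inverse transforms: exponentiate and close.
closure <- function(z) z / sum(z)
from_clr <- closure(exp(clr_x))
from_ilr <- closure(exp(drop(V %*% ilr_x)))
from_alr <- closure(exp(c(alr_x, 0)))
```

All three inverses return the original composition. The ilr coordinates have the same Euclidean norm as the clr coefficients, which is the property that allows standard multivariate methods to be applied to them.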
Working with grouped data
Sometimes a group membership is available. In the case of the 'sa.groups5' dataset, it represents a sector of an imaginary river where the sample was originally collected. This information is stored in the variable 'sa.groups5.area':

> sa.groups5.area
[1] Upper Upper Upper Upper Upper Upper Upper Upper Upper Upper
[11] Upper Upper Upper Upper Upper Upper Upper Upper Upper Upper
[21] Middle Middle Middle Middle Middle Middle Middle Middle Middle Middle
[31] Middle Middle Middle Middle Middle Middle Middle Middle Middle Middle
[41] Lower Lower Lower Lower Lower Lower Lower Lower Lower Lower
[51] Lower Lower Lower Lower Lower Lower Lower Lower Lower Lower
Levels: Lower Middle Upper
> plot(acomp(sa.groups5),col=sa.groups5.area)
> boxplot(acomp(sa.groups5),sa.groups5.area)

When the user wants to analyse only one of the groups, a subset of the data is selected based on a criterion:

> upper <- split(cdata,sa.groups5.area)[["Upper"]]
> plot(upper)
> mean(upper)

A parallel analysis of all groups is possible through the 'lapply' or 'sapply' function of 'R':

> sapply(split(cdata,sa.groups5.area),mean)
Lower Middle Upper
Cd 0.064366655 0.006801803 0.001636658
Pb 0.059178242 0.572889919 0.957191909
Co 0.009509995 0.002658064 0.004323912
Cu 0.866945108 0.417650214 0.036847521
The grouping information can also be used to check whether the groups are really different, in a Multivariate Analysis of Variance (manova), which can be done by 'R' standard routines based on the ilr transform:

> m <- manova(ilr(cdata)~sa.groups5.area)
> summary(m)
Df Pillai approx F num Df den Df Pr(>F)
sa.groups5.area 2 1.0872 22.2312 6 112 < 2.2e-16 ***
Residuals 57
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> plot(ilr.inv(residuals(m)),col=sa.groups5.area)
> plot(ilr.inv(predict(m)),col=sa.groups5.area)
> qqnorm(ilr.inv(residuals(m)))
> mvar(predict(m))/(mvar(residuals(m)+predict(m))) # "R"^2
[1] 0.3980416
> diag(ilrvar2clr(var(predict(m)))/ilrvar2clr(var(residuals(m)+predict(m))))
[1] 0.4001846 0.5670027 0.1392320 0.2654141

Here one sees a highly significant influence of the group, given by a p-value stated as '< 2.2e-16'. If this example is run, a series of plots results: the first one shows the residuals with substantial spread. The second plot shows the location of the predicted group means in ternary diagrams. Unfortunately, the variable names are lost during the ilr transform, such that the plots are drawn without labels. The third plot shows qqnorm-plots of the pairwise log-ratios, in order to check the normality assumption used in the manova. The last two calculations give the total R² of the model, about 40%, and the individual R² values for the four parts of the composition. In a similar way a discriminant analysis can be performed based on the ilr transform and standard functionality of 'R':

> library(MASS) # Loading appropriate library
> # Generating example data
> subsample <- sample(1:60,45)
> TrainingData <- acomp(cdata[subsample,])
> TrainingGroups <- sa.groups5.area[subsample]
> ControlData <- acomp(cdata[-subsample,])
> ControlGroups <- sa.groups5.area[-subsample]
> ControlGroups
[1] Upper Upper Upper Middle Middle Middle Middle Middle Middle Middle
[11] Lower Lower Lower Lower Lower
Levels: Lower Middle Upper
> # Performing the discriminant analysis
> dscr <- lda(TrainingGroups~.,ilr(TrainingData)) # Discriminant Analysis
> dscr
... (output omitted)
> predict(dscr,newdata=ilr(ControlData)) # Classify ControlData
$class
[1] Upper Upper Upper Middle Middle Lower Middle Middle Middle Middle
[11] Lower Lower Lower Lower Lower
Levels: Lower Middle Upper
$posterior
Lower Middle Upper
1 3.626286e-16 1.851031e-07 9.999998e-01
2 7.991869e-12 8.473827e-05 9.999153e-01
... (lines omitted)
> table(ControlGroups,predict(dscr,newdata=ilr(ControlData))$class)
ControlGroups Lower Middle Upper
Lower 5 0 0
Middle 1 6 0
Upper 0 0 3

The calculated classification of the 15 control samples based on 45 training samples was thus correct, with one exception. More detailed information about discriminant analysis and the 'lda' function can be found in the 'R' help.
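The 'one exception' can be read off the cross-tabulation with a short base R calculation. The matrix below simply retypes the table printed above; the object names are illustrative.

```r
# Cross-tabulation printed above: rows are true groups, columns are predicted groups.
groups <- c("Lower", "Middle", "Upper")
tab <- matrix(c(5, 0, 0,
                1, 6, 0,
                0, 0, 3),
              nrow = 3, byrow = TRUE,
              dimnames = list(ControlGroups = groups, Predicted = groups))

correct  <- sum(diag(tab))        # samples on the diagonal are classified correctly
accuracy <- correct / sum(tab)    # proportion of correctly classified control samples
c(correct = correct, total = sum(tab), accuracy = round(accuracy, 3))
```

Because the training subsample is drawn at random with 'sample', a rerun of the example will generally give a different control set and a slightly different table.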
Importing data to 'R' The most simple way to provide data to 'R' is to store them into a simple text file. The first row should contain the variable names seperated by a semicolon. The following lines contain the data, again separated by a semicolon.
Conclusions For the beginner, this approach i m m e d i a t e l y provides all basic compositional plots, summaries and transformation in the form o f simple standard commands given in this publication. M o r e helpful reading can be found in Using the R package 'compositions', available at h t t p : / / w w w . s t a t . boogaart.de/compositions. Users can perform advanced analysis using the p a c k a g e in c o m b i nation with the statistical sub-routines o f ' R ' as exemplified in the later chapters and experts can even extend the functionality through the programming interface o f 'R'. The authors are open to suggestions to include m o r e functionality and c o n v e n i e n c e to the package.
Cd;Zn;Pb;Cd;Co 1.2;2.6;4.9;0.2;5 23.4;11;0.2;0.002;6.2 . . .
The data can then be loaded by the 'R' commands:
> mydata <- read.csv("C:/mydirectory/myfile.txt", sep=";", dec=".")
> fix(mydata) # you must close the window afterwards
Appendix A: Help with technical details
Downloading and installing 'R'
On 'http://www.cran.R-project.org' one can find detailed instructions on downloading and installing 'R', as well as the downloadable packages themselves. Users must download the setup program of the base part of a precompiled binary distribution of 'R' for their platform. For example, for Windows users it is sufficient to download the 'rw????.exe' file from 'http://www.cran.R-project.org' and to double click the downloaded file to start the installation process.
Downloading and installing 'compositions'
One should always check with the fix command that the data are properly loaded before using them. Directories must always be separated by a forward slash '/' in pathnames. All spreadsheet programs can export to this format when instructed to store as '.csv'. The separator and the decimal symbol can vary with the local configuration of the computer. Note that 'R' only uses a dot as the decimal symbol in any output, although the import procedure accepts the optional parameter 'dec=","' to deal with a decimal comma (see '?read.csv' or '?read.table'). To use the tabulator as the separating character use 'sep="\t"' in the 'read.csv' command.
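The import step described above can be tried without touching any real data. The following is a minimal sketch in base R; the temporary file, the column labels and the values are illustrative assumptions, not part of the original dataset.

```r
# Write a small semicolon-separated file to a temporary location
# (file name, column labels and values are purely illustrative).
tmp <- tempfile(fileext = ".txt")
writeLines(c("Cd;Zn;Pb;Cu;Co",
             "1.2;2.6;4.9;0.2;5",
             "23.4;11;0.2;0.002;6.2"), tmp)

# sep is the field separator, dec the decimal symbol used in the file
mydata <- read.csv(tmp, sep = ";", dec = ".")
str(mydata)   # check that every column was read as numeric

# The same kind of file written with a decimal comma is read by changing dec
tmp2 <- tempfile(fileext = ".txt")
writeLines(c("Cd;Zn", "1,2;2,6", "23,4;11"), tmp2)
mydata2 <- read.csv(tmp2, sep = ";", dec = ",")
str(mydata2)
```

If a column shows up as character or factor in the 'str' output, the separator or the decimal symbol most probably does not match the file.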
Newbie problems and solutions
The package is available from 'http://www.stat.boogaart.de/compositions/' or as a contributed package from 'http://www.cran.R-project.org'. In a Windows system ('R' v.2.0.1 or later) the package can now be installed through the 'Packages' menu option 'Install package(s) from local zip file...'. On Unix/Linux systems it is done by the command:
R CMD INSTALL DownloadedPackage.tar.gz
with 'DownloadedPackage.tar.gz' replaced by the actual filename of the downloaded package.
• 'R' is not exactly made for beginners: it's not your fault. Try to find someone to help you. Next year you will be the expert.
• When 'R' does not find your file, give the whole path and separate the directories with '/', not with a backslash. Don't forget the extension (e.g. '.txt').
• Wear glasses when copying and typing commands.
• When you get neither a plot nor an error, the plot window is probably iconified.
• When 'R' answers with '+' instead of '>', you have made a typing error and 'R' thinks that the command is not yet finished. Type ';', the ENTER-key, and try again.
• When you are bored by retyping commands again and again, try the up and down arrow keys or copy and paste from your favourite editor or a script window.
• It doesn't work in the second session: have you loaded all necessary libraries and prepared all variables?
• 'R' comes with plenty of help. Type the 'help.start()' command and start with 'Introduction to R'.
• Type 'q()' for quit and the ENTER-key to leave 'R'. Save your workspace, when asked.
References
AITCHISON, J. 2002. Simplicial inference. In: VIANA, M. A. G. & RICHARDS, D. S. P. (eds) Algebraic Methods in Statistics and Probability. Contemporary Mathematics Series, 287, American Mathematical Society, Providence, Rhode Island, 1-22.
AITCHISON, J., BARCELO-VIDAL, C., EGOZCUE, J. J. & PAWLOWSKY-GLAHN, V. 2002. A concise guide to the algebraic geometric structure of the simplex, the sample space for compositional data analysis. In: BAYER, U., BURGER, H. & SKALA, W. (eds) Proceedings of the 8th Annual Conference of the International Association for Mathematical Geology, Berlin, Germany, 387-392.
PAWLOWSKY-GLAHN, V. & EGOZCUE, J. J. 2001. Geometric approach to statistical analysis on the simplex. Stochastic Environmental Research and Risk Assessment, 15 (5), 384-398.
R Development Core Team 2003. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria (http://www.R-project.org).
Visualization of three- and four-part (sub)compositions with R
M. BREN 1,3 & V. BATAGELJ 2,3
1 University of Maribor, Faculty of Organizational Sciences, Kidričeva 55a, 4000 Kranj, Slovenia (e-mail: matevz.bren@fov.uni-mb.si)
2 University of Ljubljana, Faculty of Mathematics and Physics, Jadranska 19, 1000 Ljubljana, Slovenia
3 Institute of Mathematics, Physics and Mechanics, Jadranska 19, 1000 Ljubljana, Slovenia
Abstract: In 2003 the MixeR (Mixtures with R) project was started and work began to develop a library of functions written in R to support the analysis of compositional data, i.e. mixtures. This paper presents the 'mix' object in R, reading different data file formats and some MixeR routines for graphical presentation of three- and four-part (sub)compositions in ternary diagrams and tetrahedrons. Additional graphical features and use of parameters are applied on real data - a glacial dataset and a dataset of a researcher's daily activities, both from Aitchison's (1986) book. All these routines and datasets are available at http://vlado.fmf.uni-lj.si/pub/MixeR
The paper begins by introducing the two datasets that will serve as examples: the data on a researcher's daily activities and the glacial data. Readers are strongly recommended to sit in front of a computer with R installed (R beginners: see 'Downloading and installing R' in Boogaart & Tolosana-Delgado 2006) and to type in the examples outlined here. In this way the reader will also see all the figures produced in colour and will understand the ideas on classification and classes more clearly. All MixeR routines and the two datasets are available at http://vlado.fmf.uni-lj.si/pub/MixeR
Researcher's daily activities data
Dataset No. 31 from Aitchison (1986) gives the activity patterns of a statistician for 20 days. For each of the 20 days it records the proportions of the 24 hours spent teaching, in consultation, on administrative work, on research, on other wakeful activities and asleep. The activity proportions, not the values in hours, are given. The data are therefore portions of a day summing to one and thus compositional, i.e. mixtures. The data are stored in a file in matrix form, with days as rows and activities as columns; the first row holds the abbreviations of the activities, i.e. the variable names: teac - teaching, cons - consultation, admi - administration, rese - research, wake - other wakeful activities and slee - sleep. This dataset is presented in Table 1. The six activities may be divided into two categories: 'work' comprising activities 1, 2, 3, 4, and 'leisure' comprising activities 5 and 6. These data will be used to present the subcompositional concepts and the visualization of the data in ternary diagrams, showing variability with border percentile lines, centring, etc.
Glacial dataset
Dataset No. 18 from Aitchison (1986) gives 92 samples of pebbles of glacial tills sorted into four categories: red sandstone, grey sandstone, crystalline and miscellaneous. The percentages by weight of these four categories and the total pebble counts are recorded. These data are stored in the 'CoDa' data file format, in which each line comprises just one record: the first line holds the data file name, the second the number of variables, the third the number of cases, the next rows the variable labels, then No. 1 for the first case and the variable values for the first case, then No. 2 and the variable values for the second case, etc. until the last, No. 92 (Table 2). The data in Table 2 will be used to present the visualization in tetrahedrons and KiNG/Mage viewer animation, dealing with zeros, Aitchison's distance computation and classification.
Compositional data analysis software tools
First, a brief, non-exhaustive history of software tools for compositional data analysis will be given. CoDa, a microcomputer package for the statistical analysis of compositional data, was the first
From: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 129-143. 0305-8719/06/$15.00 © The Geological Society of London 2006.
software on compositions, written in Quick Basic by John Aitchison and available with his book (Aitchison 1986). CoDa was later upgraded by John Bacon-Shone.

Table 1. Researcher's daily activities dataset
      teac   cons   admi   rese   wake   slee
 1   0.162  0.041  0.138  0.123  0.254  0.282
 2   0.200  0.039  0.073  0.076  0.346  0.266
 3   0.201  0.082  0.115  0.146  0.194  0.261
 4   0.134  0.077  0.107  0.146  0.214  0.321
 5   0.224  0.080  0.091  0.162  0.195  0.248
 6   0.144  0.063  0.103  0.123  0.316  0.252
 7   0.125  0.054  0.137  0.102  0.312  0.270
 8   0.127  0.077  0.110  0.101  0.341  0.244
 9   0.139  0.052  0.128  0.111  0.266  0.304
10   0.108  0.052  0.082  0.075  0.413  0.270
11   0.187  0.091  0.113  0.116  0.264  0.228
12   0.184  0.070  0.066  0.151  0.305  0.216
13   0.155  0.086  0.101  0.119  0.225  0.315
14   0.181  0.097  0.081  0.164  0.271  0.206
15   0.224  0.096  0.101  0.142  0.203  0.234
16   0.198  0.067  0.139  0.154  0.162  0.281
17   0.214  0.073  0.102  0.130  0.201  0.281
18   0.132  0.037  0.148  0.099  0.307  0.277
19   0.167  0.073  0.127  0.122  0.266  0.245
20   0.166  0.064  0.101  0.145  0.242  0.282
From dataset 31 (Aitchison 1986).
Table 2. Glacial dataset
b:glacial.dat
5
92
Case no  A     B     C    D    Count
1        91.8  7.1   1.1  0    282
2        88.9  10.1  0.5  0.5  368
...
92       31.4  65.9  2.7  0    698
From dataset 18 (Aitchison 1986).
CoDaPack freeware software was next, written in Excel in 2001 by Santiago Thió Fernández de Henestrosa and Josep Antoni Martín-Fernández from the Girona compositional research group. An introduction is available in this volume (Thió-Henestrosa & Martín-Fernández 2006). There were also some attempts written in R, a language and environment for statistical computing and graphics. R (http://www.r-project.org/) is 'GNU S'; it provides a wide variety of statistical and graphical techniques (linear and non-linear modelling, statistical tests, time-series analysis, classification, clustering, etc.). Further extensions can be provided as packages. Basic Compositional Data Analysis functions for S+/R comprising basic operations, transformations, estimators and plots, written by Joel Reynolds & Dean Billheimer (2002) from the Washington University, are available at http://www.biostat.wustl.edu/archives/html/s-news/2003-12/msg00139.html
In 2003 work began to develop a library MixeR of functions in R to support the analysis of compositional data, i.e. mixtures (Bren & Batagelj 2003). Routines were provided for:
• Operations on compositions: perturbation and power transformation, subcomposition with or without residuals, centring of the data, computing Aitchison's, Euclidean and Bhattacharyya distances and the compositional Kullback-Leibler divergence - see Martín-Fernández et al. (1999).
• Graphical presentation of three- and four-part (sub)compositions in ternary diagrams and tetrahedrons with additional features: geometric mean of the dataset, the percentiles and ratio lines, centring of the data, notation of individual data in the set, marking and colouring of subsets of the dataset, their geometric means, etc.
• Log-ratio transformations of compositions into real vectors that are amenable to standard multivariate statistical analysis, etc.
The current version of the MixeR library is available at http://vlado.fmf.uni-lj.si/pub/MixeR
In April 2005 Kjetil Halvorsen, author of the 'Fahrmeir' R package, reported his work on coding some compositional routines in R (operations on compositions, alr and clr transformations, etc.). In June 2005 a 'compositions' package, written by K. Gerald van der Boogaart and Raimon Tolosana-Delgado, was published and is now available at http://cran.r-project.org/src/contrib/Descriptions/compositions.html. To analyse compositions this package supports four different multivariate scales represented by four different classes: 'rplus' - the total amount is meaningful and data are analysed in real geometry; 'rcomp' - the total amount is meaningless or the individual amounts are parts of a whole in equal units and data are analysed in real geometry; 'acomp' - the total amount is meaningless or the individual amounts are parts of a whole in equal units and the data should be analysed in a relative, i.e. Aitchison's, geometry; 'aplus' - the total amount is meaningful and the data should be analysed in relative geometry. Choosing the right type of analysis according to the data is left to the user. The package manual 'The compositions Package' and the introduction 'Using the R package "compositions"' are also available. An introduction to this package is also available in this volume (Boogaart & Tolosana-Delgado 2006).
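To make the listed operations concrete, the following base-R sketch shows closure, perturbation, power transformation, subcomposition (with and without residual) and the clr transformation on the first two days of Table 1. It is not the MixeR code: all function names are illustrative assumptions, and only the standard textbook definitions of these operations are used.

```r
# Base-R sketch of some operations on compositions (not the MixeR code).
# All function names below are illustrative assumptions.

# Closure: rescale each row to a constant sum
closure <- function(x, total = 1) {
  x <- if (is.null(dim(x))) matrix(x, nrow = 1) else as.matrix(x)
  x / rowSums(x) * total
}

# Perturbation of the rows of x by the vector p, and power transformation
perturb <- function(x, p) closure(sweep(closure(x), 2, p, "*"))
power   <- function(x, a) closure(closure(x)^a)

# Subcomposition of the columns k; with residual = TRUE the remaining
# columns are amalgamated into one extra part
subcomp <- function(x, k, residual = FALSE) {
  x <- closure(x)
  out <- x[, k, drop = FALSE]
  if (residual) out <- cbind(out, residual = rowSums(x[, -k, drop = FALSE]))
  closure(out)
}

# Centred log-ratio (clr) transformation
clr <- function(x) {
  lx <- log(closure(x))
  lx - rowMeans(lx)
}

# First two days of the activity data (Table 1)
act <- rbind(c(0.162, 0.041, 0.138, 0.123, 0.254, 0.282),
             c(0.200, 0.039, 0.073, 0.076, 0.346, 0.266))
colnames(act) <- c("teac", "cons", "admi", "rese", "wake", "slee")

sub3 <- subcomp(act, c(1, 2, 4))                # teaching, consultation, research
res3 <- subcomp(act, c(1, 4), residual = TRUE)  # teaching, research, residual
print(round(sub3, 3))
print(round(res3, 3))
```

The sketch keeps every operation closed to a constant sum, so that results can be compared directly with normalized mix objects.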
The mixture class in R
The mix object in R will be presented, dealing with different data file formats and some R routines for graphical presentation of three- and four-part (sub)compositions in ternary diagrams and tetrahedrons, not incorporated in the 'compositions' package. Additional features will be applied: plotting the geometric mean of the dataset, ratio lines and/or percentile lines, marking and colouring subsets of the dataset and centring of the dataset. The input mixture data consist of a data matrix preceded by a title. They are represented as an R 'data frame', an object m consisting of

m$tit    the title of the dataset,
m$sum    the value of the row sums, if constant,
m$sta    the status of the mix object with values
           -2  matrix contains negative elements,
           -1  zero row sum exists,
            0  matrix contains zero elements,
            1  matrix contains positive elements, rows with different row sum(s),
            2  matrix with constant row sum, and
            3  normalized mixture, the row sums are all equal to 1,
m$mat    the matrix with the data, and
m$class  the special attribute of the object, used to allow for an object-orientated style of programming in R. For the MixeR purpose the class is defined as 'mixture'.

Example of the mixture object - the dataset of researcher's daily activities (Table 1). Proportions of a day in activity are given for a statistician for 20 days.
> m <- mix.Read('activity.dat')
> m
Here the R cursor mark '>' denotes the start of the command line and '<-' is an assignment operator. The output is
$tit
[1] "Researcher's daily activities"
$sum
[1] NA
$sta
[1] 1
$mat
    teac  cons  admi  rese  wake  slee
1  0.162 0.041 0.138 0.123 0.254 0.282
2  0.200 0.039 0.073 0.076 0.346 0.266
3  0.201 0.082 0.115 0.146 0.194 0.261
...
19 0.167 0.073 0.127 0.122 0.266 0.245
20 0.166 0.064 0.101 0.145 0.242 0.282
attr(,"class")
[1] "mixture"
It should be explained that $sum is NA (not available) because the row sums are not all exactly equal to one, owing to rounding errors; therefore the status $sta is one, i.e. the matrix contains positive elements and rows with different row sums.

The 'mix' procedures in R
Some basic MixeR routines are presented.
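As an illustration of the status codes m$sta that these routines rely on, the following base-R sketch assigns a code to a plain data matrix. The function name mix_status and the order in which the conditions are tested are assumptions made for this sketch; the actual MixeR implementation may differ.

```r
# Hypothetical helper (not part of MixeR): status code of a data matrix,
# following the list of m$sta values given in the text.
mix_status <- function(mat, eps = 1e-6) {
  mat <- as.matrix(mat)
  rs  <- rowSums(mat)
  if (any(mat < 0))                 return(-2)  # negative elements
  if (any(abs(rs) < eps))           return(-1)  # a zero row sum exists
  if (any(abs(mat) < eps))          return(0)   # zero elements
  if (any(abs(rs - rs[1]) >= eps))  return(1)   # positive, differing row sums
  if (abs(rs[1] - 1) < eps)         return(3)   # normalized, sums equal to 1
  2                                             # constant row sum other than 1
}

# Days 1 and 3 of Table 1: the second row sums to 0.999 because of rounding
act <- rbind(c(0.162, 0.041, 0.138, 0.123, 0.254, 0.282),
             c(0.201, 0.082, 0.115, 0.146, 0.194, 0.261))
print(mix_status(act))

# First glacial case (Table 2), which contains a zero
print(mix_status(rbind(c(91.8, 7.1, 1.1, 0))))
```

With these inputs the first call reports rows with different row sums and the second reports a matrix containing zero elements, in line with the outputs shown for the two example datasets.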
mix.Read(file, eps=1e-6)
Reads mixture data from the file and returns them as a mixture object. If |m$sum - 1| < eps it sets m$sta=3. The default value for eps is 1e-6.
mix.ReadML(file, eps=1e-6)
Reads a 'CoDa' data file and returns a mixture object. If |m$sum - 1| < eps it sets m$sta=3. The default value for eps is 1e-6.
mix.Check(m, eps=1e-6)
Determines the m$sum and m$sta of a given mixture object m. The default value for eps is 1e-6.
mix.Normalize(m, c=1)
Normalizes a given mixture object m if m$sta>=0. The row sums are normalized to the constant c, with default value c=1.
mix.Matrix(a, t)
Joins a matrix data a and the title t into a mixture object.
mix.Random(nr, nc, c=1)
Constructs a random mix object with nr rows and nc columns and constant row sum c, with default value c=1. The command matrix(runif(nr*nc), nr, nc) is applied to calculate the matrix elements, where the runif(n, min=0, max=1) function generates random deviates of the uniform distribution. Then the row sums are normalized to the constant c.

The subcomposition routines
mix.Sub(m, k, Normalize=TRUE)
The output mix object is computed as a subcomposition of m without the columns given by the components of the vector k. The output mix object is normalized if Normalize=TRUE, the default value.
mix.SubRes(m, k)
Output is the normalized subcomposition without the columns given by the components of the vector k, which are amalgamated in the residual.
mix.Extract(m, k, Normalize=TRUE)
Gives the subcomposition of m with only the columns given by the components of the vector k, normalized if Normalize=TRUE, the default value.
mix.ExtractRes(m, k)
Gives the subcomposition with the columns given by the components of the vector k; all the rest is amalgamated in the residual. Output is the normalized mixture object with length(k)+1 columns.

Example
Determine the three-part subcompositions of activity data comprising teaching, consulting and research activities, therefore excluding the 3rd, 5th and 6th columns. To define a vector in R the c (concatenate) command is applied.
> m <- mix.Read('activity.dat')
> mix.Sub(m, c(3,5,6))
$tit
[1] "Researcher's daily activities"
$sum
[1] 1
$sta
[1] 3
$mat
    teac  cons  rese
1  0.497 0.126 0.377
2  0.635 0.124 0.241
...
20 0.443 0.171 0.387
attr(,"class")
[1] "mixture"

Example
To determine the three-part subcompositions of activity data on teaching and research activities, with all the rest (the 2nd, 3rd, 5th and 6th variables) amalgamated in the residual.
> mix.ExtractRes(m, c(1,4))
$tit
[1] "Researcher's daily activities"
$sum
[1] 1
$sta
[1] 3
$mat
    teac  rese residual
1  0.162 0.123    0.715
2  0.200 0.076    0.724
...
20 0.166 0.145    0.689
attr(,"class")
[1] "mixture"

Visualization in the ternary diagram routine
The mix.Ternary routine draws a ternary diagram with points marking the data. The routine mix.Ternary(m, dist, distG, cls, Centre, Borders, Gmean) has the following parameters:
m        the mix object,
dist     displaces the numbers marking percentile lines for additional space given by the components of the vector dist. The first component is for percentile lines to the vertex No. 1 = top, the second to the vertex No. 2 = right, and the third to the vertex No. 3 = left, i.e. each component corresponds to one vertex. This displacing is needed to prevent overlaying. The default value is dist=c(0.05, 0.05, 0.05),
distG    displaces the numbers marking percentile lines of the geometric mean for additional space to prevent overlaying. Additional space is given by the components of the vector distG, each component corresponds to one vertex. The default value is distG=c(0.05, 0.05, 0.05),
cls      colours of the percentile lines,
Centre   centres the dataset if Centre=TRUE,
Borders  draws border percentile lines if Borders=TRUE,
Gmean    draws the geometric mean of the data if Gmean=TRUE.
The default value for the options Centre, Borders and Gmean is FALSE.

Example
In Figure 1, the plots of activity data subcompositions in ternary diagrams with the geometric mean and the border percentile lines are produced by the following commands
> mix.Ternary(mix.Sub(m, c(3,5,6)), distG=c(.1, .1, .1), Gmean=T)
> mix.Ternary(mix.Sub(m, c(3,5,6)), Borders=T, cls=c("red", "magenta", "blue"))
Fig. 1. Three-part (teaching, consultation and research) subcompositions with (a) the geometric mean; and (b) border percentile lines.
In Figure 2, the plots of activity data subcompositions (with residual) in ternary diagrams with border percentile lines are produced by the following commands
> mix.Ternary(mix.SubRes(m, c(2,3,5,6)), dist=c(.05, .1, .05), Borders=T)
> mix.Ternary(mix.SubRes(m, c(2,3,5,6)), dist=c(.05, .1, .05), Borders=T, Centre=T)

Example
The glacial dataset (Table 2) consists of percentages by weight for 92 samples of pebbles of glacial tills sorted into four categories - red sandstone, grey sandstone, crystalline and miscellaneous. The percentages by weight of these four categories and the total pebble counts are recorded. The data are stored in a CoDa data file format.
> m <- mix.ReadML('glacial.dat')
> m
$tit
[1] "GLACIAL DATA 92 samples of pebbles of glacial tills sorted into four categories percentages by weight"
$sum
[1] NA
$sta
[1] 0
$mat
      A    B   C   D Count
1  91.8  7.1 1.1 0.0   282
2  88.9 10.1 0.5 0.5   368
...
90 15.9 83.3 0.8 0.0   245
91 16.9 74.3 1.2 5.9   575
92 31.4 65.9 2.7 0.0   698
attr(,"class")
[1] "mixture"
> mix.Ternary(mix.Sub(m, c(4,5)), Gmean=T)
> mix.Ternary(mix.Sub(m, c(4,5)), Borders=T, Centre=T)
See Figure 3.

The percentile lines routine
The routine that draws percentile lines into a drawn ternary diagram is percentile.lines(y, direction, cls, dist, lt) with parameters
y          the vector of percents or decimal values of percentile lines;
direction  directions for percentile lines with values 1 - percentile lines to the vertex No. 1 = top, 2 - percentile lines to the vertex No. 2 = right, and 3 - percentile lines to the vertex No. 3 = left. The default value is direction=1:3, i.e. all directions;
cls        the vector with colours, each component corresponds to one vertex. The default value is cls=c("yellow", "yellow2", "yellow3"), visible on the screen, but for printing or presentation stronger colours are advised;
dist       moves the numbers marking the percentile lines for additional space given by the components of the vector dist, to prevent overlaying. The default value is dist=c(0.05, 0.05, 0.05);
lt         is the vector with line types, the lty parameter in R graphics routines (values 1, 2, ..., 10), each component corresponds to one vertex. The default value is lt=c(4,3,2).
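The geometry behind the ternary diagram and its percentile lines can be sketched in a few lines of base R. The sketch below assumes an equilateral triangle with vertex No. 1 at the top, No. 2 at the right and No. 3 at the left, as in the parameter descriptions; the function names and the chosen coordinates are assumptions for illustration and are not taken from the MixeR sources.

```r
# Plane coordinates of the three vertices (an assumed layout):
# vertex 1 = top, vertex 2 = right, vertex 3 = left
vertices <- rbind(top   = c(0.5, sqrt(3) / 2),
                  right = c(1, 0),
                  left  = c(0, 0))

# Map three-part compositions (rows of x) to plane coordinates:
# each point is the weighted mean of the vertices
ternary_xy <- function(x) {
  x <- if (is.null(dim(x))) matrix(x, nrow = 1) else as.matrix(x)
  x <- x / rowSums(x)
  xy <- x %*% vertices
  colnames(xy) <- c("x", "y")
  xy
}

# End points of the percentile line "p per cent of part 1":
# the segment joining (q, 1 - q, 0) and (q, 0, 1 - q) with q = p / 100
percentile_line <- function(p) {
  q <- p / 100
  ternary_xy(rbind(c(q, 1 - q, 0), c(q, 0, 1 - q)))
}

print(ternary_xy(c(0.497, 0.126, 0.377)))  # day 1 of the three-part example
print(percentile_line(50))                 # 50% line towards vertex No. 1
```

The returned coordinates can be passed to the standard plot, points and lines functions to draw the triangle, the data and the percentile lines.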
Fig. 2. Three-part (teaching, research and residual) subcompositions with (a) border percentile lines; and (b) centred for better visualization of the differences between cases. To avoid misunderstanding of this centred visualization, border percentile lines with exact maximum variation values are obligatory.
Fig. 3. Three-part (red sandstone, grey sandstone and crystalline) subcompositions, with (a) geometric mean; and (b) centred for better visualization of the differences between cases - border percentile lines showing actual variation.

Example
A normalized mix object m with nine cases and three variables, i.e. a 9 x 3 matrix, is constructed, having 0.1 to 0.9 values in the first column and ratios of one half between the second and third. A ternary diagram is drawn with these nine points in different colours (cls), different shapes (pch) and the size cex=1 (see Fig. 4a).
$tit
[1] "Deciles values in the first column"
$sum
[1] 1
$sta
[1] 3
$mat
   aa         bb         cc
1 0.1 0.30000000 0.60000000
2 0.2 0.26666670 0.53333330
3 0.3 0.23333330 0.46666670
4 0.4 0.20000000 0.40000000
5 0.5 0.16666670 0.33333330
6 0.6 0.13333330 0.26666670
7 0.7 0.10000000 0.20000000
8 0.8 0.06666667 0.13333330
9 0.9 0.03333333 0.06666667
attr(,"class")
[1] "mixture"
> cls <- c("khaki", "pink", "sienna", "plum", "orchid", "tomato", "tan", "violet", "purple")
> mix.Ternary(m, col=cls, pch=0:8, cex=1)
> perc.lines(10*1:9, dir=1, cls="cyan", lt=1)
To draw the ternary diagram in Figure 4b, the mix.Random(nr, nc, s=1) routine is used, which constructs a random mix object with nr rows and nc columns with a constant row sum s.
> mix.Ternary(mix.Random(22,3))
> perc.lines(10*1:9, cls=c("blue", "blueviolet", "violet"))
Fig. 4. Three-part compositions: (a) decile values in the first column, constant ratios of one half between the second and the third column plotted in the ternary diagram with decile lines in the first direction; (b) ternary diagram with 22 random points and decile lines in all three directions.

The ratio lines routine
The command that draws lines of constant ratios of two components into a drawn ternary diagram is ratio.lines(y, direction, cls, dist). The routine parameters are
y          the vector of ratios for ratio lines;
direction  directions for ratio lines with value 1 - ratio lines to the vertex No. 1, i.e. x2/x3 = y, 2 - ratio lines to the vertex No. 2, i.e. x1/x3 = y, and 3 - ratio lines to the vertex No. 3, i.e. x1/x2 = y. The default value is direction=1:3, which stands for all the directions;
cls        the vector with colours, each component corresponds to one vertex. The default value is cls=c("green", "green3", "yellowgreen"), visible on the screen, but for printing or presentation stronger colours are advised;
dist       moves the numbers marking ratio lines for additional space given by the components of the vector dist, to prevent overlaying. The default value is dist=c(0.05, 0.05, 0.05).

Example
A matrix with nine cases and three variables is constructed, the first triple of cases having a constant ratio of one half between the second and third variables, the second triple between the first and third, and so on; each triple is coloured (see Fig. 5a). In this ternary diagram we draw the 1/7, 1/3, 1/2, 1, 2, 3 and 4 ratio lines to all three sides (see Fig. 5b).
> m <- matrix(c(7, 4, 1, 1, 2, 3, 1, 2, 3, 1, 2, 3, 7, 4, 1, 2, 4, 6, 2, 4, 6, 2, 4, 6, 7, 4, 1), nrow=9)
> m <- mix.Normalize(mix.Matrix(m, 'Constant ratios'))
> m
$tit
[1] "Constant ratios"
$sum
[1] 1
$sta
[1] 3
$mat
     [,1] [,2] [,3]
[1,]  0.7  0.1  0.2
[2,]  0.4  0.2  0.4
[3,]  0.1  0.3  0.6
[4,]  0.1  0.7  0.2
[5,]  0.2  0.4  0.4
[6,]  0.3  0.1  0.6
[7,]  0.1  0.2  0.7
[8,]  0.2  0.4  0.4
[9,]  0.3  0.6  0.1
$class
[1] "mixture"
> t <- c(rep(1,3), rep(2,3), rep(3,3))
> co <- c("red", "magenta", "blue") # different colours of points
> mix.Ternary(m, col=co[t]) # draws ternary diagram
> ratio.lines(c(1/3,1/2,1,2,3,4,1/7), cls=co) # draws ratio lines
Fig. 5. Three-part compositions with (a) constant ratios of all combinations of two components, and (b) with the constant ratio lines also drawn for ratios 1/7, 1/3, 1/2, 1, 2, 3 and 4 in all three directions.
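Why lines of constant ratio are straight lines through a vertex can be checked numerically. The base-R sketch below, with the hypothetical helper ratio_line, computes the two end points of such a line as compositions and verifies the ratios of the nine cases of the example; it illustrates the idea only and is not the ratio.lines routine itself.

```r
# End points (as three-part compositions) of the line of constant ratio r.
# direction = 1: x2/x3 = r, 2: x1/x3 = r, 3: x1/x2 = r (as in the text).
ratio_line <- function(r, direction = 1) {
  parts <- list(c(2, 3), c(1, 3), c(1, 2))[[direction]]
  foot <- numeric(3)
  foot[parts] <- c(r, 1) / (1 + r)   # point on the edge opposite the vertex
  apex <- numeric(3)
  apex[direction] <- 1               # the vertex itself
  rbind(apex = apex, foot = foot)
}

# The nine cases of the example, normalized to row sum 1
m <- matrix(c(7, 4, 1, 1, 2, 3, 1, 2, 3,
              1, 2, 3, 7, 4, 1, 2, 4, 6,
              2, 4, 6, 2, 4, 6, 7, 4, 1), nrow = 9)
m <- m / rowSums(m)

ratios <- c(m[1:3, 2] / m[1:3, 3],   # first triple:  x2/x3
            m[4:6, 1] / m[4:6, 3],   # second triple: x1/x3
            m[7:9, 1] / m[7:9, 2])   # third triple:  x1/x2
print(ratios)

# Any mixture of the two end points keeps the ratio x2/x3
ends <- ratio_line(0.5, direction = 1)
p <- 0.3 * ends["apex", ] + 0.7 * ends["foot", ]
print(p[2] / p[3])
```

Mapping the two end points to plane coordinates then gives the segment that is drawn in the diagram.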
Visualization with the tetrahedron routine
The mix.Q2kin routine transforms a four-part mixture m into three-dimensional XYZ coordinates using quadrays transformations and saves them as a file.kin. This transformation applies Quadrays and XYZ by K. Urner and Quadray formulas by T. Ace, available on the web. The kin file is displayed as a 3D animation using the KiNG or MAGE viewer, free software available at http://kinemage.biochem.duke.edu. The mix.Q2kin(kinfile, m, clu=NULL, vec=NULL, king=TRUE, scale=0.2, col=1) routine's parameters are
kinfile  the name of a file.kin,
m        the mix object with four variables,
clu      partition determining the colours of points,
vec      vector of values determining point sizes,
king     FALSE for Mage, TRUE for KiNG,
scale    relative size of points, and
col      colour of points if clu=NULL.

Example
From the activity data mix object m, a four-part composition will be constructed with the variables teaching, consulting and research, administration work, and leisure comprising other wakeful activities and sleep. The Aitchison distance will be computed and the complete linkage classification method performed. Out of the dendrogram four clusters will be detected and drawn with the mix.Q2kin command in an ac4.kin file to be displayed with the KiNG viewer in a tetrahedron using different colours (Figs 6 and 7).
> m$mat <- cbind(m$mat[,1], m$mat[,2]+m$mat[,4], m$mat[,3], m$mat[,5]+m$mat[,6])
> dimnames(m$mat)[[2]] <- c("teach", "cons&rese", "admi", "leas")
> m
$tit
[1] "Researcher's daily activities"
$sum
[1] NA
$sta
[1] 1
$mat
   teach cons&rese  admi  leas
1  0.162     0.164 0.138 0.398
2  0.200     0.115 0.073 0.490
...
19 0.167     0.195 0.127 0.410
20 0.166     0.209 0.101 0.386
> d <- mix.Ait(m) # computes a matrix of Aitchison's distances between cases
> hc <- hclust(d, method="complete", members=NULL) # performs the complete linkage classification method
> plot(hc, labels=NULL, hang=0.1, main="Cluster Dendrogram", ylab="Aitchison distance")
> mix.Quad2kin('ac4.kin', m, clu=cutree(hc,4), scale=0.2)
Fig. 6. Activity dataset classification represented by a dendrogram.
Fig. 7. Two snapshots of 3D KiNG view of tetrahedral display of the activity data - four-part compositions.
The kinemage is a dynamic, 3D layout. One can take advantage of that:
• by rotating it and twisting it around with a mouse click near the centre of the graphics window and slowly dragging right or left, up or down;
• by clicking on points with the mouse, the label associated with each point will appear in the bottom left of the graphics area;
• also the (Euclidean) distance from this point to the previous one will be displayed;
• the right button drag can be used to zoom in and out of the picture.
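The distance and classification steps of the example can be reproduced without MixeR. The following base-R sketch computes Aitchison distances as Euclidean distances between clr-transformed rows and then applies complete linkage clustering; the helper names are assumptions, only six days of Table 1 are used, and two rather than four clusters are cut because of the small sample.

```r
# clr transformation and Aitchison distance (illustrative helper names)
clr <- function(x) {
  lx <- log(as.matrix(x))
  lx - rowMeans(lx)
}
aitchison_dist <- function(x) dist(clr(x))  # Euclidean distance of clr rows

# Days 1, 2, 3, 4, 5 and 10 of the activity data (Table 1)
act <- rbind(c(0.162, 0.041, 0.138, 0.123, 0.254, 0.282),
             c(0.200, 0.039, 0.073, 0.076, 0.346, 0.266),
             c(0.201, 0.082, 0.115, 0.146, 0.194, 0.261),
             c(0.134, 0.077, 0.107, 0.146, 0.214, 0.321),
             c(0.224, 0.080, 0.091, 0.162, 0.195, 0.248),
             c(0.108, 0.052, 0.082, 0.075, 0.413, 0.270))
rownames(act) <- c(1, 2, 3, 4, 5, 10)

# Four-part composition: teaching, consultation + research,
# administration, leisure (other wakeful activities + sleep)
four <- cbind(teach     = act[, 1],
              cons_rese = act[, 2] + act[, 4],
              admi      = act[, 3],
              leisure   = act[, 5] + act[, 6])

d  <- aitchison_dist(four)
hc <- hclust(d, method = "complete")  # complete linkage classification
groups <- cutree(hc, k = 2)           # cluster membership of each day
print(round(d, 3))
print(groups)
```

Because the clr transformation removes the row total, the distances do not change when the rows are rescaled, which is why no prior normalization is needed in this sketch.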
This layout also supports colouring and different sizing of points.
Zero values replacement routines
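As a preview of the two replacement strategies treated in this section, the base-R sketch below implements a simple and a multiplicative zero replacement for a single imputed value. It follows our reading of the strategies discussed by Martín-Fernández et al. (2003) and is not the MixeR code: the function names are assumptions, and the Col option and vector-valued imputed values are deliberately left out.

```r
# Simple replacement: put delta in place of each 'zero', then rescale
# every row back to its original row sum.
zero_replace_simple <- function(x, delta, eps = 1e-6) {
  x <- as.matrix(x)
  total <- rowSums(x)
  x[abs(x) < eps] <- delta
  x / rowSums(x) * total
}

# Multiplicative replacement: put delta in place of each 'zero' and shrink
# the non-zero parts of the row so that the row sum is preserved.
zero_replace_mult <- function(x, delta, eps = 1e-6) {
  x <- as.matrix(x)
  total <- rowSums(x)
  zero <- abs(x) < eps
  shrink <- 1 - rowSums(zero) * delta / total
  ifelse(zero, delta, x * shrink)
}

# Two rows with row sum 4: one zero, and two zeros
x <- rbind(c(1, 1, 1, 1, 0),
           c(0, 0, 2, 1, 1))
print(zero_replace_simple(x, 1))
print(zero_replace_mult(x, 1))
```

Both functions preserve the row sums; the multiplicative variant leaves the ratios between the non-zero parts of a row unchanged, whereas the simple variant alters the ratios between the imputed and the observed parts after rescaling.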
The routine that checks data for 'zero' and negative values is mix.CheckData(m, eps=1e-6, Detect=FALSE) with parameters
m       the mix object,
eps     turning point value for zero checking, and
Detect  if TRUE, output is also the mix object with matrix elements equal to -1 at negative data values, 0 at zero values, i.e. |m$mat[i,j]| < eps, and 1 at all positive values.
Output is the column number - i, the row number - j, the data value and the consecutive number of negative and/or zero values. If there are no negative and/or 'zero' values, output is "No negative, no zero values in the data".

Example
Here is a mix object m with nine cases, negative values in the first column and 0.1 to 1e-09 in the last.
$tit
[1] "Negative values in first column, 0.1 to 1e-09 in the last."
$sum
[1] NA
$sta
[1] -2
$mat
  aa bb cc dd ee    ff
1 -3  3  3  3  3 1e-01
2 -3  3  3  3  3 1e-02
3 -3  3  3  3  3 1e-03
4 -3  3  3  3  3 1e-04
5 -3  3  3  3  3 1e-05
6 -3  3  3  3  3 1e-06
7 -3  3  3  3  3 1e-07
8 -3  3  3  3  3 1e-08
9 -3  3  3  3  3 1e-09
$class
[1] "mixture"
> mix.CheckData(m, Detect=F)
negative values in the data
negative value column 1 row 1 value -3 No. 1
negative value column 1 row 2 value -3 No. 2
...
negative value column 1 row 8 value -3 No. 8
negative value column 1 row 9 value -3 No. 9
zero values in the data
zero value column 6 row 7 value 1e-07 No. 1
zero value column 6 row 8 value 1e-08 No. 2
zero value column 6 row 9 value 1e-09 No. 3

The mix.ZeroReplaceSimp routine replaces all the 'zero' (i.e. less than eps) values applying the simple replacement strategy - see Martín-Fernández et al. (2003). Output is a mix object with all 'zero' values replaced and the row sums preserved. In the routine mix.ZeroReplaceSimp(m, de, Col=TRUE, eps=1e-6) the parameters are
m    the mix object,
de   vector of imputed values,
eps  value for zero checking, and
Col  if TRUE one component of de is imputed in each column. If the length of de is less than the number of columns, i.e. variables, de is repeated to adequate length. If Col=FALSE, one component of de is imputed in each row with zero components; no length adjustment is made. The default value is Col=TRUE.

Example
Take a mix object m with five 'variables' and ten 'cases', with one or two zeros in a row and each row sum equal to four. Apply the mix.ZeroReplaceSimp command, first with imputed value equal to 1, second with the imputed values vector de=(1:10)*.1, i.e. comprising values 0.1, 0.2, ..., 1.
$tit
[1] "One, two zeros in a row and each row sum is 4."
$sum
[1] 4
$sta
[1] 0
$mat
   aa bb cc dd ee
1   1  1  1  1  0
2   1  1  1  0  1
3   1  1  0  1  1
4   1  0  1  1  1
5   0  1  1  1  1
6   0  0  2  1  1
7   1  0  0  2  1
8   1  1  0  0  2
9   2  1  1  0  0
10  0  2  1  1  0
$class
[1] "mixture"

> mix.ZeroReplaceSimp(m,1)
$tit
[1] "One, two zeros in a row and each row sum is 4."
$sum
[1] 4
$sta
[1] 2
$mat
          aa        bb        cc        dd        ee
1  0.8000000 0.8000000 0.8000000 0.8000000 0.8000000
2  0.8000000 0.8000000 0.8000000 0.8000000 0.8000000
3  0.8000000 0.8000000 0.8000000 0.8000000 0.8000000
4  0.8000000 0.8000000 0.8000000 0.8000000 0.8000000
5  0.8000000 0.8000000 0.8000000 0.8000000 0.8000000
6  0.6666667 0.6666667 1.3333333 0.6666667 0.6666667
7  0.6666667 0.6666667 0.6666667 1.3333333 0.6666667
8  0.6666667 0.6666667 0.6666667 0.6666667 1.3333333
9  1.3333333 0.6666667 0.6666667 0.6666667 0.6666667
10 0.6666667 1.3333333 0.6666667 0.6666667 0.6666667
$class
[1] "mixture"

> round(mix.ZeroReplaceSimp(m, (1:10)*.1, Col=FALSE)$mat, 2)
     aa   bb   cc   dd   ee
1  0.98 0.98 0.98 0.98 0.10
2  0.95 0.95 0.95 0.19 0.95
3  0.93 0.93 0.28 0.93 0.93
4  0.91 0.36 0.91 0.91 0.91
5  0.44 0.89 0.89 0.89 0.89
6  0.46 0.46 1.54 0.77 0.77
7  0.74 0.52 0.52 1.48 0.74
8  0.71 0.71 0.57 0.57 1.43
9  1.38 0.69 0.69 0.62 0.62
10 0.67 1.33 0.67 0.67 0.67

In the second command the simple replacement strategy is used, imputing values 0.1, 0.2, ..., 1 in rows 1 to 10, and the results are rounded to two decimals. It should be noted that the last row is the same (as it should be) as in the former case, where the imputed value was 1 in each column.

The routine for the Multiplicative replacement strategy - see Martín-Fernández et al. (2003) - is mix.ZeroReplaceMult(m, de, eps=1e-6). Its parameters are

m     the mix object,
de    vector of imputed values; if the length of de is smaller than the number of columns (i.e. variables), de is repeated to adequate length, and
eps   value for the 'zero' checking.

Output is a mix object where each 'zero' value is replaced and the row sums are preserved.

Example
The mix.ZeroReplaceMult is applied to the mix object m from the previous example.

> mix.ZeroReplaceMult(m,1)
$tit
[1] "One, two zeros in a row and each row sum is 4."
$sum
[1] 4
$sta
[1] 2
$mat
     aa   bb   cc   dd   ee
1  0.75 0.75 0.75 0.75 1.00
2  0.75 0.75 0.75 1.00 0.75
3  0.75 0.75 1.00 0.75 0.75
4  0.75 1.00 0.75 0.75 0.75
5  1.00 0.75 0.75 0.75 0.75
6  1.00 1.00 1.00 0.50 0.50
7  0.50 1.00 1.00 1.00 0.50
8  0.50 0.50 1.00 1.00 1.00
9  1.00 0.50 0.50 1.00 1.00
10 1.00 1.00 0.50 0.50 1.00
$class
[1] "mixture"

With the Multiplicative replacement strategy the value of the data replacing the zeros is the exact imputed value, i.e. 1.
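The arithmetic behind the two strategies can be checked by hand on single rows of the example. The following Python sketch is not the MixeR implementation: it assumes one reading of the two strategies (a single imputed value delta per row, row sum preserved) that is consistent with the outputs of mix.ZeroReplaceSimp(m,1) and mix.ZeroReplaceMult(m,1) listed in this section. The function names are hypothetical, and Python is used only because the arithmetic is language-independent.

```python
def replace_zeros_simple(row, delta, eps=1e-6):
    """Simple replacement (assumed form): impute delta for every 'zero',
    then rescale the whole row back to its original sum."""
    total = sum(row)
    filled = [delta if abs(v) < eps else v for v in row]
    scale = total / sum(filled)
    return [v * scale for v in filled]


def replace_zeros_mult(row, delta, eps=1e-6):
    """Multiplicative replacement (assumed form): every 'zero' becomes exactly
    delta; non-zero parts shrink by a common factor so the row sum is kept."""
    total = sum(row)
    n_zero = sum(1 for v in row if abs(v) < eps)
    shrink = 1.0 - n_zero * delta / total
    return [delta if abs(v) < eps else v * shrink for v in row]


row_one_zero = [1, 1, 1, 1, 0]    # case 1 of the example, row sum 4
row_two_zeros = [0, 0, 2, 1, 1]   # case 6 of the example, row sum 4

print(replace_zeros_simple(row_one_zero, 1))
print(replace_zeros_mult(row_one_zero, 1))
print(replace_zeros_simple(row_two_zeros, 1))
print(replace_zeros_mult(row_two_zeros, 1))
```

Under these assumptions the simple strategy turns the imputed 1 into 0.8 (one zero) or about 0.667 (two zeros), whereas the multiplicative strategy leaves the imputed value at exactly 1 and scales only the non-zero parts, which is the difference pointed out in the text.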
Example
In the glacial dataset there are 48 zero values in the Crystalline and Miscellaneous variables. Because this kind of zero is usually understood as 'a trace too small to measure', it seems reasonable to replace them by a suitably small value. The smallest value detected is used to replace the zero value. The command mix.Sub(m, 5) produces the four-part normalized subcomposition and the min command of the third and of the fourth column produces the lowest detected value equal to 0.001. Then the Multiplicative replacement strategy is applied.

> mix.CheckData(m, Detect=T)
zero values in the data
zero value column 3 row 4 value 0 No. 1
zero value column 3 row 14 value 0 No. 2
...
zero value column 4 row 90 value 0 No. 47
zero value column 4 row 92 value 0 No. 48
No negative values in the data
$tit
[1] "Noting -1 for neg., 0 for zero and 1 for positive values"
$sum
[1] NA
$sta
[1] 0
$mat
      A B C D Count
 [1,] 1 1 1 0     1
 [2,] 1 1 1 1     1
 [3,] 1 1 1 1     1
 [4,] 1 1 0 1     1
 ...
[90,] 1 1 1 0     1
[91,] 1 1 1 1     1
[92,] 1 1 1 0     1
$class
[1] "mixture"

> m5 <- mix.Sub(m,5)
> mm <- mix.ZeroReplaceMult(m5,.001)
$tit
[1] "GLACIAL DATA 92 samples of pebbles of glacial tills sorted into four categories percentages by weight"
$sum
[1] 1
$sta
[1] 3
$mat
            A           B           C           D
1  0.91708200 0.070929000 0.010989000 0.001000000
2  0.88900000 0.101000000 0.005000000 0.005000000
3  0.87300000 0.109000000 0.008000000 0.010000000
4  0.84515400 0.134865000 0.001000000 0.018981000
...
90 0.15884100 0.832167000 0.007992000 0.001000000
91 0.17192269 0.755849440 0.012207528 0.060020346
92 0.31368600 0.658341000 0.026973000 0.001000000
$class
[1] "mixture"

The same result can be obtained using a slightly more complicated procedure:

> m <- mix.Sub(m, 5, Normalize=F)
> m <- mix.ZeroReplaceMult(m, .1)
> m <- mix.Normalize(m)

It must be emphasized that the straightforward

> mix.ZeroReplaceMult(mix.Sub(m, 5), .1)

is inadequate because the mix.Sub routine normalizes the data and the 0.1 imputed value is not adequate any more. The values in the rows where there were zero values (first, fourth, ...) will now show new proportions, not at all coherent with the original data.

Example
From the glacial data mix object mm a mix object will be constructed with the pebble counts as the case names. Again a dendrogram and tetrahedron will be drawn with the data points, where classes will be marked with different colours (see Figs 8 and 9).

> dimnames(mm$mat)[[1]] <- m$mat[,5]
> mm
$tit
[1] "b.glacial.dat"
$sum
[1] 1
$sta
[1] 0
$mat
             A           B           C           D
282 0.91708200 0.070929000 0.010989000 0.001000000
368 0.88900000 0.101000000 0.005000000 0.005000000
607 0.87300000 0.109000000 0.008000000 0.010000000
532 0.84515400 0.134865000 0.001000000 0.018981000
...
245 0.15884100 0.832167000 0.007992000 0.001000000
575 0.17192269 0.755849440 0.012207528 0.060020346
698 0.31368600 0.658341000 0.026973000 0.001000000
attr(,"class")
[1] "mixture"

> d <- mix.Ait(mm)
> hc <- hclust(d, method="complete", members=NULL)
> plot(hc, labels=m$mat[,5], hang=0.1, main="Cluster Dendrogram", ylab="Aitchison distance")
> mix.Quad2kin('glac5.kin', mm, clu=cutree(hc,5), scale=0.14)
> mix.Quad2kin('glac8.kin', mm, clu=cutree(hc,8), scale=0.14)
Fig. 8. Glacial dataset classification represented by a dendrogram.

Fig. 9. Two snapshots of glacial data 3D display in the tetrahedron. The snapshot of 3D KiNG view of the glacial data is classified into (a) five classes and (b) eight classes.

Conclusions
Some MixeR routines for reading different data file formats have been presented, together with an explanation of visualization of three- and four-part (sub)compositions with additional features and also the computation of distances between compositional data. The use of some R routines for classification and plotting dendrograms has also been demonstrated. The mix routines are available at http://vlado.fmf.uni-lj.si/pub/MixeR

With the authors of the 'compositions' package, support is provided for a complementary use of the 'compositions' package and MixeR routines: transformations from the mix object to the objects of the four different classes (rplus, rcomp, acomp and aplus) are implemented in the 'compositions' package, and the transformations from the four classes to the mix objects can be found in the MixeR library routines. With these routines it is hoped to enable users to apply and to benefit from both the 'compositions' package and the MixeR library routines.

References
AITCHISON, J. 1986. The Statistical Analysis of Compositional Data. Chapman & Hall, New York.
BREN, M. & BATAGELJ, V. 2003. Compositional data analysis with R. In: Proceedings CoDaWork'03, Girona, Spain, 111-122.
MARTÍN-FERNÁNDEZ, J. A., BARCELÓ-VIDAL, C., BREN, M. & PAWLOWSKY-GLAHN, V. 1999. A measure of difference for compositional data based on measures of divergence. In: Proceedings of the 5th Annual Conference of the International Association for Mathematical Geology, Trondheim, Norway, 1, 211-215.
MARTÍN-FERNÁNDEZ, J. A., BARCELÓ-VIDAL, C. & PAWLOWSKY-GLAHN, V. 2003. Dealing with zeros and missing values in compositional data sets using nonparametric imputation. Mathematical Geology, 35 (3), 253-278.
REYNOLDS, J. H. & BILLHEIMER, D. 2002. Basic Compositional Data Analysis Functions for S+/R. World Wide Web Address: http://www.biostat.wustl.edu/archives/html/s-news/2003-12/msg00139.html.
THIÓ-HENESTROSA, S. & MARTÍN-FERNÁNDEZ, J. A. 2006. Detailed guide to CoDaPack: a freeware compositional software. In: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 101-118.
VAN DER BOOGAART, K. G. & TOLOSANA, R. 2005. The compositions Package. World Wide Web address: http://cran.r-project.org/src/contrib/Descriptions/compositions.html.
VAN DER BOOGAART, K. G. & TOLOSANA-DELGADO, R. 2006. Compositional data analysis with 'R' and the package 'compositions'. In: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 119-127.
Simplicial geometry for compositional data
J. J. EGOZCUE¹ & V. PAWLOWSKY-GLAHN²
¹Departament Matemàtica Aplicada III, Universitat Politècnica de Catalunya, Jordi Girona Salgado 1-3, C2, E-08034 Barcelona, Spain (e-mail: [email protected])
²Departament Informàtica i Matemàtica Aplicada, Universitat de Girona, Campus Montilivi, P4, E-17071 Girona, Spain
Abstract: The main features of the Aitchison geometry of the simplex of D parts are reviewed. Compositions are positive vectors in which the relevant information is contained in the ratios between their components or parts. They can be represented in the simplex of D parts by closing them to a constant sum, e.g. percentages, or parts per million. Perturbation and powering in the simplex of D parts are, respectively, an internal operation playing the role of a sum, and an external product by real numbers or scalars. These operations give the simplex of D parts the structure of a (D - 1)-dimensional vector space. An inner product, norm and distance, compatible with perturbation and powering, complete the structure of the simplex, a structure known in mathematical terms as a Euclidean space. This general structure allows the representation of compositions by coordinates with respect to a basis of the space, particularly an orthonormal basis. The interpretation of the so-called balances, coordinates with respect to orthonormal bases associated with groups of parts, is stressed. Subcompositions and balances are interpreted as orthogonal projections. Finally, log-ratio transformations (alr, clr and ilr) are considered in this geometric context.
This contribution provides a summary of the main definitions and properties of the geometry of the simplex that are the foundation of the statistics of compositional data. However, the subject of the statistics of compositional data is not taken up.

The geometry of the simplex of D parts has been developed over the last three decades. The first steps are due to J. Aitchison (1982, 1983, 1986) in papers focused mainly on the statistical analysis of compositional data. The complete development of the geometry was not achieved until the first years of the present decade (Billheimer et al. 2001; Barceló-Vidal et al. 2001; Pawlowsky-Glahn & Egozcue 2001, 2002; Aitchison et al. 2002; Egozcue et al. 2003; Egozcue & Pawlowsky-Glahn 2005). A historical point of view of both geometry and statistics of compositional data can be found in Aitchison & Egozcue (2005).

The first section begins with the fundamental concepts of compositional data and the simplex. Operations (perturbation and powering) and straight lines in the simplex are presented in the second section. They define the structure of linear space. Metric characteristics (distance, norm and inner product) in the simplex are developed in the third section. With these definitions, the simplex of D parts has the structure of a Euclidean space of dimension D - 1. Orthonormal bases, coordinates of compositions and their interpretation are treated in the fourth section. The last section summarizes two traditional representations of compositions, alr and clr.

Essentially, the representation of compositional data using alr and clr transformations was introduced in the early contributions of J. Aitchison (1982, 1983, 1986); they played a major role in the development of theory. From the present viewpoint, they are not required for introducing the Euclidean structure of the simplex. The Aitchison distance was introduced (Aitchison 1983) prior to the powering operation (Aitchison 1986) and two decades before the inner product (Barceló-Vidal et al. 2001; Billheimer et al. 2001; Pawlowsky-Glahn & Egozcue 2001). However, Aitchison (1986, p. 85) implicitly used it to define orthogonal log-ratios, which he called orthogonal log-contrasts.
From: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 145-159. 0305-8719/06/$15.00 © The Geological Society of London 2006.

Compositional data and the simplex of D parts
The concept of compositional data is the starting point for the development of all the geometric and algebraic results that are necessary for building up reliable probabilistic and statistical models for such data. Following on earlier developments of compositional data (Aitchison 1986), a compositional vector of D parts, $\mathbf{x} = [x_1, x_2, \ldots, x_D]$, is defined as a vector in which the only relevant information is contained in the ratios between its components. All components of the vector are assumed positive. Throughout the text, components are called parts and a compositional vector is called a composition. The notation of the vector with square brackets means that this vector is considered to be a row vector.

The assertion that all the relevant information is contained in the ratios implies that, if $a$ is a real positive number, $[x_1, x_2, \ldots, x_D]$ and $[a x_1, a x_2, \ldots, a x_D]$ convey essentially the same information and are thus indistinguishable. Then, a composition is a class of equivalent compositional vectors (Barceló-Vidal et al. 2001). A way to simplify the use of compositions is to represent them in closed form, i.e. as positive vectors, the parts of which add up to a positive constant, $K$. Common values of the closure constant $K$ are 1 for parts per unit, 100 for percentages, or $10^6$ for parts per million. However, compositional data are also presented in non-closed form; for example, chemical data are expressed frequently as molar or molal concentrations. They can be translated easily into closed form by multiplying by the molar weight of each component (Buccianti & Pawlowsky-Glahn 2005). A consequence of this is that a composition of D parts, $[x_1, x_2, \ldots, x_D]$, can be identified with a closed vector

$$\mathbf{x} = \mathcal{C}[x_1, x_2, \ldots, x_D] = \left[\frac{K\,x_1}{\sum_{i=1}^{D} x_i}, \frac{K\,x_2}{\sum_{i=1}^{D} x_i}, \ldots, \frac{K\,x_D}{\sum_{i=1}^{D} x_i}\right], \qquad (1)$$
where $\mathcal{C}$ is called the closure operation to the constant $K$ (Aitchison 1986). The set of real positive vectors closed to a constant $K$ is called the simplex of D parts and is denoted $\mathcal{S}^D$ from here on. This notation is not completely standard and sometimes the superscript D for the number of parts is changed to D - 1, indicating dimension or degrees of freedom instead of number of parts.

A composition expressed in parts, $\mathbf{x} = [x_1, x_2, \ldots, x_D]$, closed to $K$, is interpreted intuitively as a mixture of D pure materials or species. In order to give an algebraic expression of such a mixture, each vector of the canonical basis of $\mathbb{R}^D$ is associated with one of them. Hence, the vectors $\mathbf{a}_1 = [1, 0, \ldots, 0]$, $\mathbf{a}_2 = [0, 1, \ldots, 0]$, ..., $\mathbf{a}_D = [0, 0, \ldots, 1]$ represent the D pure materials. The algebraic expression of the mixture is then

$$\mathbf{x} = [x_1, x_2, \ldots, x_D] = x_1\mathbf{a}_1 + x_2\mathbf{a}_2 + \cdots + x_D\mathbf{a}_D, \qquad (2)$$

where the compositional vector $\mathbf{x}$ is still closed to $K$. This is a convex linear combination of the vectors $\mathbf{a}_i$, the coefficients of the combination being the parts of the composition. Note the importance of the closure in equation (2): if $\mathbf{x}$ is closed to $K$, then each $\mathbf{a}_i$ must be closed to 1. The quantities $x_i$ determine the amount of the i-th pure material, expressed in parts per $K$ (unit, cent, million, ...) present in the composition. Equation (2) suggests a compositional representation of pure materials. For example, if $K = 100$, $x_1 = 100$, and $x_2 = x_3 = \cdots = x_D = 0$, the vector $[100, 0, \ldots, 0]$ is obtained. However, this is not a composition in the sense that the ratios between its components are not defined.

The representation of compositions by their parts gives rise to useful graphical diagrams. Compositions of two parts can be plotted as a point in the interval $(0, K)$. Compositions of three or four parts can be represented in ternary or tetrahedral diagrams. Figure 1 shows a ternary diagram in which a point represents a composition. The parts of such a composition are proportional to the lengths of the perpendicular segments from the point to each side of the ternary diagram. The vertices of the ternary diagram correspond to the three pure materials. It is customary to designate them by labels associated with the parts; in Figure 1, $x_1$, $x_2$ and $x_3$.

Fig. 1. Ternary diagram. Representation of two three-part compositions, $\mathbf{n} = \mathcal{C}[1, 1, 1]$ and $\mathbf{x} = [0.50, 0.35, 0.15]$, and the segments proportional to their parts.

Although compositions represented by their closed parts, e.g. percentages, are interpreted easily as mixtures of pure materials, mixing itself is not a purely compositional operation. The non-compositional character of the vectors $\mathbf{a}_i$ associated with pure materials announces the difficulties that appear when trying to develop geometry and statistics of compositional data represented by closed parts and using the mixture (2). The following example shows that, to obtain a particular composition from the pure materials by a mixture, some evaluation of the total amount of material is
needed, and this cannot be done using only changes in ratios of parts.

Assume that three different and pure materials, $A_1$, $A_2$, $A_3$, are available in three containers and that a composition [0.1, 0.3, 0.6] of such materials has to be obtained. The total amount of the mixture obtained does not matter. To proceed, the total available amount of $A_1$, $A_2$ and $A_3$ has to be known. Typical situations may be: (a) the three total amounts are equal; (b) the total amounts are, for example, 0.4, 0.5, 0.1 (units of material). In case (a), 1/10, 3/10, 6/10 of the respective equal amounts can be selected and then mixed. In case (b), to convert 0.1 units of $A_3$ into 60% of the mixture, the original amount must be multiplied by a constant proportional to 6/1, and analogously for $A_1$ and $A_2$ by constants proportional to 1/4 and 3/5, respectively; the constants are specifically 1/24, 3/30 and 1, corresponding to fractions of the original available amounts; finally, these fractions can be mixed to obtain the target composition. In both cases, (a) and (b), the original amounts of the three pure materials have to be known. This operation is not a compositional one, because it does not imply ratios of the available amounts, but an absolute measurement. Once the available amount of each material is measured, one can modify adequately the proportions of the original materials to get the final mixture. This latter operation is compositional, i.e. it only refers to proportions of the three materials involved.

Frequently, attention is centred on a group of parts of a composition. Ratios of parts within the group are considered relevant; ratios involving any part not in the group are ignored. This corresponds to the definition of a subcomposition including only the parts in this group. If $\mathbf{x} = \mathcal{C}[x_1, x_2, \ldots, x_D]$ is a composition in $\mathcal{S}^D$, a group of parts is defined readily by a set of $r$ subscripts; let $R = \{i_1, i_2, \ldots, i_r\}$ be such a set, pointing out the parts $x_{i_1}, x_{i_2}, \ldots, x_{i_r}$. The R-subcomposition of $\mathbf{x}$ is defined as a composition in $\mathcal{S}^r$,

$$\mathrm{sub}(\mathbf{x}; R) = \mathcal{C}[x_{i_1}, x_{i_2}, \ldots, x_{i_r}], \qquad (3)$$

where the closure only affects the $r$ parts in the R-group. For example, consider the composition of three parts $\mathbf{x} = [0.1, 0.5, 0.4]$ and the group of two parts $R = \{1, 3\}$. The R-subcomposition is then a two-part composition, namely, $\mathrm{sub}(\mathbf{x}; R) = \mathcal{C}[0.1, 0.4] = [0.2, 0.8]$.
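The closure operation of equation (1) and the subcomposition of equation (3) can be sketched in a few lines of Python. This is a minimal illustration, not code from the chapter; the function names `closure` and `subcomposition` and the default constant `k=1.0` are assumptions made here.

```python
def closure(x, k=1.0):
    """Rescale a positive vector so that its parts add up to k (equation 1)."""
    total = sum(x)
    return [k * xi / total for xi in x]


def subcomposition(x, parts, k=1.0):
    """Close only the selected parts (equation 3); `parts` holds 1-based subscripts."""
    return closure([x[i - 1] for i in parts], k)


x = [0.1, 0.5, 0.4]
print(subcomposition(x, [1, 3]))    # approximately [0.2, 0.8], as in the text
print(closure([2, 10, 8], k=100))   # same composition as x, expressed in percentages
```

The second call illustrates that vectors differing only by a positive factor are closed to the same composition, which is why the closure constant carries no compositional information.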
Vector space structure of the simplex of D parts
The aim of this section is to present two purely compositional operations known as perturbation and powering of compositions. They were introduced formally by Aitchison (1986). They are the operations that allow one to configure the simplex of D parts, $\mathcal{S}^D$, as a vector space, and thus to define bases and straight lines (Barceló-Vidal et al. 2001; Billheimer et al. 2001; Pawlowsky-Glahn & Egozcue 2001).

Perturbation and powering
First, let us define perturbation of two compositions in $\mathcal{S}^D$. It plays the same role in $\mathcal{S}^D$ as the sum of vectors in real space. If $\mathbf{x} = [x_1, x_2, \ldots, x_D]$ and $\mathbf{y} = [y_1, y_2, \ldots, y_D]$ are in $\mathcal{S}^D$, their perturbation is

$$\mathbf{x} \oplus \mathbf{y} = \mathcal{C}[x_1 y_1, x_2 y_2, \ldots, x_D y_D]. \qquad (4)$$

Perturbation fulfills the standard properties of a commutative group operation, which are:
1. internal operation: $\mathbf{x} \oplus \mathbf{y}$ is in $\mathcal{S}^D$;
2. commutative: $\mathbf{x} \oplus \mathbf{y} = \mathbf{y} \oplus \mathbf{x}$;
3. neutral element: $\mathbf{n} = \mathcal{C}[1, 1, \ldots, 1]$, satisfying $\mathbf{x} \oplus \mathbf{n} = \mathbf{n} \oplus \mathbf{x} = \mathbf{x}$;
4. opposite element: given $\mathbf{x} \in \mathcal{S}^D$ there is an opposite element, $\ominus\mathbf{x} = \mathcal{C}[1/x_1, 1/x_2, \ldots, 1/x_D]$, such that $\mathbf{x} \oplus (\ominus\mathbf{x}) = (\ominus\mathbf{x}) \oplus \mathbf{x} = \mathbf{n}$.
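The group properties listed above can be checked numerically. The sketch below is only an illustration under the assumption that compositions are stored as plain Python lists; `closure`, `perturb` and `opposite` are hypothetical helper names, not part of any library discussed in this book.

```python
def closure(x, k=1.0):
    """Rescale a positive vector so that its parts add up to k."""
    total = sum(x)
    return [k * xi / total for xi in x]


def perturb(x, y, k=1.0):
    """Perturbation (equation 4): closed component-wise product."""
    return closure([xi * yi for xi, yi in zip(x, y)], k)


def opposite(x, k=1.0):
    """Opposite element: closed component-wise reciprocal."""
    return closure([1.0 / xi for xi in x], k)


x = [0.1, 0.5, 0.4]
y = [0.2, 0.3, 0.5]
n = closure([1, 1, 1])             # neutral element

print(perturb(x, y))               # x (+) y
print(perturb(x, n))               # x again, up to rounding
print(perturb(x, opposite(x)))     # the neutral element, up to rounding
# Inputs closed to different constants give the same result.
print(perturb([10, 50, 40], [2, 3, 5]))
```

Because the closure is applied after the component-wise product, the constants to which the two inputs are closed never need to agree.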
Closure operations are implicit in the perturbation. Perturbation can be carried out using compositions closed to different constants and no previous closure to the same constant is required. For example, let $\mathbf{x} = [x_1, x_2, \ldots, x_D]$ and $\mathbf{y} = [y_1, y_2, \ldots, y_D]$ be compositions closed to $K_x$ and $K_y$, and compute their perturbation $\mathbf{z} = \mathbf{x} \oplus \mathbf{y}$ closed to $K$. The result is obtained as

$$\mathbf{z} = [z_1, z_2, \ldots, z_D] = \frac{K}{\sum_{i=1}^{D} x_i y_i}\,[x_1 y_1, x_2 y_2, \ldots, x_D y_D] = \mathcal{C}[x_1 y_1, x_2 y_2, \ldots, x_D y_D], \qquad (5)$$

where the closure constants, $K_x$ and $K_y$, do not play any role.

Perturbation of a composition $\mathbf{x}$ with the opposite of a composition $\mathbf{y}$ is denoted by the symbol $\ominus$: $\mathbf{x} \oplus (\ominus\mathbf{y}) = \mathbf{x} \ominus \mathbf{y}$. This operation is equivalent to subtraction in real vector spaces and is called perturbation-difference.

Powering is an external operation by a real number or scalar. It plays in $\mathcal{S}^D$ the same role as multiplication by scalars in real space. Let $c$ be a scalar and $\mathbf{x}$ a composition in $\mathcal{S}^D$. Powering by $c$ is defined as

$$c \odot \mathbf{x} = \mathcal{C}[x_1^c, x_2^c, \ldots, x_D^c]. \qquad (6)$$

The main properties of powering by scalars are:
1. operation in $\mathcal{S}^D$: $c \odot \mathbf{x} \in \mathcal{S}^D$;
2. unitary element: the real number 1 satisfies $1 \odot \mathbf{x} = \mathbf{x}$;
3. distributive: $c \odot (\mathbf{x} \oplus \mathbf{y}) = (c \odot \mathbf{x}) \oplus (c \odot \mathbf{y})$.
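A short numerical check of powering and of its properties may help. As before, this is a sketch with hypothetical helper names (`closure`, `perturb`, `power`) and compositions stored as Python lists; it is not taken from the chapter.

```python
def closure(x, k=1.0):
    """Rescale a positive vector so that its parts add up to k."""
    total = sum(x)
    return [k * xi / total for xi in x]


def perturb(x, y, k=1.0):
    """Perturbation: closed component-wise product."""
    return closure([xi * yi for xi, yi in zip(x, y)], k)


def power(c, x, k=1.0):
    """Powering (equation 6): raise every part to the scalar c, then close."""
    return closure([xi ** c for xi in x], k)


x = [0.1, 0.5, 0.4]
y = [0.2, 0.3, 0.5]
c = 2.5

print(power(1, x))                             # x again, up to rounding
print(power(c, perturb(x, y)))                 # c (.) (x (+) y)
print(perturb(power(c, x), power(c, y)))       # (c (.) x) (+) (c (.) y), same values
print(power(-1, x))                            # closed reciprocals of the parts of x
```

The last line shows that powering by -1 returns the closed reciprocals, i.e. the opposite element of perturbation.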
Powering by -1 can be used to define the opposite element: $(-1) \odot \mathbf{x} = \ominus\mathbf{x}$. Perturbation and powering satisfy the properties required to give a vector space structure to the simplex $\mathcal{S}^D$.

A typical way to express a composition is

$$\mathbf{x} = \mathcal{C}\left[\exp[\xi_1, \xi_2, \ldots, \xi_D]\right]. \qquad (7)$$

The key point in expression (7) is that adding a number to the terms within the exponential function leaves the composition invariant due to the closure operator. A consequence is that coefficients $\xi_i$ such that $\sum_{i=1}^{D} \xi_i = 0$ can always be found. In this case $\xi_i = \ln(x_i/g(\mathbf{x}))$, where $g(\cdot)$ denotes the geometric mean of the arguments. This is related directly to the clr (centred log-ratio) transform that is discussed in the last section and, therefore, the $\xi_i$, $i = 1, 2, \ldots, D$, are called clr coefficients of $\mathbf{x}$.

Perturbation and powering take a particularly simple form using the clr coefficients: perturbation is equivalent to addition of clr coefficients, and powering by a number $c$ is equivalent to multiplying the vector of clr coefficients by $c$. Let $\boldsymbol{\xi} = [\xi_1, \xi_2, \ldots, \xi_D]$ and $\boldsymbol{\eta} = [\eta_1, \eta_2, \ldots, \eta_D]$ be the vectors of clr coefficients corresponding to the compositions $\mathbf{x}$ and $\mathbf{y}$. Then,

$$\boldsymbol{\xi} + \boldsymbol{\eta} = [\xi_1 + \eta_1, \xi_2 + \eta_2, \ldots, \xi_D + \eta_D] \qquad (8)$$

are the clr coefficients of $\mathbf{z} = \mathbf{x} \oplus \mathbf{y}$. Similarly, if $\mathbf{z} = c \odot \mathbf{x}$, then its clr coefficients are

$$c \cdot \boldsymbol{\xi} = [c \cdot \xi_1, c \cdot \xi_2, \ldots, c \cdot \xi_D]. \qquad (9)$$

This shows that, when representing compositions by their clr coefficients, simplicial operations, i.e. perturbation and powering, are reduced to ordinary sum and product by scalars for real vectors.
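The correspondence between simplicial operations and ordinary operations on clr coefficients can be verified numerically. The following sketch assumes list-based compositions and uses the hypothetical names `clr` and `clr_inv`; it is an illustration of equations (7) to (9), not an implementation from the text.

```python
import math


def closure(x, k=1.0):
    """Rescale a positive vector so that its parts add up to k."""
    total = sum(x)
    return [k * xi / total for xi in x]


def clr(x):
    """clr coefficients ln(x_i / g(x)), with g the geometric mean."""
    logs = [math.log(xi) for xi in x]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def clr_inv(coef, k=1.0):
    """Expression (7): close the exponentials of the coefficients."""
    return closure([math.exp(v) for v in coef], k)


x = [0.1, 0.5, 0.4]
y = [0.2, 0.3, 0.5]
c = -1.5

x_plus_y = closure([a * b for a, b in zip(x, y)])    # perturbation
c_times_x = closure([a ** c for a in x])             # powering

print(sum(clr(x)))                                   # zero, up to rounding
print(clr(x_plus_y), [a + b for a, b in zip(clr(x), clr(y))])
print(clr(c_times_x), [c * a for a in clr(x)])
print(clr_inv(clr(x)))                               # x again, up to rounding
```

Each printed pair agrees up to floating-point rounding, which is the content of equations (8) and (9).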
Perturbation-dependence, bases and coordinates
The vector space structure of $\mathcal{S}^D$ allows one to use the concepts of linear dependence and independence. A set of $m$ compositions in $\mathcal{S}^D$, $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m$, and $m$ scalars, $\alpha_1, \alpha_2, \ldots, \alpha_m$, is combined (linearly) in a perturbation-linear-combination as

$$\mathbf{x}_\alpha = (\alpha_1 \odot \mathbf{x}_1) \oplus (\alpha_2 \odot \mathbf{x}_2) \oplus \cdots \oplus (\alpha_m \odot \mathbf{x}_m) = \bigoplus_{i=1}^{m} (\alpha_i \odot \mathbf{x}_i), \qquad (10)$$

which is a perturbation-powering version of the traditional linear combination in real vector spaces. The symbol $\bigoplus$ represents repeated perturbation on the subscripts. The set $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m$ is called perturbation-dependent if the perturbation-linear-combination (10) is equal to the neutral element, $\mathbf{x}_\alpha = \mathbf{n}$, for some of the $\alpha_i$ being non-null. Alternatively, if $\mathbf{x}_\alpha = \mathbf{n}$ implies that all the $\alpha_i$ are null, then the set $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m$ is said to be perturbation-independent.

In $\mathcal{S}^D$, the maximum number of perturbation-independent compositions is $D - 1$. Then, $\mathcal{S}^D$ is a vector space of dimension $D - 1$. If $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_{D-1}$ are perturbation-independent, they constitute a basis of $\mathcal{S}^D$. This means that each composition $\mathbf{x} \in \mathcal{S}^D$ can be expressed as a perturbation-linear-combination

$$\mathbf{x} = (\alpha_1 \odot \mathbf{e}_1) \oplus (\alpha_2 \odot \mathbf{e}_2) \oplus \cdots \oplus (\alpha_{D-1} \odot \mathbf{e}_{D-1}) = \bigoplus_{i=1}^{D-1} (\alpha_i \odot \mathbf{e}_i), \qquad (11)$$

for some coefficients $\alpha_i$ that are then termed coordinates with respect to the basis. Given a basis, the coordinates of $\mathbf{x}$ are determined uniquely. This allows the representation of a composition by its coordinates with respect to a given basis.

Perturbation and powering can be expressed easily in terms of coordinates. In fact, let $\mathbf{e}_i$, $i = 1, 2, \ldots, D - 1$, be a basis of $\mathcal{S}^D$ and $\alpha_i$, $\beta_i$, $i = 1, 2, \ldots, D - 1$, the coordinates of $\mathbf{x}$ and $\mathbf{y}$, respectively, i.e.

$$\mathbf{x} = \bigoplus_{i=1}^{D-1} (\alpha_i \odot \mathbf{e}_i), \qquad \mathbf{y} = \bigoplus_{i=1}^{D-1} (\beta_i \odot \mathbf{e}_i). \qquad (12)$$

Perturbation and powering are now

$$\mathbf{x} \oplus \mathbf{y} = \bigoplus_{i=1}^{D-1} [(\alpha_i + \beta_i) \odot \mathbf{e}_i], \qquad c \odot \mathbf{x} = \bigoplus_{i=1}^{D-1} [(c \cdot \alpha_i) \odot \mathbf{e}_i], \qquad (13)$$

and the coordinates of the perturbation are the ordinary sum of the vectors of coordinates, and powering is reduced to multiplying coordinates by the scalar $c$. Compositional operations are reduced to ordinary vector operations when representing compositions by their coordinates.

Expression (11) can be compared to equation (2). In the latter, a composition was expressed using a mixture of D non-compositional vectors and ordinary sums and products of the parts. The expression of a composition using its coordinates, given in equation (11), requires a basis and only uses compositional operations that can be expressed in terms of coordinates. Again, this confirms that perturbation and powering are natural operations for working with compositions.
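Expressions (11) to (13) can be made concrete with a specific basis. The sketch below assumes, purely for illustration, the basis whose i-th element has Euler's number in position i and 1 elsewhere (closed); this basis is a choice made here, not one singled out in this section. For it, the coordinates of a composition are the log-ratios of each of the first D - 1 parts to the last part. All function names are hypothetical.

```python
import math


def closure(x, k=1.0):
    total = sum(x)
    return [k * xi / total for xi in x]


def perturb(x, y):
    return closure([xi * yi for xi, yi in zip(x, y)])


def power(c, x):
    return closure([xi ** c for xi in x])


def basis(d):
    """Illustrative basis of the simplex of d parts (an assumption of this sketch)."""
    return [closure([math.e if j == i else 1.0 for j in range(d)])
            for i in range(d - 1)]


def coordinates(x):
    """Coordinates with respect to `basis`: ln(x_i / x_D), i = 1, ..., D - 1."""
    return [math.log(xi / x[-1]) for xi in x[:-1]]


def from_coordinates(alpha):
    """Expression (11): repeated perturbation of the powered basis elements."""
    d = len(alpha) + 1
    result = closure([1.0] * d)            # start from the neutral element
    for a, e in zip(alpha, basis(d)):
        result = perturb(result, power(a, e))
    return result


x = [0.1, 0.5, 0.4]
y = [0.2, 0.3, 0.5]
print(coordinates(x))
print(from_coordinates(coordinates(x)))    # x again, up to rounding
# Coordinates of x (+) y versus the ordinary sum of coordinates (equation 13).
print(coordinates(perturb(x, y)),
      [a + b for a, b in zip(coordinates(x), coordinates(y))])
```

A three-part composition is thus described by two real coordinates, in agreement with the dimension D - 1 of the space.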
Straight lines
The analogy between real vector spaces and the simplex leads to a definition of compositional straight lines, or compositional lines, for short. The characteristics of these lines are important for understanding the meaning of perturbation and powering. A compositional line in $\mathcal{S}^D$, containing a composition or starting point $\mathbf{x}_0$ and with direction given by the composition $\mathbf{v}$, is defined as the compositions $\mathbf{x}(t)$ satisfying

$$\mathbf{x}(t) = \mathbf{x}_0 \oplus (t \odot \mathbf{v}) \qquad (14)$$

for any real value of the parameter $t$. An alternative expression is obtained whenever two compositions contained in the line are given, e.g. $\mathbf{x}_0$ and $\mathbf{x}_1$. In this case, the direction can be specified as $\mathbf{v} = \mathbf{x}_1 \ominus \mathbf{x}_0$, which can be substituted in equation (14). Parallel lines are those defined by the same direction $\mathbf{v}$ and different starting points $\mathbf{x}_0$ and $\mathbf{x}_0'$.

Figure 2 shows four compositional lines. The first two lines ($r_1$, $r_2$) have the same direction $\mathbf{v}_1 = [0.36, 0.13, 0.51]$ and are thus parallel. The other two lines ($r_3$, $r_4$) are also parallel with direction $\mathbf{v}_2 = [0.50, 0.37, 0.13]$. Two lines ($r_1$, $r_3$) intersect at the neutral composition (barycentre of the ternary diagram), $\mathbf{n} = [1/3, 1/3, 1/3]$. The other two lines ($r_2$, $r_4$) intersect at [0.62, 0.22, 0.16]. A first observation in Figure 2 is that compositional straight and parallel lines do not appear in the ternary diagram as straight and parallel lines in the ordinary sense. Some compositional lines can appear as straight lines in the ternary diagram, as illustrated in Figure 3. Note that all represented lines go to some vertex of the diagram, although not always both ends. Compositional lines may tend to a point on the border of the ternary diagram, as the border plays the role of infinity in real spaces (Fig. 3).

Fig. 2. Two pairs of parallel compositional lines in the ternary diagram, two of them crossing at n, the barycentre of the ternary diagram.

Fig. 3. The orthonormal basis (26) represented in a ternary diagram.

A key point for understanding what compositional lines represent is to identify compositional processes that evolve in time or space following a compositional line. There are many such examples. A typical one in geology is the evolution of mass of different radioactive isotopes in a sample. Assume that a sample contains D different radioactive isotopes with disintegration rates $\lambda_i$ (time-unit)$^{-1}$, $i = 1, 2, \ldots, D$. When disintegrating, these isotopes do not produce another isotope in the set. Observation of the sample starts at time $t = 0$, the initial masses of the isotopes being $x_i(0)$ (unit of mass), $i = 1, 2, \ldots, D$. The exponential decay of the remaining mass of each isotope at time $t$ is known to be

$$x_i(t) = x_i(0) \cdot \exp(-\lambda_i t), \quad i = 1, 2, \ldots, D, \qquad (15)$$

where the $\lambda_i$ are assumed positive in this particular case. Frequently, the relevant information is the relative abundance of each isotope. To study the isotope composition in the sample, a compositional vector can be defined as $\mathbf{x}(t) = \mathcal{C}[x_1(t), x_2(t), \ldots, x_D(t)]$. The evolution in time (forward and backward) of $\mathbf{x}(t)$ is identified readily as

$$\mathbf{x}(t) = \mathbf{x}(0) \oplus (t \odot \exp(-\boldsymbol{\lambda})), \quad \boldsymbol{\lambda} = [\lambda_1, \lambda_2, \ldots, \lambda_D], \qquad (16)$$

which is a compositional line with direction $\mathbf{v} = \exp(-\boldsymbol{\lambda})$ and starting point $\mathbf{x}(0)$. Note that closure of $\mathbf{v}$ and $\mathbf{x}(0)$ is irrelevant. A conclusion is that all exponential mass-growth or mass-decay processes follow compositional lines when considered from the compositional point of view (Egozcue et al. 2003; Egozcue & Pawlowsky-Glahn 2005).

Figures 4 and 5 show two compositional lines corresponding to an exponential mass decay with $D = 3$, $\boldsymbol{\lambda} = [1, 1/2, 1/4]$. The initial or starting points are $\mathbf{x}(0) = \mathcal{C}[1, 1, 1] = \mathbf{n}$ for the first one (plain line), and $\mathbf{x}(0) = \mathcal{C}[5, 5, 1]$ for the second one (plus marker). Figure 4 shows the evolution in $t$ of the cumulated parts; Figure 5 represents the lines in the ternary diagram. Since the disintegration rate of isotope 3 ($\lambda_3$) is the smallest one, both processes tend to the vertex associated with the pure isotope 3, despite initial conditions.

Fig. 4. Compositional evolution of two compositional processes of an exponential decay of mass (plain line and plus marker). Rates of decay are equal but initial compositions differ (see text). Plot of cumulated parts.
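The identification of exponential decay with a compositional line can be checked numerically for the example of Figures 4 and 5. The sketch below is an illustration only: it closes the decayed masses of equation (15) directly and compares them with the line of equation (16). The helper names are hypothetical; the rates and starting points are those quoted in the text.

```python
import math


def closure(x, k=1.0):
    total = sum(x)
    return [k * xi / total for xi in x]


def perturb(x, y):
    return closure([xi * yi for xi, yi in zip(x, y)])


def power(c, x):
    return closure([xi ** c for xi in x])


RATES = [1.0, 0.5, 0.25]   # disintegration rates lambda_1, lambda_2, lambda_3


def decayed_composition(x0, t):
    """Close the remaining masses x_i(0) * exp(-lambda_i * t) of equation (15)."""
    return closure([m * math.exp(-lam * t) for m, lam in zip(x0, RATES)])


def line_point(x0, t):
    """Point of the compositional line (16): x(0) (+) (t (.) exp(-lambda))."""
    v = [math.exp(-lam) for lam in RATES]
    return perturb(closure(x0), power(t, v))


for x0 in ([1, 1, 1], [5, 5, 1]):
    for t in (-2, 0, 1, 10, 100):
        print(x0, t, decayed_composition(x0, t), line_point(x0, t))
```

For every starting point and time the two vectors agree up to rounding, and for large t the third part approaches 1 in both processes, as described for Figure 5.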
Fig. 5. Same processes of exponential decay of mass as in Figure 4, represented as compositional lines in a ternary diagram.

Distance and other metric concepts

Aitchison (1983, 1986) introduced a simplicial distance suitable for the analysis of compositional data. If $\mathbf{x} = [x_1, x_2, \ldots, x_D]$ and $\mathbf{y} = [y_1, y_2, \ldots, y_D]$ are compositions in $S^D$, the squared Aitchison distance between them is

$$d_a^2(\mathbf{x}, \mathbf{y}) = \frac{1}{D} \sum_{i=1}^{D-1} \sum_{j=i+1}^{D} \left( \ln\frac{x_i}{x_j} - \ln\frac{y_i}{y_j} \right)^2 = \sum_{i=1}^{D} \left( \ln\frac{x_i}{g(\mathbf{x})} - \ln\frac{y_i}{g(\mathbf{y})} \right)^2, \tag{17}$$
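where $g(\mathbf{x}) = [x_1 \cdot x_2 \cdots x_D]^{1/D}$ denotes the geometric mean of the parts of the compositional vector in the argument. The main properties of such a distance are:

1. it does not depend on the closure constant;
2. it is invariant under perturbation: if $\mathbf{p} \in S^D$,
$$d_a(\mathbf{x} \oplus \mathbf{p}, \mathbf{y} \oplus \mathbf{p}) = d_a(\mathbf{x}, \mathbf{y}); \tag{18}$$
3. it is scaled by powering: if $c$ is a real number, then
$$d_a(c \odot \mathbf{x}, c \odot \mathbf{y}) = |c| \cdot d_a(\mathbf{x}, \mathbf{y}); \tag{19}$$
4. it is invariant under permutation of the parts;
5. it guarantees the so-called subcompositional dominance: let $R$ be a set of $r$ subscripts, $1 < r < D$; then
$$d_a(\mathrm{sub}(\mathbf{x}; R), \mathrm{sub}(\mathbf{y}; R)) \le d_a(\mathbf{x}, \mathbf{y}). \tag{20}$$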
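The two forms of equation (17) and the properties in equations (18) to (20) lend themselves to a quick numerical check. The sketch below is only an illustration in plain Python: the function names and the test compositions `x`, `y`, `p` are assumptions chosen for the example, and subscripts are zero-based.

```python
import math

def closure(parts):
    total = sum(parts)
    return [p / total for p in parts]

def clr(x):
    logs = [math.log(p) for p in x]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def distance(x, y):
    """Second form of (17): Euclidean distance between clr coefficients."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))

def distance_pairwise(x, y):
    """First form of (17): pairwise log-ratios over i < j, scaled by 1/D."""
    d = len(x)
    total = 0.0
    for i in range(d - 1):
        for j in range(i + 1, d):
            total += (math.log(x[i] / x[j]) - math.log(y[i] / y[j])) ** 2
    return math.sqrt(total / d)

def perturb(x, p):
    return closure([a * b for a, b in zip(x, p)])

def power(c, x):
    return closure([a ** c for a in x])

def sub(x, subscripts):
    return closure([x[i] for i in subscripts])

x = closure([10.0, 30.0, 40.0, 20.0])
y = closure([25.0, 25.0, 10.0, 40.0])
p = closure([2.0, 1.0, 0.5, 4.0])
c = -2.5
R = [0, 2]                       # zero-based subscripts of the selected parts

d_xy = distance(x, y)
print(d_xy, distance_pairwise(x, y))                      # two forms of (17)
print(distance([100 * v for v in x], y))                  # property 1
print(distance(perturb(x, p), perturb(y, p)))             # property 2, (18)
print(distance(power(c, x), power(c, y)), abs(c) * d_xy)  # property 3, (19)
print(distance(sub(x, R), sub(y, R)), "<=", d_xy)         # property 5, (20)
```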
All these properties were important for defining the Aitchison distance. However, the Aitchison distance can be derived from an inner product defined in the simplex (Barceló-Vidal et al. 2001; Billheimer et al. 2001; Pawlowsky-Glahn & Egozcue 2001; Egozcue et al. 2003). The inner product that is compatible with perturbation and powering gives the structure of a Euclidean space to the simplex or, equivalently, of a finite dimensional Hilbert space. The inner product of two compositions $\mathbf{x}$, $\mathbf{y}$ in $S^D$ can be defined as

$$\langle \mathbf{x}, \mathbf{y} \rangle_a = \sum_{i=1}^{D} (\ln x_i \cdot \ln y_i) - \frac{1}{D} \left( \sum_{j=1}^{D} \ln x_j \right) \cdot \left( \sum_{k=1}^{D} \ln y_k \right). \tag{21}$$

This definition, except for a factor of $D$, resembles a covariance where the parts play the role of samples. There are alternative equivalent expressions, for example,

$$\langle \mathbf{x}, \mathbf{y} \rangle_a = \frac{1}{D} \sum_{i=1}^{D-1} \sum_{j=i+1}^{D} \ln\frac{x_i}{x_j} \cdot \ln\frac{y_i}{y_j} = \sum_{i=1}^{D} \ln\frac{x_i}{g(\mathbf{x})} \cdot \ln\frac{y_i}{g(\mathbf{y})}, \tag{22}$$
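where $g(\cdot)$ is again the geometric mean of the components of the vector in the argument. The inner product, defined in equations (21) and (22), satisfies the standard properties:

1. positiveness, $\langle \mathbf{x}, \mathbf{x} \rangle_a > 0$ if $\mathbf{x} \neq \mathbf{n}$;
2. commutativity, $\langle \mathbf{x}, \mathbf{y} \rangle_a = \langle \mathbf{y}, \mathbf{x} \rangle_a$;
3. distributive with respect to perturbation, $\langle \mathbf{x} \oplus \mathbf{z}, \mathbf{y} \rangle_a = \langle \mathbf{x}, \mathbf{y} \rangle_a + \langle \mathbf{z}, \mathbf{y} \rangle_a$;
4. linear with respect to multiplication by scalars, $\langle c \odot \mathbf{x}, \mathbf{y} \rangle_a = c \cdot \langle \mathbf{x}, \mathbf{y} \rangle_a$.

The inner product allows one to define the Aitchison norm of a composition and to re-define the Aitchison distance between compositions,

$$\|\mathbf{x}\|_a = \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle_a}, \quad d_a(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} \ominus \mathbf{y}\|_a. \tag{23}$$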
These relationships of the norm and the distance with the inner product guarantee the shifting and scaling properties given in equations (18) and (19). Also, the triangular inequality holds from (23): for three compositions $\mathbf{x}$, $\mathbf{y}$, $\mathbf{z}$ in $S^D$,

$$d_a(\mathbf{x}, \mathbf{y}) \le d_a(\mathbf{x}, \mathbf{z}) + d_a(\mathbf{y}, \mathbf{z}). \tag{24}$$
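As a hedged illustration, the equivalence of the expressions (21) and (22) and the relations (23) and (24) can be verified on arbitrary compositions. The function names and the three test compositions below are assumptions made for this sketch only.

```python
import math

def clr(x):
    logs = [math.log(p) for p in x]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def inner_from_logs(x, y):
    """Equation (21): sum ln x_i ln y_i minus (1/D)(sum ln x_j)(sum ln y_k)."""
    lx = [math.log(p) for p in x]
    ly = [math.log(p) for p in y]
    return sum(a * b for a, b in zip(lx, ly)) - sum(lx) * sum(ly) / len(x)

def inner_from_clr(x, y):
    """Second form of (22): ordinary dot product of clr coefficients."""
    return sum(a * b for a, b in zip(clr(x), clr(y)))

def norm(x):
    return math.sqrt(inner_from_clr(x, x))

def distance(x, y):
    """Equation (23): norm of the perturbation difference of x and y."""
    # Closure is omitted because it does not change the clr coefficients.
    return norm([a / b for a, b in zip(x, y)])

x = [0.10, 0.30, 0.40, 0.20]
y = [0.25, 0.25, 0.10, 0.40]
z = [0.05, 0.15, 0.60, 0.20]
x_plus_z = [a * b for a, b in zip(x, z)]   # unclosed perturbation of x and z

print(inner_from_logs(x, y), inner_from_clr(x, y))
print(inner_from_clr(x_plus_z, y), inner_from_clr(x, y) + inner_from_clr(z, y))
print(distance(x, y), distance(x, z) + distance(y, z))
```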
A Euclidean space is a vector space in which an inner product is defined satisfying the above-mentioned properties (1-4). Therefore, the simplex $S^D$ has the structure of a Euclidean space. From a mathematical perspective this means that $S^D$ is completely equivalent to $\mathbb{R}^{D-1}$. In order to use this fact in practice, compositions have to be represented by coordinates that are, in fact, real vectors, the values of which are not constrained to be positive or less than one.
Representation of compositions by orthogonal coordinates

In the previous section, metric concepts (e.g. distance, inner product) in the simplex $S^D$ were introduced. However, the direct use of those definitions may be difficult for an applied scientist dealing with compositional data. This section has two main goals: (1) to introduce orthonormal coordinates in the simplex; and (2) to give some ideas to help interpretation. Goal (1) has a practical consequence. All operations and concepts in the second and third sections will be translated easily into ordinary vector operations, distance and inner product in real spaces, which are normally well understood by scientists. For example, perturbation will be translated into a sum of real vectors; a straight line in the simplex will be equivalent to a straight line in real space; the Aitchison distance in the simplex will be equivalent to the ordinary Euclidean one. Goal (2) is important because analysing compositional data requires two different kinds of interpretation: one based on the expression of a composition by its parts, which is customary; and one based on the coordinates which, although very powerful, is less common.

Orthonormal basis
As mentioned in the second section, $S^D$ is a vector space of dimension $D - 1$. Thus, $D - 1$ perturbation-independent vectors in $S^D$ constitute a basis. If these vectors are unitary and mutually orthogonal they form an orthonormal basis, i.e. if the compositions $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_{D-1}$ in $S^D$ satisfy

$$\|\mathbf{e}_i\|_a^2 = \langle \mathbf{e}_i, \mathbf{e}_i \rangle_a = 1, \quad \langle \mathbf{e}_i, \mathbf{e}_j \rangle_a = 0, \quad i \neq j, \quad i, j = 1, 2, \ldots, D-1, \tag{25}$$

they constitute an orthonormal basis of $S^D$. In real vector spaces, an example of an orthonormal basis is found easily. For example, in $\mathbb{R}^3$, the vectors (1, 0, 0), (0, 1, 0), (0, 0, 1) form an orthonormal basis which, because of its simplicity, is termed canonical. This is not so simple in $S^D$. An orthonormal basis in $S^3$ is

$$\mathbf{e}_1 = \mathcal{C}\left[\exp\left[\frac{1}{\sqrt{2}}, \frac{-1}{\sqrt{2}}, 0\right]\right], \quad \mathbf{e}_2 = \mathcal{C}\left[\exp\left[\frac{1}{\sqrt{6}}, \frac{1}{\sqrt{6}}, \frac{-2}{\sqrt{6}}\right]\right]. \tag{26}$$
Although expression (26) is not very complicated, it does not exhibit the simplicity of the canonical basis in a real space. Moreover, other possible bases have similar expressions. Hence, there is no reason to refer to equation (26) as a canonical basis. A practical point is to have some simple rules to identify unitary and orthogonal compositions as demanded in equation (25). The compositions in the basis (26) are expressed using their clr coefficients as defined in equation (7). General characteristics of an orthonormal basis can be seen in equation (26): the squared clr coefficients add to 1; and the ordinary inner product of the clr terms, as real vectors, is null. These properties are general for any orthonormal basis in the simplex, as they are equivalent to the conditions in equation (25). Once an orthonormal basis is given, a composition $\mathbf{x}$ in $S^D$ can be expressed as in equation (11), but using the inner product, equations (21) or (22), the expression of the coordinates becomes simplified, i.e.

$$\mathbf{x} = \bigoplus_{i=1}^{D-1} (x_i^* \odot \mathbf{e}_i), \quad x_i^* = \langle \mathbf{x}, \mathbf{e}_i \rangle_a. \tag{27}$$

Note that the coordinate $x_i^*$ is the signed size of the orthogonal projection of the composition $\mathbf{x}$ onto the direction defined by the unitary composition $\mathbf{e}_i$. In the following the asterisk superscript shall be used to denote coordinates with respect to a given basis. For example, the vector of $D - 1$ coordinates of a composition $\mathbf{x}$ in $S^D$ is denoted by $\mathbf{x}^*$; it is a vector of $\mathbb{R}^{D-1}$. As already noted, all operations and metric relationships are translated into coordinates as ordinary vector operations. For example,

$$(\mathbf{x} \oplus \mathbf{y})^* = \mathbf{x}^* + \mathbf{y}^*, \quad (c \odot \mathbf{x})^* = c \cdot \mathbf{x}^*, \quad d_a(\mathbf{x}, \mathbf{y}) = d(\mathbf{x}^*, \mathbf{y}^*), \tag{28}$$

where $d(\cdot, \cdot)$ is the ordinary distance in real vector spaces given by

$$d(\mathbf{x}^*, \mathbf{y}^*) = \sqrt{\sum_{i=1}^{D-1} (x_i^* - y_i^*)^2}. \tag{29}$$
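To make equations (25) to (29) concrete, the sketch below checks that the basis (26) is orthonormal, computes coordinates as in (27), rebuilds the composition from them, and compares distances as in (28). It is an illustrative example only; the names `CLR_BASIS`, `coordinates`, `from_coordinates` and the test compositions are assumptions.

```python
import math

def closure(parts):
    total = sum(parts)
    return [p / total for p in parts]

def clr(x):
    logs = [math.log(p) for p in x]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def inner(x, y):
    return sum(a * b for a, b in zip(clr(x), clr(y)))

S2, S6 = math.sqrt(2.0), math.sqrt(6.0)
# clr coefficients of e_1 and e_2 in equation (26)
CLR_BASIS = [[1 / S2, -1 / S2, 0.0], [1 / S6, 1 / S6, -2 / S6]]
BASIS = [closure([math.exp(a) for a in row]) for row in CLR_BASIS]

def coordinates(x):
    """Coordinates x_i* = <x, e_i>_a of equation (27)."""
    return [inner(x, e) for e in BASIS]

def from_coordinates(coords):
    """Rebuild x as the perturbation of the powered basis elements, (27)."""
    logs = [sum(c * row[k] for c, row in zip(coords, CLR_BASIS)) for k in range(3)]
    return closure([math.exp(v) for v in logs])

x = closure([0.7, 0.2, 0.1])
y = closure([0.2, 0.5, 0.3])
x_star, y_star = coordinates(x), coordinates(y)

d_aitchison = math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))
d_euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_star, y_star)))
print(x_star, from_coordinates(x_star))
print(d_aitchison, d_euclidean)
```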
Similar results hold for the Aitchison norm and inner product, i.e. all compositional operations are reduced to ordinary vector operations when compositions are represented by their coordinates. This is convenient for working with compositional data, as all known techniques designed for real data hold for their coordinates.

Some special orthonormal bases are associated with sequential binary partitions of a compositional vector (Egozcue & Pawlowsky-Glahn 2005). This is a practical way of defining an orthonormal basis and coordinates. The idea is to divide the set of parts in a composition into two groups of parts. The two groups so obtained are divided again into two groups, and so on until all groups encompass only one part. The number of divisions necessary to obtain a sequential binary partition is $D - 1$ and they are directly associated with the $D - 1$ vectors of an orthonormal basis of $S^D$. If the process of grouping of parts is based on affinity of the parts, e.g. minor elements and major elements; alkaline and non-alkaline; contaminant and non-contaminant; associated with metamorphic processes and associated with sedimentary processes, etc., the generated coordinates can be interpreted easily. There are a number of ways of defining such a sequential binary partition. Table 1 shows one of them. At each order of the partition a group in the previous level is subdivided into two subgroups: those parts with label +1 and the other ones with label -1. Label 0 indicates that this part is not involved in the partition at this order. The $i$-th element of the orthonormal basis associated with a sequential binary partition has the expression $\mathbf{e}_i = \mathcal{C}[\exp[a_{i1}, a_{i2}, \ldots, a_{iD}]]$, where the clr coefficients $a_{ij}$ take different values depending on the code of the sequential binary partition. Assume that at order $i$ a group of $r + s$ parts is divided into two groups of $r$ (positive code) and $s$ parts (negative code), respectively. For example, in Table 1, for order $i = 2$, $r = 1$ and $s = 2$. The values of the $a_{ij}$ are then

$$a_+ = +\frac{1}{r}\sqrt{\frac{rs}{r+s}}, \quad a_- = -\frac{1}{s}\sqrt{\frac{rs}{r+s}}, \quad a_0 = 0, \tag{30}$$

where the subscripts correspond to positive, negative and null codes. In the example of Table 1, $a_{22} = -1/\sqrt{6}$. The orthonormal basis in $S^3$ given in equation (26) corresponds to the sequential binary partition coded in Table 2. The expression of the coordinates of a composition $\mathbf{x}$ with respect to an orthonormal basis defined by a sequential binary partition is simple:

$$x_i^* = \ln \frac{\left(\prod_+ x_j\right)^{a_+}}{\left(\prod_- x_k\right)^{-a_-}}, \tag{31}$$

where the products $\prod_+$ and $\prod_-$ are extended to parts coded with +1 and -1, respectively, in the $i$-th order partition; since $a_-$ is negative, the exponent $-a_-$ in the denominator is positive.

Table 1. Code defining a sequential binary partition of a six-part composition

Order   x1   x2   x3   x4   x5   x6
1       +1   +1   -1   -1   +1   -1
2       +1   -1    0    0   -1    0
3        0   +1    0    0   -1    0
4        0    0   -1   +1    0   -1
5        0    0   +1    0    0   -1

At first-order partition the group {1, 2, 5} is separated from {3, 4, 6}; second-order partition divides {1, 2, 5} into {1} and {2, 5}; etc. See text for more details.
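The construction of a basis from a sign code can be sketched as follows, assuming the code of Table 1 and the coefficients of equation (30). The function names (`clr_coefficients`, `balances`) and the six-part composition are illustrative assumptions; parts are indexed from zero in the code.

```python
import math

# Sign code of Table 1: one row per order, one column per part x1..x6.
TABLE_1 = [
    [+1, +1, -1, -1, +1, -1],
    [+1, -1,  0,  0, -1,  0],
    [ 0, +1,  0,  0, -1,  0],
    [ 0,  0, -1, +1,  0, -1],
    [ 0,  0, +1,  0,  0, -1],
]

def clr_coefficients(code):
    """clr coefficients a_+, a_-, a_0 of equation (30) for one order."""
    r, s = code.count(+1), code.count(-1)
    scale = math.sqrt(r * s / (r + s))
    a_plus, a_minus = scale / r, -scale / s
    return [a_plus if c > 0 else a_minus if c < 0 else 0.0 for c in code]

def balances(x, sbp):
    """Coordinates of x: sum of a_ij * ln x_j, equivalent to equation (31)."""
    logs = [math.log(p) for p in x]
    return [sum(a * l for a, l in zip(clr_coefficients(code), logs)) for code in sbp]

A = [clr_coefficients(code) for code in TABLE_1]
gram = [[sum(u * v for u, v in zip(A[i], A[j])) for j in range(5)] for i in range(5)]

x = [12.0, 7.0, 30.0, 5.0, 40.0, 6.0]    # illustrative composition (assumption)
coords = balances(x, TABLE_1)
print(A[1][1], -1 / math.sqrt(6.0))      # a_22 quoted in the text
print(coords)
```

The Gram matrix of the coefficient rows is the identity, which is the condition (25) written in terms of clr coefficients.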
Projection onto a subcomposition

As stated in equation (3), a subcomposition is a composition made up of selected parts from the original one (Aitchison 1986). The process to obtain a subcomposition is simple: start with compositions in $S^D$; select $r$ parts, described by the set of their subscripts, $R$; the $R$-subcomposition is obtained by applying the closure corresponding to $S^r$. In general, this causes a reduction of dimension, from $D$ parts to $r$ parts. The subscripts of parts not included in $R$ are denoted by $\bar{R}$. The main point when reducing a composition to an $R$-subcomposition is that the latter contains only the information of the ratios between parts whose subscripts are in $R$; ratios between parts whose subscripts are either both in $\bar{R}$ or one in $R$ and one in $\bar{R}$ are ignored. Given a composition $\mathbf{x}$ in $S^D$, an important question is to find another composition $\mathbf{x}_R$ in $S^D$ which contains only the information of the $R$-subcomposition and such that the Aitchison distance $d_a(\mathbf{x}, \mathbf{x}_R)$ is a minimum. A simple computation leads to

$$\mathbf{x}_R = \mathcal{C}[x_1, x_2, \ldots, x_r, a, a, \ldots, a], \quad a = \left(\prod_{i=1}^{r} x_i\right)^{1/r}, \tag{32}$$

where it is assumed that $R = \{1, 2, \ldots, r\}$ for simplicity. A remarkable fact is that $\|\mathbf{x}_R\|_a = \|\mathrm{sub}(\mathbf{x}; R)\|_a$, where the norms are taken in different simplices, namely $S^D$ and $S^r$, respectively. This means that the $R$-subcomposition in $S^r$ can be represented by $\mathbf{x}_R$, which is still in $S^D$, and therefore $\mathbf{x}_R$ is a composition associated with the $R$-subcomposition. Figure 6 shows this double representation of subcompositions in a ternary diagram.

Table 2. Sequential binary partition code for the orthonormal basis in equation (26)

Order   x1   x2   x3
1       +1   +1   -1
2       +1   -1    0

Note that subscripts in equation (26) are in reverse order.

Fig. 6. Composition $\mathbf{x}$ in $S^3$ (circle) may be reduced to a subcomposition $\mathrm{sub}(\mathbf{x}; \{1, 2\})$ in $S^2$ (square) or to a projection $\mathbf{x}_R$ onto the second orthogonal axis (triangle).

A three-part
composition $\mathbf{x}$ is represented by a point (circle). The segment from vertex 3, containing $\mathbf{x}$, to the side 1-2 corresponds to compositions for which the ratio $x_1/x_2$ is constant. The subcomposition $\mathrm{sub}(\mathbf{x}; \{1, 2\})$ can be represented (square) on the side 1-2. This border is then viewed as an interval equivalent to the simplex $S^2$. Also, in Figure 6 the axes of the basis (26) have been plotted. The point on the axis marked with a triangle corresponds to $\mathbf{x}_R$, which is associated with the $R$-subcomposition. From the geometric point of view, $\mathbf{x}_R$ is the orthogonal projection of $\mathbf{x}$ onto a subspace defined by the subcomposition. This operation may be performed easily using the appropriate coordinates. To find suitable coordinates, the sequential binary partition defined in $S^D$ should contain a sequential binary partition of the parts in $R$. For example, the sequential binary partition in Table 1 for $S^6$ contains a sequential binary partition of the group of parts $R = \{3, 4, 6\}$ that is carried out in the orders 4 and 5. The coordinates $x_4^*$ and $x_5^*$ are said to be associated with the $R$-subcomposition. The reason is simple: the associated coordinates are exactly those which would be obtained in $S^3$ with the sequential binary partition coded as in Table 3; the corresponding coordinates are also shown. In summary, subcompositions can be treated as orthogonal projections in the Aitchison geometry of the simplex. Following the example in Tables 1 and 3, Table 4 shows how to obtain a subcomposition in the three representations: in parts, in clr coefficients and in coordinates.

Table 3. Sequential binary partition of the group of parts R = {3, 4, 6} defined in Table 1 and subdivided at orders 4 and 5

Order   x3   x4   x6   Coordinate
4       -1   +1   -1   $x_4^* = \ln[(x_4)^{\sqrt{2/3}} / (x_3 x_6)^{1/\sqrt{6}}]$
5       +1    0   -1   $x_5^* = \ln[(x_3)^{1/\sqrt{2}} / (x_6)^{1/\sqrt{2}}]$
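The projection in equation (32) can be illustrated with a short sketch, assuming a six-part composition and R = {3, 4, 6} as in Table 3. The helper names (`project`, `subcomposition`) and the numerical values are assumptions for the example; subscripts are zero-based in the code. The checks cover the equality of norms stated in the text and the orthogonality of the residual to the projection.

```python
import math

def closure(parts):
    total = sum(parts)
    return [p / total for p in parts]

def clr(x):
    logs = [math.log(p) for p in x]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def norm(x):
    return math.sqrt(sum(v * v for v in clr(x)))

def subcomposition(x, subscripts):
    """R-subcomposition: closure of the selected parts, a point of S^r."""
    return closure([x[i] for i in subscripts])

def project(x, subscripts):
    """x_R of equation (32): parts outside R replaced by the geometric mean of R."""
    g = math.exp(sum(math.log(x[i]) for i in subscripts) / len(subscripts))
    return closure([x[i] if i in subscripts else g for i in range(len(x))])

x = closure([12.0, 7.0, 30.0, 5.0, 40.0, 6.0])
R = [2, 3, 5]                     # zero-based subscripts of parts 3, 4 and 6
x_r = project(x, R)

residual = [a / b for a, b in zip(x, x_r)]          # x minus x_R, unclosed
orthogonality = sum(a * b for a, b in zip(clr(residual), clr(x_r)))
print(norm(x_r), norm(subcomposition(x, R)))
print([round(v, 6) for v in clr(x_r)], orthogonality)
```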
Table 4. From composition $\mathbf{x} \in S^6$, an R-subcomposition, R = {3, 4, 6}, is obtained

Representation          Symbol                          Expression
Original composition
  Parts                 $\mathbf{x}$                    $[x_1, x_2, x_3, x_4, x_5, x_6]$
  clr coef.             $\mathbf{x}$                    $\mathcal{C}[\exp[\xi_1, \xi_2, \xi_3, \xi_4, \xi_5, \xi_6]]$
  Coordinates           $\mathbf{x}^*$                  $[x_1^*, x_2^*, x_3^*, x_4^*, x_5^*]$
R-subcomposition
  Parts                 $\mathrm{sub}(\mathbf{x}; R)$   $\mathcal{C}[x_3, x_4, x_6]$
  clr coef.             $\mathrm{sub}(\mathbf{x}; R)$   $\mathcal{C}[\exp[\xi_3 - \bar{\xi}, \xi_4 - \bar{\xi}, \xi_6 - \bar{\xi}]]$
  clr coef.             $\mathbf{x}_R$                  $\mathcal{C}[\exp[0, 0, \xi_3 - \bar{\xi}, \xi_4 - \bar{\xi}, 0, \xi_6 - \bar{\xi}]]$
  Coordinates           $(\mathrm{sub}(\mathbf{x}; R))^*$   $[x_4^*, x_5^*]$
  Coordinates           $\mathbf{x}_R^*$                $[0, 0, 0, x_4^*, x_5^*]$

The sequential binary partition in Table 1 defines the coordinates, $x_4^*$ and $x_5^*$ being associated with the R-subcomposition. Notation: $\bar{\xi} = (\xi_3 + \xi_4 + \xi_6)/3$.

Balance between two groups of parts

Assume that some parts of a composition in $S^D$ are grouped in $Q$, which contains $q$ parts. Furthermore, assume $Q$ is partitioned into two non-overlapping groups, $R_1$ and $R_2$, with $r_1$ and $r_2$ parts, respectively, so that $q = r_1 + r_2 < D$. Ratios of parts in $Q$ convey all the compositional information of the $Q$-subcomposition. Ratios of parts in the subcompositions $R_1$ and $R_2$ retain only a part of the information in the $Q$-subcomposition. To complete the information given by the subcompositions $R_1$ and $R_2$ up to the information given by the $Q$-subcomposition, all the ratios between one part in the $R_1$-group and one part in the $R_2$-group are required. This information corresponds to the balance between the $R_1$ and the $R_2$ groups. In order to obtain a more detailed insight into balances, assume the subscripts of the parts in $Q$ are $\{1, 2, \ldots, q\}$, of the parts in $R_1$ are $\{1, 2, \ldots, r_1\}$, and of the parts in $R_2$ are $\{r_1 + 1, r_1 + 2, \ldots, q\}$, and consider a composition $\mathbf{x} \in S^D$. A relevant question is to find the composition, $\mathbf{x}_{(R_1,R_2)}$, that, being associated with the $Q$-subcomposition, is orthogonal to the compositions associated with the $R_1$- and the $R_2$-subcompositions and, moreover, is such that the distance $d_a(\mathbf{x}_{(R_1,R_2)}, \mathbf{x})$ is minimum. It can be proved that such a composition has the form

$$\mathbf{x}_{(R_1,R_2)} = \mathcal{C}\Big[\underbrace{\frac{a_1}{b}, \frac{a_1}{b}, \ldots, \frac{a_1}{b}}_{r_1\ \text{terms}}, \underbrace{\frac{a_2}{b}, \frac{a_2}{b}, \ldots, \frac{a_2}{b}}_{r_2\ \text{terms}}, \underbrace{1, 1, \ldots, 1}_{D-q\ \text{terms}}\Big], \tag{33}$$
where $a_1 = g(x_1, \ldots, x_{r_1})$, $a_2 = g(x_{r_1+1}, \ldots, x_{r_1+r_2})$, $b = g(x_1, \ldots, x_{r_1+r_2})$ or, equivalently, $b = (a_1^{r_1} \cdot a_2^{r_2})^{1/q}$, with $q = r_1 + r_2$. Note that all components in equation (33) can be multiplied by $b$ so that all terms have the form $a_1$, $a_2$ or $b$. The composition $\mathbf{x}_{(R_1,R_2)}$ is called the balance between the groups of parts $R_1$ and $R_2$. It is the orthogonal projection of $\mathbf{x}$ onto the unitary composition,

$$\mathbf{e}_{(R_1,R_2)} = \mathcal{C}\Big[\exp\Big[\underbrace{\sqrt{\tfrac{r_2}{r_1(r_1+r_2)}}, \ldots, \sqrt{\tfrac{r_2}{r_1(r_1+r_2)}}}_{r_1\ \text{terms}}, \underbrace{-\sqrt{\tfrac{r_1}{r_2(r_1+r_2)}}, \ldots, -\sqrt{\tfrac{r_1}{r_2(r_1+r_2)}}}_{r_2\ \text{terms}}, \underbrace{0, \ldots, 0}_{D-q\ \text{terms}}\Big]\Big], \tag{34}$$

the balancing element. The norm of the balance $\mathbf{x}_{(R_1,R_2)}$ is also denoted balance because it measures the length of the projection of $\mathbf{x}$ onto the balancing element.

From an intuitive point of view, a balance between two groups of parts $R_1$ and $R_2$ is obtained by carrying out the following operations: (a) substitute each part in the group $R_1$ by the geometric mean of the parts of the group; (b) substitute each part in the group $R_2$ by the geometric mean of the parts of the group; (c) substitute the parts which are not in $R_1$ or $R_2$ by the overall geometric mean of the parts in $R_1$ and $R_2$. Operations (a) and (b) remove the information within the $R_1$- and $R_2$-subcompositions, respectively; operation (c) removes all information related to the parts not in the $Q$-group. The remainder constitutes the balance between $R_1$ and $R_2$.

A given balancing element can be included in an orthonormal basis by choosing a suitable sequential binary partition. For example, under the previous assumptions for $Q$, $R_1$, $R_2$, the sequential binary partition can be coded as in Table 5, the second element of the orthonormal basis being the balancing element (34). The consequence is that the balance between the groups $R_1$ and $R_2$ is the coordinate

$$x_2^* = \ln \frac{\left(\prod_{i=1}^{r_1} x_i\right)^{\sqrt{r_2/(r_1(r_1+r_2))}}}{\left(\prod_{k=r_1+1}^{q} x_k\right)^{\sqrt{r_1/(r_2(r_1+r_2))}}}. \tag{35}$$

Table 5. A code defining a sequential binary partition such that the balancing element (34) is the element of the basis associated with the order 2 partition

Order   x_1   ...   x_{r1}   x_{r1+1}   ...   x_q   x_{q+1}   ...   x_D
1       +1    +1    +1       +1         +1    +1    -1        -1    -1
2       +1    +1    +1       -1         -1    -1     0         0     0
...     ...   ...   ...      ...        ...   ...   ...       ...   ...

The balance between groups $R_1$ and $R_2$ represents an alternative to the log-ratios of amalgamated parts introduced by Aitchison (1986, p. 267). In that approach the elements of a group are represented by the sum of the parts and the relationship between the groups is expressed by

$$\ln \frac{\sum_{i=1}^{r_1} x_i}{\sum_{k=r_1+1}^{q} x_k}, \tag{36}$$

which would replace the balance (35). However, the amalgamation of a group of parts is not a linear operation in the simplex and may cause difficulties when comparing results obtained with amalgamated parts with other results obtained using each part individually (Egozcue & Pawlowsky-Glahn 2005).

An orthonormal basis defined by a sequential binary partition generates coordinates that can be interpreted as balances between groups of parts. The coordinate corresponding to an order of partition is the balance between the groups of parts separated in that order of partition. The example in Table 1 illustrates these facts. The coordinate corresponding to the order 1 partition is the balance between groups $R_1 = \{1, 2, 5\}$ and $R_2 = \{3, 4, 6\}$. Coordinates generated by partitions of order 2 and 3 are associated with the $R_1$-subcomposition, whereas order 4 and 5 partitions generate coordinates associated with the $R_2$-subcomposition. Moreover, any coordinate represents a balance; for example, the coordinate of order 4 is the balance between the single part group {4} and group {3, 6}.

An important property of orthogonal balances is the decomposition of the squared distance between two compositions $\mathbf{x}$, $\mathbf{y}$, into a sum of squared differences of balances (by Pythagoras' theorem). This is

$$d_a^2(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{D-1} (x_i^* - y_i^*)^2. \tag{37}$$

If interest lies in a group of parts, $R$, defined in the sequential binary partition generating the coordinates, the terms associated with the $R$-subcomposition in equation (37) can be separated; they account for ratios of parts within the $R$-group. The sum of
these terms is the part of $d_a^2(\mathbf{x}, \mathbf{y})$ explained by the $R$-subcomposition. Moreover, each term in equation (37) corresponds to a balance between two groups of parts and, consequently, the decomposition (37) is induced by the chosen sequential binary partition.

An example using groups of variables follows. Suppose that two different basalt samples, A and B, have been analysed and that the compositions are represented by nine oxides as shown in Table 6. The goal is to quantify and analyse the Aitchison distance between the two compositions A and B. A first step is to compute the perturbation-difference ($\ominus$). The larger the part in the perturbation-difference, the larger the multiplicative change in that part. However, this provides only a poor interpretation from the compositional point of view, since no reference to ratios between parts is obtainable easily. To get a more precise insight into the meaning of the perturbation-difference, its squared clr coefficients, $\xi^2$, have been computed; they add up to the squared Aitchison distance between A and B; the percentage contribution of each part is also shown (%). It can now be asserted that P2O5 represents the main contribution to the difference (40.9%). This should be considered carefully, because these squared contributions are not orthogonal. In general, the squared balance of a single element against the remaining ones is proportional to the squared clr coefficient and a little greater than it. For example, compare the percentage contribution from the clr coefficient corresponding to the part SiO2 (5.7%) with that of the first balance (SiO2 against the remaining parts, 6.5%); except for a scaling factor, clr coefficients and balances of one part to the remaining ones have the same interpretation; see also equation (42).

A more suitable decomposition of the squared Aitchison distance into squared balances (37) is obtained using coordinates. The sequential binary partition coded in Table 6 defines coordinates which are intended to be interpretable balances. The corresponding coordinates are computed and the distance is obtained from the squared differences in the coordinates. The squared differences, $(A - B)^2$ in Table 6, are a decomposition of the squared Aitchison distance, $d_a^2(A, B)$, into components associated with the coordinates, which are in fact balances between groups. The largest squared difference, 0.194 (53.5%), corresponds to the second balance, between the groups {FeO2, MgO, CaO} and {TiO2, Al2O3, Na2O, K2O, P2O5}. The second squared difference in importance is the 7th, 0.1013 (27.9%), comparing the single part group {TiO2} with {K2O, P2O5}. The apparently important balance between the single part group {SiO2} and all other parts does not contribute greatly, 0.0235 (6.5%), to the squared Aitchison distance.
Table 6. Two oxide basalt compositions (fictitious) in mass percent. Sequential binary partition used to define coordinates. The Aitchison distance is computed.

          SiO2    TiO2    Al2O3   FeO2    MgO     CaO     Na2O    K2O     P2O5
A         47.0    2.25    17.1    10.2    8.06    8.89    3.82    2.56    0.80
B         48.9    2.24    14.0    10.5    9.84    8.95    3.05    1.90    0.49
A ⊖ B     9.42    9.84    12.0    9.52    8.03    9.74    12.3    13.2    16.0
ξ²        0.021   0.010   0.009   0.018   0.093   0.012   0.014   0.037   0.149
%         5.7     2.8     2.5     4.9     25.5    3.4     4.0     10.3    40.9

Sequential binary partition

Order     SiO2    TiO2    Al2O3   FeO2    MgO     CaO     Na2O    K2O     P2O5
1         +1      -1      -1      -1      -1      -1      -1      -1      -1
2          0      -1      -1      +1      +1      +1      -1      -1      -1
3          0       0       0      +1      -1      -1       0       0       0
4          0      -1      -1       0       0       0      +1      -1      -1
5          0      -1      +1       0       0       0       0      -1      -1
6          0       0       0       0      +1      -1       0       0       0
7          0      +1       0       0       0       0       0      -1      -1
8          0       0       0       0       0       0       0      +1      -1

Coordinates

Order        1        2       3        4         5         6        7        8
A            2.18     1.45    0.152    0.222     2.02      -0.069   0.369    0.822
B            2.33     1.89    0.092    0.243     2.07      0.067    0.688    0.958
(A - B)²     0.0235   0.194   0.0037   4.27E-4   3.09E-3   0.0186   0.1013   0.0184
%            6.5      53.5    1.0      0.1       0.9       5.1      27.9     5.1

$d_a(A, B) = 0.603$; $d_a^2(A, B) = 0.363$
The squared Aitchison distance due to a subcomposition is also computable from the squared differences of the coordinates. For instance, coordinates 7 and 8 are associated with the {TiO2, K2O, P2O5}-subcomposition and they jointly explain 33.0% of the squared Aitchison distance between A and B.
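The figures of Table 6 can be reproduced approximately with a few lines of code. The sketch below takes the compositions A and B and the sign code from Table 6 and computes the balances with the coefficients of equation (30); the function and variable names are illustrative assumptions, and small differences from the tabulated values are expected because the table is rounded.

```python
import math

# Parts in the order SiO2, TiO2, Al2O3, FeO2, MgO, CaO, Na2O, K2O, P2O5.
A = [47.0, 2.25, 17.1, 10.2, 8.06, 8.89, 3.82, 2.56, 0.80]
B = [48.9, 2.24, 14.0, 10.5, 9.84, 8.95, 3.05, 1.90, 0.49]

# Sequential binary partition of Table 6, one row per order.
SBP = [
    [+1, -1, -1, -1, -1, -1, -1, -1, -1],
    [ 0, -1, -1, +1, +1, +1, -1, -1, -1],
    [ 0,  0,  0, +1, -1, -1,  0,  0,  0],
    [ 0, -1, -1,  0,  0,  0, +1, -1, -1],
    [ 0, -1, +1,  0,  0,  0,  0, -1, -1],
    [ 0,  0,  0,  0, +1, -1,  0,  0,  0],
    [ 0, +1,  0,  0,  0,  0,  0, -1, -1],
    [ 0,  0,  0,  0,  0,  0,  0, +1, -1],
]

def clr(x):
    logs = [math.log(p) for p in x]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def balance(x, code):
    """Balance between the +1 and -1 groups, with the coefficients of (30)."""
    r, s = code.count(+1), code.count(-1)
    scale = math.sqrt(r * s / (r + s))
    logs_plus = sum(math.log(v) for v, c in zip(x, code) if c == +1)
    logs_minus = sum(math.log(v) for v, c in zip(x, code) if c == -1)
    return scale * (logs_plus / r - logs_minus / s)

coords_a = [balance(A, code) for code in SBP]
coords_b = [balance(B, code) for code in SBP]
squared = [(a - b) ** 2 for a, b in zip(coords_a, coords_b)]
d2 = sum(squared)                                       # decomposition (37)
d2_clr = sum((a - b) ** 2 for a, b in zip(clr(A), clr(B)))
share_7_8 = (squared[6] + squared[7]) / d2

print([round(v, 3) for v in coords_a])
print([round(100 * v / d2, 1) for v in squared])
print(round(d2, 3), round(math.sqrt(d2), 3), round(100 * share_7_8, 1))
```

The sum of squared balance differences agrees with the distance computed from clr coefficients, and the share of coordinates 7 and 8 is close to the 33.0% quoted in the text.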
Other representations of compositions

The normal acquisition of compositional data as percentages, concentrations, etc., leads to the raw representation of compositions, i.e. a vector of parts, possibly closed to a given constant $K$. This representation in parts of compositions has been described in the first section. The definition of compositional data based on the ratios of parts suggests that any valid representation of compositions should be based on them. A first attempt may be to represent the composition by a set of ratios of parts. However, meaningful operations and distances between compositions may involve complicated expressions using these ratios. This representation of compositions is easily done, but it is non-linear in the simplex, and thus confusing. Therefore, representations based on ratios of parts are not recommended for compositional data analysis. An important step forward was made by Aitchison (1982) in proposing log-ratios, i.e. logarithms of ratios of parts, as a natural way of representing compositions. He defined two transformations, the additive log-ratio (alr) transformation and the centred log-ratio (clr) transformation. Both assign real coefficients to a composition in such a way that their properties facilitate interpretation, operation or visualization of compositions. A representation of a composition by its coordinates with respect to a given orthonormal basis is a transformation from the simplex into a real space; this has been called the isometric log-ratio transformation (ilr) (Egozcue et al. 2003). The authors now prefer to call these transformations (alr, clr, ilr) representations, to enhance the geometric aspects involved. In fact, both the alr and ilr transformations are coordinates of a composition with respect to a basis, an oblique basis in the case of the alr transformation, and an orthonormal one in the case of the ilr transformation.

The clr transformation is also one-to-one, and is thus a representation, but the clr coefficients are not coordinates with respect to a basis. The clr coefficients of a composition $\mathbf{x}$ in $S^D$, stated in equation (7), can be re-written as

$$\mathrm{clr}(\mathbf{x}) = [\xi_1, \xi_2, \ldots, \xi_D] = \left[\ln\frac{x_1}{g(\mathbf{x})}, \ln\frac{x_2}{g(\mathbf{x})}, \ldots, \ln\frac{x_D}{g(\mathbf{x})}\right], \tag{38}$$

where again $g(\cdot)$ means geometric mean. Using this definition, it is easy to see that $\sum_{i=1}^{D} \xi_i = 0$ (Aitchison 1986). The expression in parts of the composition is recovered easily from the clr coefficients,

$$\mathbf{x} = [x_1, x_2, \ldots, x_D] = \mathcal{C}[\exp(\xi_1), \exp(\xi_2), \ldots, \exp(\xi_D)], \tag{39}$$

an expression already given in equation (7). The main properties of the clr coefficients are:

1. the clr coefficients translate perturbation and powering of compositions into ordinary sum and multiplication by a scalar of vectors of clr coefficients, as in equations (8) and (9);
2. ordinary Euclidean distances between vectors of clr coefficients are equal to the Aitchison distances of their corresponding compositions. This also holds for the inner product and the norm, i.e.

$$\langle \mathbf{x}, \mathbf{y} \rangle_a = \langle \mathrm{clr}(\mathbf{x}), \mathrm{clr}(\mathbf{y}) \rangle, \quad d_a(\mathbf{x}, \mathbf{y}) = d(\mathrm{clr}(\mathbf{x}), \mathrm{clr}(\mathbf{y})), \quad \|\mathbf{x}\|_a = \|\mathrm{clr}(\mathbf{x})\|, \tag{40}$$
where $\langle \cdot, \cdot \rangle$, $\| \cdot \|$ and $d(\cdot, \cdot)$ are the ordinary Euclidean inner product, norm or modulus, and distance. These two properties make the clr coefficients extremely useful, because they allow one to operate on compositions in a straightforward way without reference to an orthonormal basis. However, there are some inconveniences concerning the representation in clr coefficients. The first is that of representation in real space. It requires an additional dimension to be effective. A composition in $S^3$ is well represented in a two-dimensional plot, e.g. in a ternary diagram or in ilr coordinates, but clr coefficients require a three-dimensional plot. Another difficult point is the calculation of areas or volumes in the real space of clr coefficients. For example, a square in the space of clr coefficients is not a square in the sense of the Aitchison metric and, consequently, the area cannot be obtained by squaring the length of the side. This apparently minor point makes the representation of probability densities difficult in the real space of the clr coefficients. This negative property comes from the fact that clr coefficients represent a composition as a combination of perturbation-dependent vectors, one more than required and, hence, they do not constitute a basis of the simplex (Egozcue & Pawlowsky-Glahn 2005, appendix).
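The properties of the clr coefficients listed above can be illustrated with a brief sketch. The helper names (`clr_inverse`, `perturb`, `power`) and the test values are assumptions for this example; the checks cover the zero-sum constraint, the recovery of the parts as in equation (39), and the translation of perturbation and powering into vector sum and scalar multiplication.

```python
import math

def closure(parts):
    total = sum(parts)
    return [p / total for p in parts]

def clr(x):
    """clr coefficients of equation (38)."""
    logs = [math.log(p) for p in x]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def clr_inverse(xi):
    """Parts recovered from clr coefficients, equation (39)."""
    return closure([math.exp(v) for v in xi])

def perturb(x, y):
    return closure([a * b for a, b in zip(x, y)])

def power(c, x):
    return closure([a ** c for a in x])

x = closure([0.7, 0.2, 0.1])
y = closure([0.2, 0.5, 0.3])
c = 1.7

xi, eta = clr(x), clr(y)
print(sum(xi))                                  # zero up to rounding
print(clr_inverse(xi), x)                       # same composition
print(clr(perturb(x, y)), [a + b for a, b in zip(xi, eta)])
print(clr(power(c, x)), [c * a for a in xi])
```

The zero-sum constraint is the numerical counterpart of the remark that three clr coefficients are needed for a composition that has only two degrees of freedom.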
Finally, a single clr coefficient can be re-written as a log-ratio,

$$\xi_i = \ln \frac{x_i^{(D-1)/D}}{\prod_{j \neq i,\, j=1}^{D} x_j^{1/D}}, \tag{41}$$

similar to the balance between the single-part group $\{i\}$ and the complementary one. The balance between these groups is

$$\ln \frac{x_i^{\sqrt{(D-1)/D}}}{\prod_{j \neq i,\, j=1}^{D} x_j^{\sqrt{1/(D(D-1))}}} = \sqrt{\frac{D}{D-1}}\, \xi_i, \tag{42}$$

which differs significantly from the clr coefficient only for small values of $D$. Although this interpretation might favour the use of clr coefficients because of their symmetry in the parts, it can be used as well to support the use of balances, as they are easier to deal with from a mathematical point of view and have a similar interpretation.

The additive log-ratio transformation (alr) assigns the vector of coordinates

$$\mathbf{a} = [a_1, a_2, \ldots, a_{D-1}] = \left[\ln\frac{x_1}{x_D}, \ln\frac{x_2}{x_D}, \ldots, \ln\frac{x_{D-1}}{x_D}\right], \tag{43}$$

to the composition $\mathbf{x} = [x_1, x_2, \ldots, x_D]$. In equation (43), there is a detached part, the one appearing in the denominator of the ratios, in our case $x_D$. This choice is arbitrary and any part can be selected. The alr coordinates can be used to reconstruct the parts of a composition,

$$\mathbf{x} = \mathcal{C}[\exp(a_1), \exp(a_2), \ldots, \exp(a_{D-1}), 1]. \tag{44}$$

Perturbation and powering are also computed easily from alr coordinates. These properties can be expressed as

$$\mathrm{alr}(\mathbf{x} \oplus \mathbf{y}) = \mathrm{alr}(\mathbf{x}) + \mathrm{alr}(\mathbf{y}), \quad \mathrm{alr}(c \odot \mathbf{x}) = c \cdot \mathrm{alr}(\mathbf{x}). \tag{45}$$

Similarly to clr coefficients and ilr coordinates, alr coordinates transform the simplex operations into the ordinary real sum and product by scalars. However, metric properties are not expressed easily using alr coordinates. The alr coordinates are also log-ratios; their simplicity is the reason for their popularity. From definition (43), $a_i = \ln(x_i/x_D)$; except for a constant, $1/\sqrt{2}$, this is the balance between two single-part groups, $\{i\}$ and $\{D\}$; consequently, the $D$-th part appears as a reference to balance each part in the composition.

The $a_i$, $i = 1, 2, \ldots, D - 1$, are coordinates (Egozcue & Pawlowsky-Glahn 2005, appendix). The basis is, for $i = 1, 2, \ldots, D - 1$,

$\mathbf{e}_i^{\mathrm{alr}}$
:
D /~~Dl l-y1 [ (D2-1 -1 .... (3C exp Z1 ...... D-1
1 i-th element
1
'D-1
ll) 1
..... DE
'
(46)
where the compositions are unitary, perturbation-independent, but not orthogonal. The angles between each pair of compositions in the basis can be shown to be 60°. The expression of the composition is then

$$\mathbf{x} = \bigoplus_{i=1}^{D-1} (a_i \odot \mathbf{e}_i^{\mathrm{alr}}). \tag{47}$$
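The behaviour of alr coordinates can be sketched numerically, assuming the last part as the detached one as in equation (43). The function names and the two test compositions are illustrative assumptions. The example checks the reconstruction of the parts, the additivity under perturbation, and shows that the ordinary Euclidean distance between alr vectors does not reproduce the Aitchison distance.

```python
import math

def closure(parts):
    total = sum(parts)
    return [p / total for p in parts]

def clr(x):
    logs = [math.log(p) for p in x]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def alr(x):
    """alr coordinates of equation (43), with the last part detached."""
    return [math.log(p / x[-1]) for p in x[:-1]]

def alr_inverse(a):
    """Parts recovered from alr coordinates, equation (44)."""
    return closure([math.exp(v) for v in a] + [1.0])

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

x = [0.1, 0.3, 0.6]
y = [0.5, 0.3, 0.2]
xy = closure([a * b for a, b in zip(x, y)])      # perturbation of x and y

d_aitchison = euclidean(clr(x), clr(y))
d_alr = euclidean(alr(x), alr(y))
print(alr_inverse(alr(x)), x)
print(alr(xy), [a + b for a, b in zip(alr(x), alr(y))])
print(d_aitchison, d_alr)                        # the two values differ
```

This difference is one way to see why distances are awkward to handle in alr coordinates, whereas clr coefficients and ilr coordinates preserve them.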
Although alr coordinates have been used extensively in applications and were essential to define the additive-logistic normal distribution (Aitchison 1982), they are being replaced by both clr coefficients and ilr coordinates. Metric properties, particularly distances, are not easy to handle with alr coordinates, as already stated in Aitchison (1986), in contrast with both clr coefficients and ilr coordinates.

The authors thank the reviewers, R. Reyment and N. Coll, for their thorough reading and suggestions, which improved the paper greatly. The work has been supported financially by the Spanish government through the project BEM2003-05640/MATE.
References

AITCHISON, J. 1982. The statistical analysis of compositional data (with discussion). Journal of the Royal Statistical Society, Series B (Statistical Methodology), 44 (2), 139-177.
AITCHISON, J. 1983. Principal component analysis of compositional data. Biometrika, 70 (1), 57-65.
AITCHISON, J. 1986. The Statistical Analysis of Compositional Data. Monographs on Statistics and Applied Probability. Chapman & Hall Ltd., London. Reprinted (2003) with additional material by The Blackburn Press, Caldwell, NJ.
AITCHISON, J. & EGOZCUE, J. J. 2005. Compositional data analysis: where are we and where should we be heading? Mathematical Geology, 37 (7), 829-850.
AITCHISON, J., BARCELÓ-VIDAL, C., EGOZCUE, J. J. & PAWLOWSKY-GLAHN, V. 2002. A concise guide for the algebraic-geometric structure of the simplex, the sample space for compositional data analysis. In: BAYER, U., BURGER, H. & SKALA, W. (eds) Proceedings of IAMG'02 - The eighth annual conference of the International Association for Mathematical Geology, Vols. I and II, pp. 387-392. Selbstverlag der Alfred-Wegener-Stiftung, Berlin.
BARCELÓ-VIDAL, C., MARTÍN-FERNÁNDEZ, J. A. & PAWLOWSKY-GLAHN, V. 2001. Mathematical foundations of compositional data analysis. In: ROSS, G. (ed.) Proceedings of IAMG'01 - The sixth annual conference of the International Association for Mathematical Geology, 20. CD-ROM.
BILLHEIMER, D., GUTTORP, P. & FAGAN, W. 2001. Statistical interpretation of species composition. Journal of the American Statistical Association, 96 (456), 1205-1214.
BUCCIANTI, A. & PAWLOWSKY-GLAHN, V. 2005. New perspectives on water chemistry and compositional data analysis. Mathematical Geology, 37 (7), 703-727.
EGOZCUE, J. J. & PAWLOWSKY-GLAHN, V. 2005. Groups of parts and their balances in compositional data analysis. Mathematical Geology, 37 (7), 795-828.
EGOZCUE, J. J., PAWLOWSKY-GLAHN, V., MATEU-FIGUERAS, G. & BARCELÓ-VIDAL, C. 2003. Isometric logratio transformations for compositional data analysis. Mathematical Geology, 35 (3), 279-300.
PAWLOWSKY-GLAHN, V. & EGOZCUE, J. J. 2001. Geometric approach to statistical analysis on the simplex. Stochastic Environmental Research and Risk Assessment (SERRA), 15 (5), 384-398.
PAWLOWSKY-GLAHN, V. & EGOZCUE, J. J. 2002. BLU estimators and compositional data. Mathematical Geology, 34 (3), 259-274.
Exploratory compositional data analysis

J. DAUNIS-I-ESTADELLA1, C. BARCELÓ-VIDAL1 & A. BUCCIANTI2

1Departament Informàtica i Matemàtica Aplicada, Universitat de Girona, Campus Montilivi, P4, E-17071 Girona, Spain (e-mail: [email protected])
2Dipartimento di Scienze della Terra, Università degli Studi di Firenze, Via G. La Pira, 4, 50125, Firenze, Italy (e-mail: [email protected])

Abstract: This paper presents the first steps that should be performed whenever the study of a compositional dataset is initiated. The centre, variation matrix and total variance of a compositional dataset are introduced. In addition, biplots are introduced as a powerful tool to analyse and discover special features related to subcompositions. The exploratory methodology is applied to a dataset consisting of the major, minor and trace element composition of soil samples from several places in Tuscany (Italy). The structure of the data, collected from three different known country rocks of ophiolitic nature (basic and ultrabasic rocks), represents an interesting case study for experimenting with new methodologies of statistical investigation and for pointing out differences related to parental chemistry and mineralogy, as well as the nature of the processes related to the subsequent evolution.
This paper presents the first steps that should be performed whenever the study of a compositional dataset is initiated. Essentially, there are three steps:

• computing descriptive statistics, i.e. the centre and variation matrix of the dataset, as well as its total variability;
• examining the biplot of the dataset for patterns;
• performing an exhaustive subcompositional data analysis to evaluate the compositional variability and the patterns of variation of interesting subcompositions. In particular, the three-part subcompositions can be visualized in ternary diagrams after centring the dataset for better visualization.

Some general aspects should be considered first. In standard statistical analysis the first step is to check the dataset for errors and it is assumed that this has been done already using standard procedures (e.g. the minimum and maximum of each component to check whether the values are within an acceptable range). Furthermore, it is assumed that there are no zero observations in the samples. Zeros require specific techniques (Fry et al. 1996; Martín-Fernández et al. 2000, 2003; Martín-Fernández 2001; Aitchison & Kay 2003; Bacon-Shone 2003). For further information about the mathematical foundations of compositional data analysis and its terminology, see Barceló-Vidal et al. (2001) and Egozcue & Pawlowsky-Glahn (2006).
Centre, variation matrix and total variance of a dataset

Standard descriptive statistics are not very informative in the case of compositional data. In particular, the arithmetic mean and the variance or standard deviation of individual components do not fit in with the Aitchison geometry as measures of central tendency and dispersion. They were defined as such in the framework of Euclidean geometry in real space, which is not a sensible geometry for compositional data. Therefore, it is necessary to introduce alternatives, which can be found in the concepts of centre (Aitchison 1997), variation matrix (Aitchison 1986, p. 76) and total variance. Let $X = \{x_j = [x_{1j}, x_{2j}, \ldots, x_{Dj}] \in \mathcal{S}^D : j = 1, 2, \ldots, n\}$ be a compositional dataset of size n. The n rows $x_1, x_2, \ldots, x_n$ of the matrix X correspond to samples, and the D columns $X_1, X_2, \ldots, X_D$ correspond to the parts of the compositions.
Centre

A measure of central tendency for the compositional dataset X is the closed geometric mean, which is called the centre and is defined as

$$ g = C[g_1, g_2, \ldots, g_D], \quad \text{with} \quad g_i = \left( \prod_{j=1}^{n} x_{ij} \right)^{1/n}, \quad i = 1, 2, \ldots, D, \tag{1} $$

where C is the closure operator to constant K.
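Definition (1) translates directly into a few lines of code. The following is a minimal sketch, assuming NumPy and a data matrix whose rows are samples and whose columns are parts; the function names `closure` and `centre` and the small dataset are illustrative only.

```python
import numpy as np

def closure(x, k=1.0):
    """Closure operator C: rescale positive parts to sum to the constant k."""
    x = np.asarray(x, dtype=float)
    return k * x / x.sum(axis=-1, keepdims=True)

def centre(X, k=1.0):
    """Centre of a compositional dataset: closed column-wise geometric mean."""
    log_x = np.log(np.asarray(X, dtype=float))
    g = np.exp(log_x.mean(axis=0))  # geometric mean of each part over samples
    return closure(g, k)

# Hypothetical three-part dataset (rows = samples), closed to K = 100
X = closure(np.array([[10.0, 30.0, 60.0],
                      [20.0, 30.0, 50.0],
                      [5.0, 45.0, 50.0]]), k=100.0)

g = centre(X, k=100.0)
arithmetic_mean = X.mean(axis=0)  # shown only for comparison with the centre
```

The ratio `g[0] / g[1]` equals the geometric mean of the sample ratios `X[:, 0] / X[:, 1]`, which is the sense in which the centre carries information about ratios of parts only.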
From: BUCCIANTI,A., MATEU-FIGUERAS,G. & PAWLOWSKY-GLAHN,V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 161-174. 0305-8719/06/$15.00
© The Geological Society of London 2006.
Note that in the definition of the centre of a compositional dataset the geometric mean is considered column-wise (i.e. by variables), whereas in the clr transformation,

$$ \mathrm{clr}(x) = \left[ \ln\frac{x_1}{g(x)}, \ln\frac{x_2}{g(x)}, \ldots, \ln\frac{x_D}{g(x)} \right] = z, \tag{2} $$

the geometric mean $g(x) = \left( \prod_{i=1}^{D} x_i \right)^{1/D}$ is considered row-wise (i.e. by samples). It is easy to prove that the centre g can be calculated from the arithmetic mean of the clr-transformed dataset Z = clr(X). More precisely,

$$ g = \mathrm{clr}^{-1}[\bar{Z}_1, \bar{Z}_2, \ldots, \bar{Z}_D] = C[\exp(\bar{Z}_1), \exp(\bar{Z}_2), \ldots, \exp(\bar{Z}_D)], \tag{3} $$

with

$$ \bar{Z}_i = \frac{1}{n} \sum_{j=1}^{n} \ln\frac{x_{ij}}{g(x_j)}, \quad i = 1, 2, \ldots, D. \tag{4} $$

Notice that, although all samples $x_j = [x_{1j}, x_{2j}, \ldots, x_{Dj}]$, j = 1, 2, ..., n, were closed to a scale constant K (i.e. $\sum_{i=1}^{D} x_{ij} = K$, j = 1, 2, ..., n), the geometric mean vector $[g_1, g_2, \ldots, g_D]$ is not closed. For these reasons, the centre $g = C[g_1, g_2, \ldots, g_D]$ of the compositional set X expresses information only about the ratios of parts of the centre.

Centring a dataset

A standard way in geology to visualize data in a ternary diagram is to rescale the observations in such a way that their range is approximately the same. This is nothing else but applying a perturbation to the dataset, a perturbation which is usually chosen by trial and error. To overcome this somewhat arbitrary approach, note that for a composition x and its inverse $x^{-1}$, it holds that $x \oplus x^{-1} = e = [1/D, 1/D, \ldots, 1/D]$. This means that one can move by perturbation any composition to the barycentre of the simplex, in the same way that one moves real data in real space to the origin by translation. This property, together with the definition of centre, allows a strategy to be designed to move the set of samples in such a way that its structure is better visualized and all the pairwise ratios between parts are preserved. To do that requires computation of the centre g of the dataset and perturbation of the samples by the inverse $g^{-1}$. This has the effect of moving the centre of a dataset to the barycentre of the simplex, and the set of samples will gravitate around the barycentre, giving an automatic and optimal way of visualizing the data. This strategy was first introduced by Martín-Fernández et al. (1999) and used by Buccianti et al. (1999) in order to improve the visualization of compositional datasets in ternary diagrams. An extensive discussion can be found in von Eynatten et al. (2002), where it is shown that a perturbation transforms straight lines into straight lines. This allows the inclusion of gridlines and compositional fields in the graphical representation without the risk of a non-linear distortion. See Figure 1 for an example of a dataset before and after perturbation with the inverse of the closed geometric mean and the effect on the gridlines.

Variation matrix

A way to describe dispersion in a compositional dataset X is by the variation matrix defined as

$$ T = [\tau_{ij}] = \begin{bmatrix} \mathrm{var}\left[\ln\frac{X_1}{X_1}\right] & \mathrm{var}\left[\ln\frac{X_1}{X_2}\right] & \cdots & \mathrm{var}\left[\ln\frac{X_1}{X_D}\right] \\ \mathrm{var}\left[\ln\frac{X_2}{X_1}\right] & \mathrm{var}\left[\ln\frac{X_2}{X_2}\right] & \cdots & \mathrm{var}\left[\ln\frac{X_2}{X_D}\right] \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{var}\left[\ln\frac{X_D}{X_1}\right] & \mathrm{var}\left[\ln\frac{X_D}{X_2}\right] & \cdots & \mathrm{var}\left[\ln\frac{X_D}{X_D}\right] \end{bmatrix}, \tag{5} $$
where $\tau_{ij} = \mathrm{var}[\ln(X_i/X_j)]$ stands for the usual variance of the log-ratio of parts i and j. Note that, by definition, T is symmetrical and its diagonal contains only zeros. Furthermore, note that any single entry in T does not depend on the scale constant K associated with the sample space $\mathcal{S}^D$, as constants cancel out when taking ratios. Consequently, rescaling has no effect. Another way to describe the relative variability in a compositional dataset X is by means of the centred log-ratio covariance matrix $\Gamma$, i.e. the covariance matrix of Z = clr(X):

$$ \Gamma = [\gamma_{ij}] = \begin{bmatrix} \mathrm{var}[Z_1] & \mathrm{cov}[Z_1, Z_2] & \cdots & \mathrm{cov}[Z_1, Z_D] \\ \mathrm{cov}[Z_2, Z_1] & \mathrm{var}[Z_2] & \cdots & \mathrm{cov}[Z_2, Z_D] \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{cov}[Z_D, Z_1] & \mathrm{cov}[Z_D, Z_2] & \cdots & \mathrm{var}[Z_D] \end{bmatrix}, \tag{6} $$

with $Z_1, Z_2, \ldots, Z_D$ the column-variables of Z. The covariance matrix $\Gamma$ is singular because $\sum_{j=1}^{D} \gamma_{ij} = 0$, for i = 1, 2, ..., D.

Fig. 1. Simulated dataset: (a) before centring; (b) after centring.
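The variation matrix and the clr covariance matrix can be computed as follows. This is a minimal sketch, assuming NumPy, rows as samples and the usual sample variance (divisor n - 1); the function names and the simulated dataset are illustrative and not part of the original study.

```python
import numpy as np

def clr(X):
    """Centred log-ratio transform applied row-wise (by samples)."""
    log_x = np.log(np.asarray(X, dtype=float))
    return log_x - log_x.mean(axis=1, keepdims=True)

def variation_matrix(X, ddof=1):
    """T[i, j] = var[ln(X_i / X_j)] over the samples."""
    log_x = np.log(np.asarray(X, dtype=float))
    D = log_x.shape[1]
    T = np.zeros((D, D))
    for i in range(D):
        for j in range(D):
            T[i, j] = np.var(log_x[:, i] - log_x[:, j], ddof=ddof)
    return T

def clr_covariance(X, ddof=1):
    """Covariance matrix of the clr-transformed dataset."""
    return np.cov(clr(X), rowvar=False, ddof=ddof)

# Simulated four-part dataset, closed to 1
rng = np.random.default_rng(0)
raw = rng.lognormal(size=(50, 4))
X = raw / raw.sum(axis=1, keepdims=True)

T = variation_matrix(X)
Gamma = clr_covariance(X)
T_rescaled = variation_matrix(100.0 * X)  # same data closed to K = 100
```

In this sketch `T` is symmetric with a zero diagonal, `T_rescaled` coincides with `T`, and every row of `Gamma` sums to zero up to rounding, which is the singularity noted above. The two matrices are linked entry-wise by $\tau_{ij} = \gamma_{ii} + \gamma_{jj} - 2\gamma_{ij}$.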
Total variance

A measure of total relative variability of X is given by

$$ \mathrm{totvar}[X] = \frac{1}{2D} \sum_{i=1}^{D} \sum_{j=1}^{D} \mathrm{var}\left[\ln\frac{X_i}{X_j}\right], \tag{7} $$

where 'totvar' signifies the total variance of the dataset, i.e. the sum of all the elements of the variation matrix T divided by 2D. The total variance of X coincides with the trace of the centred log-ratio covariance matrix $\Gamma$, i.e.

$$ \mathrm{totvar}[X] = \mathrm{trace}(\Gamma) = \sum_{i=1}^{D} \mathrm{var}[Z_i]. \tag{8} $$
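The two expressions for the total variance can be compared numerically, together with the effect of centring. The following is a minimal sketch, assuming NumPy and the sample variance with divisor n - 1; all function names and the simulated data are illustrative.

```python
import numpy as np

def closure(x, k=1.0):
    x = np.asarray(x, dtype=float)
    return k * x / x.sum(axis=-1, keepdims=True)

def clr(X):
    log_x = np.log(np.asarray(X, dtype=float))
    return log_x - log_x.mean(axis=1, keepdims=True)

def centre(X):
    """Closed column-wise geometric mean of the dataset."""
    return closure(np.exp(np.log(X).mean(axis=0)))

def centre_dataset(X):
    """Perturb every sample by the inverse of the centre."""
    return closure(X / centre(X))

def totvar_from_T(X, ddof=1):
    """Sum of all entries of the variation matrix divided by 2D."""
    log_x = np.log(X)
    D = log_x.shape[1]
    diffs = log_x[:, :, None] - log_x[:, None, :]  # n x D x D log-ratios
    return diffs.var(axis=0, ddof=ddof).sum() / (2.0 * D)

def totvar_from_clr(X, ddof=1):
    """Trace of the clr covariance matrix."""
    return np.trace(np.cov(clr(X), rowvar=False, ddof=ddof))

rng = np.random.default_rng(1)
X = closure(rng.lognormal(mean=[0.0, 1.0, 2.0, 0.5], sigma=0.4, size=(40, 4)))

tv_T = totvar_from_T(X)
tv_clr = totvar_from_clr(X)
X_centred = centre_dataset(X)
tv_centred = totvar_from_T(X_centred)
new_centre = centre(X_centred)
```

Here `tv_T` and `tv_clr` agree, `new_centre` is the barycentre `[1/4, 1/4, 1/4, 1/4]`, and `tv_centred` equals `tv_T`: centring is a perturbation by a fixed composition, so it shifts every log-ratio by a constant and leaves the variation matrix unchanged.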
The biplot: a graphical display of compositional data

Gabriel (1971) introduced the biplot to represent simultaneously the rows and columns of any matrix by means of a two-rank approximation. Aitchison (1997) adapted it for compositional data and proved it to be a useful exploratory and expository tool. This section describes first, briefly, the mathematical foundations of this technique and then its interpretation. A very important reference is that of Gower & Hand (1996).

Construction of a biplot

Consider the data-matrix X with n rows and D columns and assume it has been centred already. Call Z = clr(X) the clr-transformed centred dataset. Suppose that the transformed data-matrix Z has rank s (usually s = D - 1). Thus, standard results can be applied to Z and, in particular, the fact that the best two-rank approximation Y to Z in the least squares sense is provided by the singular value decomposition of Z (Krzanowski 1988, pp. 126-128) introduced by J. J. Sylvester in 1889 and generalized by Eckart & Young (1936). The singular value decomposition of a matrix of coefficients is obtained from the matrix of eigenvectors L of ZZ', the matrix of eigenvectors M of Z'Z and the square roots of the s positive eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_s$ of either ZZ' or Z'Z, which are the same. As a result, taking $k_i = \sqrt{\lambda_i}$, one can write

$$ Z = L \begin{bmatrix} k_1 & 0 & \cdots & 0 \\ 0 & k_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & k_s \end{bmatrix} M', \tag{9} $$

where the singular values $k_1, k_2, \ldots, k_s$ are in descending order of magnitude. Both matrices L and M are orthonormal. The two-rank approximation is then obtained by substituting all singular values with index larger than 2 by a zero, hence

$$ Y = (l_1 \; l_2) \begin{pmatrix} k_1 & 0 \\ 0 & k_2 \end{pmatrix} \begin{pmatrix} m_1' \\ m_2' \end{pmatrix} = \begin{pmatrix} \ell_{11} & \ell_{21} \\ \ell_{12} & \ell_{22} \\ \vdots & \vdots \\ \ell_{1n} & \ell_{2n} \end{pmatrix} \begin{pmatrix} k_1 & 0 \\ 0 & k_2 \end{pmatrix} \begin{pmatrix} m_{11} & m_{12} & \cdots & m_{1D} \\ m_{21} & m_{22} & \cdots & m_{2D} \end{pmatrix}. \tag{10} $$
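The decomposition and the two-rank approximation can be reproduced with a standard numerical SVD. The following is a minimal sketch, assuming NumPy; the simulated data and variable names are illustrative, and centring is applied here by subtracting the column means of clr(X), which is equivalent to perturbing X by the inverse of its centre.

```python
import numpy as np

def clr(X):
    log_x = np.log(np.asarray(X, dtype=float))
    return log_x - log_x.mean(axis=1, keepdims=True)

# Simulated five-part dataset (rows = samples), closed to 1
rng = np.random.default_rng(2)
raw = rng.lognormal(sigma=0.5, size=(30, 5))
X = raw / raw.sum(axis=1, keepdims=True)

# clr of the centred dataset: column means removed
Z = clr(X)
Z = Z - Z.mean(axis=0)

# Thin SVD: Z = L diag(k) M', singular values in descending order
L, k, Mt = np.linalg.svd(Z, full_matrices=False)

# Two-rank approximation: keep only the two largest singular values
Y = (L[:, :2] * k[:2]) @ Mt[:2, :]

# Squared Frobenius error equals the sum of the discarded squared singular values
squared_error = np.linalg.norm(Z - Y) ** 2
discarded = np.sum(k[2:] ** 2)

# Squared singular values are the eigenvalues of Z'Z
eigenvalues = np.sort(np.linalg.eigvalsh(Z.T @ Z))[::-1]
```

Because the rows of Z sum to zero, the last singular value is numerically zero, in agreement with rank s = D - 1.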
The proportion of variability retained by this approximation is computed as $(\lambda_1 + \lambda_2) / \left( \sum_{i=1}^{s} \lambda_i \right)$. To obtain a biplot, it is first necessary to write Y as the product of two matrices GH', where G is an (n × 2) matrix and H is a (D × 2) matrix. There are different possibilities for obtaining such a factorization, one of which is

$$ Y = \begin{pmatrix} \sqrt{n-1}\,\ell_{11} & \sqrt{n-1}\,\ell_{21} \\ \sqrt{n-1}\,\ell_{12} & \sqrt{n-1}\,\ell_{22} \\ \vdots & \vdots \\ \sqrt{n-1}\,\ell_{1n} & \sqrt{n-1}\,\ell_{2n} \end{pmatrix} \begin{pmatrix} \dfrac{k_1 m_{11}}{\sqrt{n-1}} & \dfrac{k_1 m_{12}}{\sqrt{n-1}} & \cdots & \dfrac{k_1 m_{1D}}{\sqrt{n-1}} \\ \dfrac{k_2 m_{21}}{\sqrt{n-1}} & \dfrac{k_2 m_{22}}{\sqrt{n-1}} & \cdots & \dfrac{k_2 m_{2D}}{\sqrt{n-1}} \end{pmatrix} = \begin{pmatrix} g_1 \\ g_2 \\ \vdots \\ g_n \end{pmatrix} (h_1 \; h_2 \; \cdots \; h_D). \tag{11} $$

The biplot simply represents the n + D vectors $g_j$, j = 1, 2, ..., n, and $h_i$, i = 1, 2, ..., D, in a plane. The vectors $g_1, g_2, \ldots, g_n$ are termed the row markers of Y and correspond to the projections of the n samples on the plane defined by the first two eigenvectors of ZZ'. The vectors $h_1, h_2, \ldots, h_D$ are the column markers, which correspond to the projections of the D clr-parts on the plane defined by the first two eigenvectors of Z'Z. Both planes can be superimposed for a visualization of the relationship between samples and parts.

Interpretation of a compositional biplot

The biplot converts the two-rank approximation Y to Z given by the singular value decomposition into a graphical display. A biplot of compositional data consists of (Fig. 2a):

1. an origin O which represents the centre of the compositional dataset;
2. a vertex at position $h_i$ for each of the D parts; and
3. a case marker at position $g_j$ for each of the n samples or cases.

The join of O to a vertex $h_i$ is termed the ray $Oh_i$ and the join of two vertices $h_i$ and $h_k$, the link $h_i h_k$. These features constitute the basic characteristics of a biplot with the following five main properties for the interpretation of compositional variability (Fig. 2a).

1. Links and rays provide information on the relative variability in a compositional dataset, since

$$ |h_i h_k|^2 \approx \mathrm{var}\left[\ln\frac{X_i}{X_k}\right] \quad \text{and} \quad |Oh_i|^2 \approx \mathrm{var}[Z_i]. \tag{12} $$

Nevertheless, care is required for interpreting rays, which cannot be identified either with $\mathrm{var}[X_i]$ or with $\mathrm{var}[\ln X_i]$, as they depend on the full compositions through $g(x_1), \ldots, g(x_n)$ and vary when a subcomposition is considered.

2. Cosines of the angles between links estimate correlations between log-ratios. Thus, if links $i\ell$ and $k\ell$ have a common vertex $\ell$ (Fig. 2a) then

$$ \cos(i\ell k) \approx \mathrm{corr}[\ln(X_i/X_\ell), \ln(X_k/X_\ell)], \tag{13} $$

and if links $ik$ and $j\ell$ intersect in M (Fig. 2b) then

$$ \cos(iMj) \approx \mathrm{corr}[\ln(X_i/X_k), \ln(X_j/X_\ell)]. \tag{14} $$

Furthermore, if the two links are at right angles, then $\cos(iMj) \approx 0$, and zero correlation of the two log-ratios can be expected. This is useful for the investigation of subcompositions for possible independence.

3. Subcompositional analysis. The centre O is the centroid (centre of gravity) of the D vertices 1, 2, ..., D. Since ratios are preserved under formation of subcompositions, it follows that the biplot for any subcomposition S is formed simply by selecting the vertices corresponding to the parts of the subcomposition and taking the centre of the subcompositional biplot as the centroid $O_S$ of these vertices (Fig. 2c).

4. Coincident vertices. If vertices i and k coincide, or nearly so, this means that $\mathrm{var}[\ln(X_i/X_k)]$ is zero, or nearly so, so that the ratio $X_i/X_k$ is constant, or nearly so, and the two parts, $X_i$ and $X_k$, can be assumed to be redundant (Fig. 2d).

5. Collinear vertices. If a subset of vertices is collinear, then the associated subcomposition has a one-dimensional biplot and the subcomposition has unidimensional variability, i.e. compositions plot along a compositional line (Fig. 2d).
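The row and column markers, and the link and ray properties listed above, can be sketched numerically. The example below is a minimal illustration assuming NumPy and simulated data; variable names are hypothetical. It uses the factorization with the factor $\sqrt{n-1}$, and it also builds the markers with all components retained, for which the link and ray relations hold exactly rather than approximately.

```python
import numpy as np

def clr(X):
    log_x = np.log(np.asarray(X, dtype=float))
    return log_x - log_x.mean(axis=1, keepdims=True)

rng = np.random.default_rng(3)
raw = rng.lognormal(sigma=0.6, size=(40, 5))
X = raw / raw.sum(axis=1, keepdims=True)
n, D = X.shape

Z = clr(X)
Z = Z - Z.mean(axis=0)  # clr of the centred dataset
L, k, Mt = np.linalg.svd(Z, full_matrices=False)

# Two-dimensional markers: rows of G are case markers, rows of H are vertices
G = np.sqrt(n - 1) * L[:, :2]
H = (Mt[:2, :].T * k[:2]) / np.sqrt(n - 1)
Y = G @ H.T

# Proportion of variability retained by the two-rank approximation
proportion = np.sum(k[:2] ** 2) / np.sum(k ** 2)

# Column markers with all components retained (no approximation)
H_full = (Mt.T * k) / np.sqrt(n - 1)

def squared_link(H, i, j):
    """Squared length of the link joining vertices i and j."""
    d = H[i] - H[j]
    return float(d @ d)

link_01 = squared_link(H_full, 0, 1)
var_01 = np.var(np.log(X[:, 0] / X[:, 1]), ddof=1)
squared_rays = (H_full ** 2).sum(axis=1)
clr_variances = Z.var(axis=0, ddof=1)
link_01_biplot = squared_link(H, 0, 1)  # planar value, an approximation of var_01
```

With all components retained, `link_01` equals `var_01` and `squared_rays` equals `clr_variances`. In the plane, `link_01_biplot` can only underestimate `var_01`, and the quality of the approximation is governed by `proportion`.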
It must be clear from the above interpretation that the fundamental elements of a compositional biplot are the links, not the rays as in the case of variation diagrams for unconstrained multivariate data. The complete set of links determines the compositional covariance structure by specifying all the relative variances, and provides direct information about subcompositional variability and independence. It is also obvious that interpretation of the biplot concerns its internal geometry and would, for example, be unaffected by any rotation or mirror-imaging of the diagram. For applications of biplots to compositional data in a geological context see Aitchison (1990), and for a deeper insight into biplots of compositional data, with applications in other disciplines and extensions to conditional biplots, see Aitchison & Greenacre (2002).

Fig. 2. Biplot: interpretation.

Subcompositional analysis

Often one searches out the C-part subcompositions, with 2 ≤ C < D, which retain as much relative variability as possible from the total relative variability of the full compositional dataset X. The relative variability retained by a C-part subcomposition S can be calculated from the variation matrix T. It suffices to divide by 2C the summation of all (i, j)-elements of T with both indices i and j associated with parts included in the subcomposition S. In particular the relative variability retained by a three-part subcomposition $C[X_i, X_j, X_k]$ can be approximated graphically by the sum $|O_S h_i|^2 + |O_S h_j|^2 + |O_S h_k|^2$, where $O_S$ is the centroid of the vertices $h_i$, $h_j$ and $h_k$ in the biplot of dataset X. Therefore, the biplot serves as a first approach to the ranking of all three-part subcompositions according to the total relative variability retained by them. Furthermore, as has been mentioned previously, the subcompositions S with strong one-dimensional variability can be detected readily in the biplot because the corresponding vertices must appear approximately collinear. In any case, it is always necessary to confirm a posteriori the linear trend of a subcomposition by means of a simplicial principal component analysis of its parts.

Simplicial principal component analysis
The singular value decomposition of a data matrix (Eckart & Young 1936) is also the basis of the principal component analysis (PC analysis for short) of a compositional dataset. The underlying idea is simple. As in the construction of a biplot, it shall be assumed that the n × D matrix X of the compositional dataset has been centred already. Call Y the matrix of isometric log-ratio (ilr) coefficients of X (Egozcue et al. 2003), i.e. Y = ilr(X). It is assumed that the rank s of Y is D - 1. From standard theory it is known how to obtain the matrix M of eigenvectors $m_1, m_2, \ldots, m_{D-1}$ of Y'Y. This matrix is orthonormal and the relative variability of the data explained by the i-th eigenvector is $\lambda_i$, the i-th eigenvalue. Assume the eigenvalues have been labelled in descending order of magnitude, $\lambda_1 > \lambda_2 > \cdots > \lambda_{D-1}$. Call $a_1, a_2, \ldots, a_{D-1}$ the backtransformed eigenvectors, i.e. $\mathrm{ilr}^{-1}(m_i) = a_i$, i = 1, 2, ..., D - 1.
Fig. 3. Compositional principal axes for aphyric Skye lava compositions. The PC1 and PC2 axes retain 99% and 1%, respectively, of the total relative variability.

PC analysis consists then in retaining the first c compositional vectors $\{a_1, a_2, \ldots, a_c\}$ for a desired proportion of total relative variability to be explained. This proportion is computed as

$$ \left( \sum_{i=1}^{c} \lambda_i \right) \Big/ \left( \sum_{j=1}^{D-1} \lambda_j \right). \tag{15} $$
It is not difficult to demonstrate that the eigenvalues $\lambda_1 > \lambda_2 > \cdots > \lambda_{D-1}$ coincide with the D - 1 positive eigenvalues of the centred log-ratio covariance matrix $\Gamma$, and that $a_1, a_2, \ldots, a_{D-1}$ are also the clr-backtransformed unitary eigenvectors of $\Gamma$ associated with those eigenvalues. The parametric equation of the compositional straight-line principal axis associated with the backtransformed eigenvector $a_i$ ($PC_i$-axis for short) is given by the expression

$$ x(t) = g \oplus (t \odot a_i) = C[g_1 \exp(t \ln a_{i1}), g_2 \exp(t \ln a_{i2}), \ldots, g_D \exp(t \ln a_{iD})], \quad t \in \mathbb{R}, \tag{16} $$

where g is the centre of the compositional dataset X. See Figure 3 for an example of principal axes associated with a three-part dataset (Aitchison 1986, p. 195).
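A simplicial PC analysis can be sketched through the clr covariance matrix, using the equivalence with the ilr route stated above. The following is a minimal illustration assuming NumPy and simulated data; the function names (`principal_axis` in particular) are hypothetical and the eigen-decomposition of $\Gamma$ stands in for the ilr-based computation described in the text.

```python
import numpy as np

def closure(x, k=1.0):
    x = np.asarray(x, dtype=float)
    return k * x / x.sum(axis=-1, keepdims=True)

def clr(x):
    log_x = np.log(np.asarray(x, dtype=float))
    return log_x - log_x.mean(axis=-1, keepdims=True)

rng = np.random.default_rng(4)
X = closure(rng.lognormal(mean=[0.0, 0.5, 1.0, 1.5], sigma=[0.2, 0.5, 0.8, 0.3],
                          size=(60, 4)))
n, D = X.shape

g = closure(np.exp(np.log(X).mean(axis=0)))   # centre of the dataset
Gamma = np.cov(clr(X), rowvar=False)          # clr covariance matrix

# Eigen-decomposition of Gamma, sorted by decreasing eigenvalue
evals, evecs = np.linalg.eigh(Gamma)
order = np.argsort(evals)[::-1]
lam = evals[order][:D - 1]        # the D - 1 positive eigenvalues
V = evecs[:, order][:, :D - 1]    # unit clr eigenvectors (columns)

# Back-transformed eigenvectors a_i = clr^{-1}(v_i), one composition per row
a = closure(np.exp(V.T))

# Cumulative proportion of total relative variability explained
cumulative_proportion = np.cumsum(lam) / lam.sum()

def principal_axis(g, a_i, t):
    """Point x(t) = g (+) (t (.) a_i) on the compositional principal axis."""
    return closure(g * a_i ** t)

point = principal_axis(g, a[0], 2.0)
```

The sum of `lam` equals the trace of `Gamma`, i.e. the total variance, and `principal_axis(g, a[0], 0.0)` returns the centre `g`. In clr coefficients the axis is the straight line `clr(g) + t * V[:, 0]`.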
Application example: differences in the chemical composition of soils collected in a temperate climate on ophiolitic country rocks

Soils are an important component of terrestrial ecosystems, being essential for the growth of plants and the degradation and recycling of dead biomass. From a general point of view a soil is a complex heterogeneous medium consisting of mineral and organic solids and aqueous and gaseous components. The mineralogy is dominated by weathering (chemically decomposing) rock fragments and secondary minerals such as phyllosilicates or clay minerals, oxides (hydrous oxides and oxyhydroxides) of Fe, Al and Mn and sometimes carbonate (CaCO3), as a function of the chemical composition of the parent material. A soil can be considered also as a dynamic system where changes are related to variations in moisture status, pH and redox conditions, and/or changes in the management of the environment, all affecting the form and bioavailability of metals. In this framework, soils developed on ophiolitic bedrocks (basic and ultrabasic rocks) are characterized by low fertility and a high phytotoxic hazard due to their high contents of heavy metals such as Ni, Co and Cr, inherited from the parental materials. The influence of the nature of the bedrock on the geochemical characterization of ophiolitic soils developed in a temperate climate has scarcely been investigated. A tentative statistical evaluation of the differences is reported in Vaselli et al. (1997). The identification of numerical and graphical tools able to reveal compositional differences in a priori defined groups represents an important contribution in this field of research. Today the investigation of soil chemistry is particularly important since vast regions are degraded by erosion, salinization and pollution. Consequently, the use of statistical methodologies designed for capturing the real features of compositional data is important for managing the related problems of land use.

Description of the dataset
In this application example, major, minor and trace element data, related to soils from several places in Tuscany (Central Italy), were collected in situations in which the nature of the country rocks was known. Thus, a database consisting of three groups of soils, respectively from serpentinite (group 1, 52 samples), gabbroic (group 2, 37 samples) and basaltic (group 3, 39 samples) country rocks, was available for applying exploratory methods able to describe the data structure, as well as to model the differences among the three different starting points from which the evolution of the soils began. As reported in Vaselli et al. (1997), all the soil samples (A to C horizons) were air-dried, crushed to pass through a 2 mm sieve and quartered; representative subsamples were ground to <0.1 mm in an agate ball mill. Some major elements (Si, Ti, Al, Fe, Mn, Ca, K) were determined by XRF using a Philips PW 1480, while Mg, Ni, Cr, Co, Li, Cu, Pb and Zn were analysed by AAS using a Perkin-Elmer 303 after acid digestion (HClO4, HF, HCl; Vaselli et al. 1997 and references therein).
Analysis based on major elements

Table 1. Centres of the composition of major oxides (in %) of each bedrock group

Group           SiO2    TiO2    Al2O3   Fe2O3   MnO     MgO     CaO     K2O
Serpentinitic   45.26   0.13     5.42   12.20   0.25    22.02   1.22    0.10
Gabbroic        49.58   0.49    16.53    7.21   0.15     9.15   6.54    0.18
Basaltic        50.59   1.63    15.54   12.23   0.21     6.06   2.55    0.75
Global          50.50   0.43    10.84   11.02   0.22    12.11   2.61    0.23

Note: Centres had been calculated after adding a residual part (100 minus the sum of the eight major oxides composition) to all samples.

Table 2. Centred log-ratio variances and total variance for major oxide composition of each bedrock group

Group           SiO2     TiO2     Al2O3    Fe2O3    MnO      MgO      CaO      K2O      Total variance
Serpentinitic   0.1922   0.4800   0.0997   0.1516   0.1376   0.7368   0.9661   1.6555   4.4195
Gabbroic        0.0188   0.3300   0.0259   0.0339   0.1072   0.2617   0.2204   0.7472   1.7451
Basaltic        0.0977   0.2295   0.0531   0.0598   0.0466   0.0969   0.4716   0.6263   1.6814
Global          0.1579   0.9884   0.1629   0.2357   0.2730   1.0766   0.9275   1.4328   5.2547

Table 3. Variation matrix for major oxide composition of each bedrock group (the matrix is symmetric; the lower triangle is listed)

Part    Group           SiO2     TiO2     Al2O3    Fe2O3    MnO      MgO      CaO
TiO2    Serpentinitic   1.1426
        Gabbroic        0.3712
        Basaltic        0.4571
        Global          1.7220
Al2O3   Serpentinitic   0.3718   0.4582
        Gabbroic        0.0209   0.4338
        Basaltic        0.1541   0.1541
        Global          0.4363   0.7682
Fe2O3   Serpentinitic   0.1498   0.9921   0.2716
        Gabbroic        0.0633   0.2283   0.0571
        Basaltic        0.0482   0.3637   0.1329
        Global          0.1621   1.7981   0.6520
MnO     Serpentinitic   0.2628   0.8715   0.2099   0.1328
        Gabbroic        0.1810   0.3489   0.1516   0.0944
        Basaltic        0.1786   0.2809   0.0934   0.1301
        Global          0.2745   1.9041   0.6766   0.1266
MgO     Serpentinitic   0.3031   2.1630   0.9965   0.4429   0.7010
        Gabbroic        0.2272   0.9771   0.1999   0.3359   0.5290
        Basaltic        0.1598   0.4987   0.2505   0.1513   0.1730
        Global          0.6068   3.8995   1.6971   0.6200   0.7011
CaO     Serpentinitic   1.1782   1.4279   1.1748   1.2520   1.3268   1.8633
        Gabbroic        0.1872   0.7251   0.2116   0.2938   0.4589   0.1424
        Basaltic        0.7588   0.5715   0.5288   0.7590   0.5025   0.4129
        Global          1.1830   1.7726   0.8091   1.6486   1.6380   2.2243
K2O     Serpentinitic   2.5486   1.2042   1.7344   2.3910   2.0153   3.8442   3.9255
        Gabbroic        0.8450   1.3009   0.8775   0.9434   0.8390   1.4271   1.4897
        Basaltic        0.7065   1.1914   0.7926   0.5744   0.6958   0.8103   1.9205
        Global          2.1330   1.2973   1.5184   2.1328   2.1180   4.1188   3.3990

In a first phase a comparison among the centres of the three groups permits identification of those major and minor oxides that can have an important discriminatory character. The results are reported in Table 1. As can be seen, group 1 is well discriminated from the other two groups by the high MgO and low Al2O3 contents. On the other hand, groups 2 and 3 are discriminated by the Fe2O3 abundance, while K2O clearly divides group 3 from the other two groups. These differences are compatible with the nature of the parent material from which the soils have evolved. Group 1 is related to soils developed on serpentinite, an ultrabasic rock representing an altered peridotite. The content of Fe2O3 discriminates group 2 (gabbroic parent material, essentially composed of calcic plagioclase, pyroxene and iron oxides) from group 3 (basaltic parent material, consisting essentially of calcic plagioclase and pyroxene, with olivine and minor foids or minor interstitial quartz).

The analysis of the centres can be followed by the analysis of the relative variability of each variable in the three groups. Results reported in Table 2 reveal that K2O is the species characterized by the highest relative variability, due to the incompatible behaviour of K in the bedrock mineralogy (its concentration increases with the magmatic evolution of the parent material) and to its involvement in the chemical reactions that characterize the solutions of the soil-plant system. It is followed by CaO, showing a high relative variance in group 1, a behaviour which can be attributed to the sensitivity of the cation to weathering processes that have affected, particularly, the serpentinite parent material. Al2O3 has a low variance in all groups and represents a component inherited from the bedrock, subsequently involved in clay mineral formation. Al is mobilized only if particular conditions of oxidation-reduction and pH are present, a situation also characterizing Mn.

The analysis can be followed by the inspection of the variation matrices (Table 3) so that the ratios between variables affected by high or low variability can be identified. The patterns could be related to the similar/opposite geochemical behaviour of the variables involved in the natural system investigated. The results indicate that the ratios involving K2O are characterized by high variability. The ratios K2O/MgO (in the serpentinitic and gabbroic groups) and K2O/CaO (in all groups) are associated with high variability and involve cations mobilized to a different degree by aqueous solutions in the soils and which show incompatible (K) and compatible (Mg and Ca) behaviour in the mineralogy of the bedrock. In addition, it should be noted here that all the ratios involving variables that form hydrous oxides and oxy-hydroxides (MnO/Al2O3 and MnO/Fe2O3 in the serpentinitic and basaltic groups, Fe2O3/SiO2 in all groups, and Fe2O3/Al2O3 in the gabbroic group), mobilized only in particular Eh-pH conditions, show low variability. The behaviour of silica (influenced by pH conditions too) appears to be associated with that of Mn, Al and Fe.

After determining the centres and relative variability of each group, a biplot analysis was performed with the aim of determining the relationships among the three groups. The results reported in Figure 4 show that soils developed on a serpentinitic bedrock are characterized by a
high dispersion when compared with the others and that the three groups of soils maintain good separation. The approximate collinearity of the variable vectors MnO, Fe2O3 and Al2O3 in Figure 4 reveals the possible presence of a pattern in the ternary diagram of the subcomposition [MnO, Fe2O3, Al2O3] (Fig. 5). The diagram discloses trends among the different groups and involves only those variables affected by the formation of hydrous oxides and oxy-hydroxides and/or clay minerals. Starting at the Al2O3 vertex, the decrease of this component is associated with an increase in the variability of the Fe2O3/MnO ratio, marking the passage from soils of gabbroic rocks towards the basaltic and serpentinitic ones. The pattern from the Al2O3 vertex may be related to differences inherited from the bedrocks mixed with those deriving from the different reduction/oxidation conditions that favoured the precipitation of hydrous oxides and oxy-hydroxides and/or formation of clay minerals in the evolution of the soil. The pattern is also related approximately to the increase in clay content of the soils.

If the information from Table 1 and Figure 4 is combined, it is possible to select a 'good' three-part subcomposition for obtaining a clear discrimination among the three groups. The variables whose centres are very different and whose rays form angles higher than 90° in the biplot were chosen. Following this reasoning, in the plot of subcomposition [MgO, K2O, Al2O3] reported in Figure 6, soils developed on serpentinite are clearly discriminated by their MgO content, while soils developed on gabbros and basalts can be discriminated by considering the pattern that links the vertex of K2O with that of Al2O3.
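The relative variability retained by the two three-part subcompositions discussed above can be worked out from the published tables, following the rule of dividing by 2C the sum of the corresponding entries of the variation matrix. The sketch below is an illustration only: the pairwise values are copied from Table 3 (serpentinitic group), the total variance from Table 2, and the function name `retained_variability` is a hypothetical helper.

```python
from itertools import combinations

# Pairwise log-ratio variances for the serpentinitic group, copied from Table 3
tau = {
    frozenset(("Al2O3", "Fe2O3")): 0.2716,
    frozenset(("Al2O3", "MnO")): 0.2099,
    frozenset(("Fe2O3", "MnO")): 0.1328,
    frozenset(("Al2O3", "MgO")): 0.9965,
    frozenset(("Al2O3", "K2O")): 1.7344,
    frozenset(("MgO", "K2O")): 3.8442,
}

# Total variance of the serpentinitic group, copied from Table 2
TOTVAR_SERPENTINITIC = 4.4195

def retained_variability(parts, tau):
    """Sum of all (i, j) entries of T restricted to `parts`, divided by 2C."""
    C = len(parts)
    pair_sum = sum(tau[frozenset(pair)] for pair in combinations(parts, 2))
    # T is symmetric with a zero diagonal, so each pair is counted twice
    return 2.0 * pair_sum / (2.0 * C)

oxide_sub = retained_variability(["MnO", "Fe2O3", "Al2O3"], tau)
mgk_sub = retained_variability(["MgO", "K2O", "Al2O3"], tau)

oxide_share = oxide_sub / TOTVAR_SERPENTINITIC
mgk_share = mgk_sub / TOTVAR_SERPENTINITIC
```

Within the serpentinitic group the subcomposition [MnO, Fe2O3, Al2O3] retains roughly 0.20 of relative variability (about 5% of the total), whereas [MgO, K2O, Al2O3] retains roughly 2.19 (about 50% of the total).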
Fig. 4. Biplot of the major oxides composition (biplot components 1 and 2; proportion explained: 0.857). Symbols distinguish the serpentinitic, gabbroic and basaltic groups.
Fig. 5. Ternary diagram of the subcomposition [MnO, Fe2O3, Al2O3] after centring data.

Fig. 6. Ternary diagram of the subcomposition [MgO, K2O, Al2O3] after centring data.

Analysis based on trace elements

Table 4. Centres of the composition of the trace elements (in ppm) of each bedrock group

Group           Ni      Cr      Co      Cu      Pb      Zn      Li      Sr
Serpentinitic   1457    1601    104     45      24      85      14      25
Gabbroic         211     348     51     62      19      49       9      137
Basaltic         193     176     70     66      22      95      27      118
Global           453     529     75     55      22      75      15      63

Note: Centres had been calculated after adding a residual part (10^6 minus the sum of the eight trace elements composition) to all samples.

Table 5. Centred log-ratio variances and total variance of the composition of the trace elements of each bedrock group

Group           Ni       Cr       Co       Cu       Pb       Zn       Li       Sr       Total variance
Serpentinitic   0.3041   0.4163   0.1789   0.4795   0.2845   0.1647   0.6136   0.4793   2.9209
Gabbroic        0.1882   0.5246   0.1517   0.3964   0.5431   0.3141   0.5688   0.3989   3.0857
Basaltic        0.2078   0.3796   0.0490   0.2181   0.5206   0.0383   0.1378   0.7724   2.3237
Global          0.9203   1.1326   0.1466   …        …        …        …        1.5594   5.5074

Note: In the Global row, the values 0.4851 and 0.4314 belong to two of the four columns Cu, Pb, Zn and Li; the remaining two values are missing.

A similar exploratory investigation was performed by considering the behaviour of selected trace elements such as Co, Cr, Cu, Li, Ni, Pb, Sr and Zn. As can be seen in Table 4, where the centres are reported, Ni, Cr and Co abundances clearly discriminate soils of the serpentinitic group, a feature related to the compatible behaviour of these elements in the mineralogy of serpentinite rocks (as well as of the peridotites from which they derive). The high content of Sr, characterizing the gabbroic and basaltic groups, is related to the affinity of this element with Ca (it readily substitutes for Ca within plagioclase and behaves as an incompatible element from gabbros to basalts). The inspection of the data on variances in Table 5 shows that Sr also presents the highest value in the basaltic group, similarly to Pb in the gabbroic and basaltic groups, and Li in the serpentinitic and gabbroic groups. The value of the variance of these elements is related to their substitution geochemistry at the expense of major components, with the general tendency of increasing concentration in later stages of magmatic differentiation, and to mobilization by weathering processes. If the variation matrix of Table 6 is analysed, it is easy to conclude that the ratios Li/Ni and Li/Cr are associated with the highest relative variability while
those for Co/Ni and Co/Cr accord with the lower one. It is evident that the behaviour of the variables is inherited from the parent material and depends on the incompatible behaviour of Ni, Cr and Co in the mineralogy of the bedrocks and by the more complex reaction of Li to weathering processes. As in the previous case, a biplot analysis was performed (Fig. 7). In the case of trace elements the discrimination among the three groups is not as good as those obtained for major oxides and some data for the basaltic bedrock group overlap with those of the serpentinious bedrock group, whereas data of the gabbroic bedrock group encompass data of the basaltic bedrock group. The relative position of the variable vectors does not allow hypothesizing on the presence of subcompositions that can help reveal hidden patterns in the ternary diagram, as in the major oxides case. With the intention of studying the covariance structure of minor and major components, the exploratory investigation can be concluded with a biplot analysis considering them together. The first feature that emerges from the results reported in Figure 8 is the association of the vectors of C a t , Sr and A1203 in the same quadrant of the plot, as well as for Cr, Ni and MgO, and the lack of subcompositions. The previous associations are coherent with the geochemistry of the trace elements that substitute the respective major components in the lattice of silicate minerals. The indicated variables, together with K20 and T i t 2 (both usually showing an incompatible behaviour from peridotites and gabbros to basalts), are responsible for the data scattering as well as for the discrimination within the gabbroic bedrock group (high
Pb
Zn
Li
0.2041 0.6279
C a t , A1203 and Sr) and basaltic bedrock group (high Tit2, K20 and Li) in relation to the serpentinious bedrock group (high MgO, Cr and Ni).
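The summary statistics used in this section (centred log-ratio variances, total variance and the variation matrix of Tables 5 and 6) can be sketched as follows. This is a minimal illustration, not the software used by the authors: the function names and the ppm values are hypothetical, and the sample variance with an n - 1 denominator is an assumption.

```python
import math

def clr(row):
    """Centred log-ratio transform of one composition."""
    logs = [math.log(v) for v in row]
    mean_log = sum(logs) / len(logs)
    return [l - mean_log for l in logs]

def variance(values):
    """Sample variance (n - 1 denominator, an assumption of this sketch)."""
    n = len(values)
    m = sum(values) / n
    return sum((v - m) ** 2 for v in values) / (n - 1)

def clr_variances(data):
    """Variance of each clr-transformed part."""
    z = [clr(row) for row in data]
    return [variance([row[j] for row in z]) for j in range(len(data[0]))]

def total_variance(data):
    """Total variance as the sum of the clr variances."""
    return sum(clr_variances(data))

def variation_matrix(data):
    """Symmetric matrix with entries var(ln(x_i / x_j))."""
    d = len(data[0])
    t = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1, d):
            v = variance([math.log(row[i] / row[j]) for row in data])
            t[i][j] = t[j][i] = v
    return t

# Hypothetical trace-element contents (ppm), for illustration only.
samples = [
    [1400.0, 1500.0, 100.0, 20.0],
    [1600.0, 1700.0, 110.0, 30.0],
    [200.0, 350.0, 50.0, 140.0],
    [190.0, 180.0, 70.0, 120.0],
    [250.0, 300.0, 60.0, 100.0],
]
T = variation_matrix(samples)
totvar = total_variance(samples)
```

The two summaries are linked: the total variance equals the sum of the distinct off-diagonal entries of the variation matrix divided by the number of parts D. The published tables satisfy this relation. The 28 distinct serpentinitic entries of Table 6 add up to about 23.367, and dividing by D = 8 gives about 2.921, in agreement with the serpentinitic total variance of 2.9209 in Table 5.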
Table 6. Variation matrix of the composition of the trace elements of each bedrock group (the matrix is symmetric, so each pair of elements is listed once)

Pair   Serpentinitic  Gabbroic  Basaltic  Global
Ni/Cr  0.3035         0.3493    0.4438    0.4103
Ni/Co  0.1788         0.2168    0.1651    0.7020
Ni/Cu  1.0965         0.9053    0.5655    2.2067
Ni/Pb  0.5946         0.8051    0.8114    1.5478
Ni/Zn  0.5076         0.6962    0.2116    1.3072
Ni/Li  1.4309         1.0518    0.2707    2.2554
Ni/Sr  1.2415         0.5666    1.5180    4.4408
Cr/Co  0.2635         0.7579    0.5076    1.0510
Cr/Cu  1.4061         1.2872    0.8742    2.4993
Cr/Pb  0.7210         1.2650    0.9467    1.7765
Cr/Zn  0.6144         1.2554    0.4850    1.6927
Cr/Li  1.7316         1.3822    0.8222    2.8019
Cr/Sr  1.2114         0.9854    1.2813    4.3363
Co/Cu  0.9853         0.5492    0.2799    0.8522
Co/Pb  0.4517         0.7918    0.7628    0.6807
Co/Zn  0.3192         0.4988    0.0591    0.3310
Co/Li  1.1856         0.9045    0.1013    0.9683
Co/Sr  0.9679         0.5800    0.8399    2.0952
Cu/Pb  1.1266         1.0826    0.7868    1.0555
Cu/Zn  0.6600         0.2668    0.2192    0.5422
Cu/Li  0.5892         0.9818    0.2430    0.7594
Cu/Sr  0.8933         1.1839    1.1001    1.4731
Pb/Zn  0.5281         0.7404    0.6484    0.6552
Pb/Li  0.9349         1.5977    0.7842    1.2115
Pb/Sr  0.8400         1.1476    1.7480    2.0312
Zn/Li  0.9824         1.0229    0.0975    0.7700
Zn/Sr  0.6269         1.1179    0.9090    1.8418
Li/Sr  0.9748         0.6956    1.1070    1.7644

Fig. 7. Biplot of the composition of the trace elements (components 1 and 2; proportion explained: 0.751). Symbols distinguish serpentinitic, gabbroic and basaltic soils.

Fig. 8. Biplot for major oxides and the composition of the trace elements (components 1 and 2; proportion explained: 0.692). Symbols distinguish serpentinitic, gabbroic and basaltic soils.

Modelling linear trends in subcompositions

In order to present an example of modelling of linear trends in subcompositions, the group of soils developed on basaltic rocks was chosen. Rocks such as basalts are dominated by a few common minerals, such as pyroxene (Ca, Mg, Fe)2SiO3, olivine (Mg, Fe)2SiO4 and feldspar, a solid solution of NaAlSi3O8-CaAl2Si2O8, with small amounts of KAlSi3O8 and some trace minerals. Within this framework, the investigation of the subcompositions [MgO, CaO, K2O] and [SiO2, TiO2, Al2O3] may give indications about the chemical processes that have affected the bedrock on forming the soil.

In the first case (Fig. 9) the data pattern moving from the MgO-CaO side of the ternary diagram towards the K2O vertex can represent the weathering reactions of olivine, pyroxene and feldspar. The reactions contribute Mg2+ and Ca2+ to the interstitial water of the soil, so that their decrease towards the K2O vertex can describe the direction of increasing mobilization. This type of process can also involve K2O, but the element is then used in clay mineral formation. The pattern towards the K2O vertex is related approximately to an increase in clay content of the samples.

In the second case (Fig. 10), the pattern of the subcomposition [SiO2, TiO2, Al2O3], from the SiO2 vertex towards the TiO2-Al2O3 side of the ternary diagram, indicates the leaching of SiO2 into solution and the increase in abundance of elements characterized by low solubility in water. The pattern towards the TiO2-Al2O3 side is again associated with the increase in clay content. Investigation of data by compositional data analysis yields two simple subcompositions in which data patterns are related to the action of geochemical processes, thus giving the basis for their statistical modelling in an appropriate sample space.

Fig. 9. Compositional principal axes for the [MgO, CaO, K2O] subcomposition (after centring data) of basaltic rocks. The PC1 and PC2 axes retain 93% and 7%, respectively, of the total relative variability (cumulative proportion explained: 0.93051, 1).

Fig. 10. Compositional principal axes for the [SiO2, TiO2, Al2O3] subcomposition (after centring data) of basaltic rocks. The PC1 and PC2 axes retain 90% and 10%, respectively, of the total relative variability (cumulative proportion explained: 0.89591, 1).
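The proportions of total relative variability retained by the compositional principal axes of a three-part subcomposition, as quoted for Figures 9 and 10, can be obtained along the following lines. The sketch assumes one particular orthonormal (isometric log-ratio) basis. The eigenvalues do not depend on which orthonormal basis is chosen, but the basis, the function names and the data values below are all hypothetical and do not reproduce the authors' computation.

```python
import math

def ilr3(x):
    """Two orthonormal log-ratio coordinates of a three-part composition."""
    a, b, c = (math.log(v) for v in x)
    return ((a - b) / math.sqrt(2.0), (a + b - 2.0 * c) / math.sqrt(6.0))

def cov2(points):
    """Sample covariance (n - 1 denominator) of 2-D points: s00, s01, s11."""
    n = len(points)
    m0 = sum(p[0] for p in points) / n
    m1 = sum(p[1] for p in points) / n
    s00 = sum((p[0] - m0) ** 2 for p in points) / (n - 1)
    s11 = sum((p[1] - m1) ** 2 for p in points) / (n - 1)
    s01 = sum((p[0] - m0) * (p[1] - m1) for p in points) / (n - 1)
    return s00, s01, s11

def principal_axes(data):
    """Return (share of PC1, share of PC2, total variance)."""
    s00, s01, s11 = cov2([ilr3(row) for row in data])
    mean = (s00 + s11) / 2.0
    half = math.hypot((s00 - s11) / 2.0, s01)
    l1, l2 = mean + half, mean - half  # eigenvalues of the 2x2 covariance
    total = l1 + l2
    return l1 / total, l2 / total, total

# Hypothetical [MgO, CaO, K2O] values (wt%), for illustration only.
samples = [
    [8.0, 9.0, 0.5],
    [7.0, 8.5, 0.8],
    [5.0, 6.0, 1.5],
    [4.0, 4.5, 2.2],
    [6.5, 7.0, 1.0],
]
share_pc1, share_pc2, total = principal_axes(samples)
```

Because the coordinates are orthonormal, the sum of the two eigenvalues equals the total variance of the subcomposition, so the two shares add up to 1, just as the cumulative proportions in the figures reach 1 at the second axis.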
Conclusions

As long ago as 1897, Pearson, in a classic paper on spurious correlation, first pointed out the problems of interpreting correlations between ratios whose numerators and denominators contain common parts. With the developments in compositional data analysis there are now several tools available to describe the true features of constrained data. The purpose of this paper is to present methods for exploring compositional data in order to obtain a basis for further statistical modelling in a correct sample space. The presented methods have been applied to the investigation of the chemical composition of soils sampled from three different types of country rock of ophiolitic nature in Tuscany (central Italy). The aim has been to obtain information about the differences relating to parent material as well as about the geochemical processes affecting subsequent evolution.

Soils cover a large fraction of the Earth's terrestrial environment, provide a supporting medium for many forms of life and are the basis of agriculture. Soils are the ultimate product of rock weathering and act as a filter for aqueous and solid inputs, including rain, municipal wastes, pesticides and other chemicals. Movement of water causes dissolved material and colloidal particles to be carried from one layer to another. Vegetation plays a role, particularly as it decays and so furnishes acids to the water; some of the organic matter may also be transported by the interstitial water and adsorbed by clays. Roots cause mechanical disruption and, upon their decay, form open channels (root-casts) for the downward transport of solutions and particulate matter. The activity of earthworms, ants, termites and so on causes continual mixing of the organic and mineral matter. Bacteria play a major part in decomposing organic matter, and probably serve as catalysts for inorganic reactions, including the oxidation of iron and manganese. The result of this complex of reactions is the material usually called soil. Soil formation is not distinguished sharply from weathering, but is merely the last stage of the weathering process.
The results verify that: (1) the three groups of soils can be distinguished by features (major and minor oxides) inherited from the parent material (for example, group 1 is well distinguished from the other two groups by the high MgO and low Al2O3 contents; groups 2 and 3 are well separated from each other by the abundance of Fe2O3, while K2O clearly discriminates group 3 from the other two groups); (2) the variability of the major and minor oxide ratios is related to the high or low mobility of the elements involved (for example, variable ratios that form hydrous oxides and oxy-hydroxides mobilized only under particular Eh-pH conditions show low variability, whereas variable ratios involving cations mobilized to a different degree by aqueous solutions show high variability); (3) the content of trace elements is inherited from the parent material and their behaviour tends to follow the geochemistry of the major ones, even if group discrimination is not as good as in the case of the major and minor components; (4) the investigation concerning the presence of subcompositions in the group of soils developed on basaltic rocks has, as an example, led to our modelling data patterns related to mobilization phenomena as well as to the increase in clay content.

The authors express thanks to Vera Pawlowsky for useful comments on an earlier version of this paper, and to the reviewers, R. Reyment, H. Rollinson and E. Grunsky, for their thorough reading and suggestions, which greatly improved the paper. This work was supported by the Dirección General de Investigación of the Spanish Ministry for Science and Technology through the project BFM20003-05640/MATE and from Italian MIUR (Ministero dell'Istruzione, dell'Università e della Ricerca Scientifica e Tecnologica), PRIN 2004, through the GEOBASI project (prot. 2004048813-002).
References

AITCHISON, J. 1986. The Statistical Analysis of Compositional Data. Monographs on Statistics and Applied Probability. Chapman & Hall Ltd, London. Reprinted (2003) with additional material by The Blackburn Press, Caldwell, NJ.
AITCHISON, J. 1990. Relative variation diagrams for describing patterns of compositional variability. Mathematical Geology, 22 (4), 487-511.
AITCHISON, J. 1997. The one-hour course in compositional data analysis or compositional data analysis is simple. In: PAWLOWSKY-GLAHN, V. (ed.) Proceedings of IAMG'97 - The third annual conference of the International Association for Mathematical Geology, Volume I, II and addendum, pp. 3-35. International Center for Numerical Methods in Engineering (CIMNE), Barcelona (E).
AITCHISON, J. & GREENACRE, M. 2002. Biplots for compositional data. Journal of the Royal Statistical Society, Series C (Applied Statistics), 51 (4), 375-392.
AITCHISON, J. & KAY, J. 2003. Possible solution of some essential zero problems in compositional data analysis. In: THIÓ-HENESTROSA, S. & MARTÍN-FERNÁNDEZ, J. A. (eds) Compositional Data Analysis Workshop - CoDaWork'03, Proceedings. Universitat de Girona (http://ima.udg.es/Activitats/CoDaWork03/).
BACON-SHONE, J. 2003. Modelling structural zeros in compositional data. In: THIÓ-HENESTROSA, S. & MARTÍN-FERNÁNDEZ, J. A. (eds) Compositional Data Analysis Workshop - CoDaWork'03, Proceedings. Universitat de Girona (http://ima.udg.es/Activitats/CoDaWork03/).
BARCELÓ-VIDAL, C., MARTÍN-FERNÁNDEZ, J. A. & PAWLOWSKY-GLAHN, V. 2001. Mathematical foundations of compositional data analysis. In: ROSS, G. (ed.) Proceedings of IAMG'01 - The sixth annual conference of the International Association for Mathematical Geology, p. 20. CD-ROM.
BUCCIANTI, A., PAWLOWSKY-GLAHN, V., BARCELÓ-VIDAL, C. & JARAUTA-BRAGULAT, E. 1999. Visualization and modeling of natural trends in ternary diagrams: a geochemical case study. In: LIPPARD, S. J., NÆSS, A. & SINDING-LARSEN, R. (eds) Proceedings of the 5th Annual Conference of the International Association for Mathematical Geology, Trondheim, Norway, pp. 139-144.
ECKART, C. & YOUNG, G. 1936. The approximation of one matrix by another of lower rank. Psychometrika, 1, 211-218.
EGOZCUE, J. J. & PAWLOWSKY-GLAHN, V. 2006. Simplicial geometry for compositional data. In: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 145-159.
EGOZCUE, J. J., PAWLOWSKY-GLAHN, V., MATEU-FIGUERAS, G. & BARCELÓ-VIDAL, C. 2003. Isometric logratio transformations for compositional data analysis. Mathematical Geology, 35 (3), 279-300.
FRY, J. M., FRY, T. R. L. & MCLAREN, K. R. 1996. Compositional data analysis and zeros in micro data. Centre of Policy Studies (COPS), General Paper No. G-120, Monash University.
GABRIEL, K. R. 1971. The biplot - graphic display of matrices with application to principal component analysis. Biometrika, 58 (3), 453-467.
GOWER, J. C. & HAND, D. J. 1996. Biplots. Chapman & Hall Ltd, London.
KRZANOWSKI, W. J. 1988. Principles of Multivariate Analysis: A user's perspective, Volume 3 of Oxford Statistical Science Series. Clarendon Press, Oxford.
MARTÍN-FERNÁNDEZ, J. A. 2001. Medidas de diferencia y clasificación no paramétrica de datos composicionales. PhD thesis, Universitat Politècnica de Catalunya, Barcelona, Spain.
MARTÍN-FERNÁNDEZ, J. A., BREN, M., BARCELÓ-VIDAL, C. & PAWLOWSKY-GLAHN, V. 1999. A measure of difference for compositional data based on measures of divergence. In: LIPPARD, S. J., NÆSS, A. & SINDING-LARSEN, R. (eds) Proceedings of IAMG'99 - The fifth annual conference of the International Association for Mathematical Geology, Volume I and II, pp. 211-216. Tapir, Trondheim, Norway.
MARTÍN-FERNÁNDEZ, J. A., BARCELÓ-VIDAL, C. & PAWLOWSKY-GLAHN, V. 2000. Zero replacement in compositional data sets. In: KIERS, H., RASSON, J., GROENEN, P. & SHADER, M. (eds) Studies in Classification, Data Analysis, and Knowledge Organization (Proceedings of the 7th Conference of the International Federation of Classification Societies (IFCS'2000), University of Namur, Namur, 11-14 July), pp. 155-160. Springer-Verlag, Berlin, Germany.
MARTÍN-FERNÁNDEZ, J. A., BARCELÓ-VIDAL, C. & PAWLOWSKY-GLAHN, V. 2003. Dealing with zeros and missing values in compositional data sets using nonparametric imputation. Mathematical Geology, 35 (3), 253-278.
PEARSON, K. 1897. Mathematical contributions to the theory of evolution on a form of spurious correlation which may arise when indices are used in the measurement of organs. Proceedings of the Royal Society of London, LX, 489-502.
VASELLI, O., BUCCIANTI, A., DE SIENA, C., BINI, C., CORADOSSI, N. & ANGELONE, M. 1997. Geochemical characterization of ophiolitic soils in a temperate climate: a multivariate statistical approach. Geoderma, 75, 117-133.
VON EYNATTEN, H., PAWLOWSKY-GLAHN, V. & EGOZCUE, J. J. 2002. Understanding perturbation on the simplex: a simple method to better visualise and interpret compositional data in ternary diagrams. Mathematical Geology, 34 (3), 249-257.
Frequency distributions and natural laws in geochemistry

A. BUCCIANTI1, G. MATEU-FIGUERAS2 & V. PAWLOWSKY-GLAHN2

1Dipartimento di Scienze della Terra, Università di Firenze, Via G. La Pira 4, I-50125, Firenze, Italy (e-mail: antonella.buccianti@unifi.it)
2Departament Informàtica i Matemàtica Aplicada, Universitat de Girona, Campus Montilivi, P4, E-17071 Girona, Spain

Abstract: The chemical composition of natural waters is derived from many different sources of solutes, including gases and aerosols from the atmosphere, weathering and erosion of rocks and soil, solution or precipitation reactions occurring below the land surface, and effects resulting from human activities. The chemical composition of the crustal rocks of the Earth, as well as the composition of the ocean and the atmosphere, are important in evaluating sources of solutes. Data used in the investigation of natural and non-natural contributions are usually obtained from chemical analysis of water samples, which may be statistically evaluated with the aim of summarizing the contained information. However, as these data are compositional and thus constrained to move in the simplex, application of usual statistical methodologies may lead to incorrect evaluations and/or interpretations. This paper focuses on how to draw information on natural processes by modelling univariate and multivariate frequency distributions using water data. The chemical compositions of 977 samples collected in wells from Vulcano island (Italy) are used as a case study. The methodological approach can be transferred to the investigation of other geochemical or constrained data.
Modelling of geochemical processes affecting water chemistry by applying univariate and multivariate statistical methodologies may improve the interpretation of complex systems and, in the process of finding proper models, the sample space represents an important issue. Chemical constituents of water are usually reported in gravimetric units, such as milligrammes per litre (ppm) or milliequivalents per litre. In both cases, the sample space is constrained and has a metric different from the ordinary one, as measurements cannot fall outside the interval $(0, 10^6)$ or $(0, 10^9)$ of the real line, and the measure of difference between observations is relative. It is only reasonable to expect the statistical results to respect these facts and this is only possible if the model used does so. The problem is that most usual statistical models require, instead, the whole real line as sample space and an absolute measure of difference. Therefore, when constrained data with a relative scale are managed, the evaluation of the shape of a frequency distribution, as well as the application of statistical tests to evaluate the performance of a given model, is at most a good approximation and, in any case, questionable. Why is the knowledge about the distribution that can be used to model the data so important? There are at least two reasons. One is that distributions can frequently be associated with a generating process which might be helpful in understanding the studied phenomenon. This is well known in
geochemistry. For example, concentrations of different elements in terrestrial waters displayed in cumulative frequency plots are used to evaluate the effects of natural phenomena (Davies & DeWiest 1996). For some species, such as Ca2+, HCO3-, SiO2, K+ and F-, a steep gradient was found, indicating that the solubility of a mineral places an upper limit on the maximum concentration of the species in natural waters. For most of the other species, the curve has a lower gradient, suggesting that the concentrations depend rather on their availability in rocks, on very slowly dissolving minerals, or are controlled by biological processes. The other reason is that most statistical techniques assume some underlying distribution and it is important that it fits the data reasonably well for results to be valid. For example, standard correlation analysis, factor analysis, discriminant analysis, tests and computation of probability levels are usually based on the assumption of a normal (Gaussian) distribution. In general, the choice of the best probability model for data description is a widely neglected topic, although the problem has been discussed in several papers. Ahrens (1953, 1954a, b, 1957) proposed the log-normal distribution for geochemical data as a universal law after analysing the abundance of trace elements in crustal rocks, expressed as the frequency of appearance versus concentration. His idea encountered criticism (Aubrey 1954, 1956; Chayes 1954; Miller
From: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 175-189. 0305-8719/06/$15.00 © The Geological Society of London 2006.
& Goldberg 1955; Vistelius 1960), but in recent textbooks his assertion is commonly reported. Consequently, the log-transformation is the one most frequently used when geochemical data are managed. However, besides the fact that a logarithmic transformation does not project data from a constrained sample space to the whole real line, log-transformed data often do not adjust sufficiently well to a normal distribution (McGrath & Loveland 1992), as they still present some skewness. This is generally attributed to imprecise measurements, to many potential sources of error involved in sampling, to sample preparation and analysis, to detection limit problems, to the presence of outliers, or to mixing of populations (Reimann & Filzmoser 1999). Another approach for interpreting empirical geochemical distributions has been given by Allègre & Lewin (1995), who considered that they are not of a unique type. They proposed a unified theory based on the action of differentiation-mixing operators able to give, respectively, fractal and Gaussian distributions. If their reasoning is inverted, when the distribution is Gaussian, mixing is the acting geochemical operator and, when the distribution is fractal or multifractal, differentiation is the dominant process. However, none of the cited works considers the sample space of geochemical data explicitly, nor do the authors justify the appropriateness of models with support for the whole real line (normal distribution), or the strictly positive real line (log-normal model), for data which are constrained to an interval. The conclusions on the shape of the frequency distribution are therefore in general inconsistent, at least from a formal point of view. Recently, the skewness of log-transformed data has been tentatively modelled using the logskew-normal distribution (Mateu-Figueras 2003).
This model is defined on the positive real line, combining the logarithmic transformation and the univariate skew-normal model on the real line (Azzalini 1985). It is an extension of the log-normal distribution obtained by adding a shape parameter. This model could be considered a good alternative, but it is defined on the whole positive real line, the same as the log-normal model. Thus, this model is also questionable for constrained data. But the multivariate extension by Mateu-Figueras et al. (2005) of the logistic normal distribution (Aitchison 1986), defined on the simplex, and thus on a constrained sample space, appears to be a reasonable alternative for compositional data. Following the Aitchison (1986) approach of using log-ratios, the basic idea is to transform compositional data to the whole real space using an appropriate log-ratio transformation and to use the multivariate skew-normal distribution (Azzalini & Dalla Valle 1996; Azzalini & Capitanio 1999) for modelling the transformed sample. The advantages of this strategy are many: it takes into account the constrained character of the sample space, it is consistent with its algebraic-geometric structure, and it is flexible enough to account for some skewness in the transformed data. The aim here is to draw information on the action of natural processes in a correct statistical framework so that inferential modelling can be developed. In this context, the potential of the logistic skew-normal distribution as a natural law for geochemical phenomena is evaluated. Strictly speaking, to attain this goal, first an appropriate basis of the simplex, expressed in terms of log-ratios, has to be found (Egozcue et al. 2003). From it, using basic linear algebra, any vector expressed in terms of log-ratios can be obtained. Therefore, in practice, log-ratios of interest can be defined using known geochemical properties, using appropriate coefficients which depend on the number of parts in the numerator and in the denominator (Egozcue & Pawlowsky-Glahn 2005), and this will be the approach used here.
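A log-ratio with a coefficient depending on the number of parts in the numerator and in the denominator can be sketched as a balance between two groups of parts. The form assumed here is the usual one, sqrt(r s / (r + s)) times the logarithm of the ratio of the two geometric means, with r and s the numbers of parts in each group. The function names and the ion concentrations are hypothetical and are not taken from the Vulcano data set.

```python
import math

def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def balance(numerator, denominator):
    """Normalized log-ratio between two groups of parts.

    The coefficient sqrt(r*s/(r+s)) depends only on the number of
    parts r in the numerator and s in the denominator.
    """
    r, s = len(numerator), len(denominator)
    coeff = math.sqrt(r * s / (r + s))
    return coeff * math.log(geometric_mean(numerator) / geometric_mean(denominator))

# Hypothetical concentrations (mg/L): [Ca, Mg] against [Na, K].
b = balance([40.0, 12.0], [25.0, 3.0])

# With one part on each side the balance reduces to ln(a/b) / sqrt(2).
simple = balance([40.0], [25.0])
```

A balance depends only on ratios between parts, so it is unchanged when all concentrations are rescaled, for instance when moving from mg/L to another proportional unit, and it can take any real value, which is what allows models defined on the real line to be applied to it.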
Theoretical distributions and geochemical processes

Although natural processes and phenomena in geochemistry may combine many complex and poorly understood factors, their frequency distributions each appear to follow closely one of a few theoretical models. The theoretical frequency distribution provides a probability density function for predicting the probability of occurrence of certain events. For example, physical and geological observations that follow an apparently symmetrical frequency distribution have been compared to the normal distribution, also named Gaussian distribution after its originator Karl Friedrich Gauss (1777-1855). The normal distribution was introduced to model the pattern of non-systematic and additive errors of observation and measurement. It requires a physical property $X$ ranging theoretically from $-\infty$ to $+\infty$, whose realizations show a strong tendency to cluster around their mean, and are equally likely to undershoot or overshoot the mean. The resulting frequency distribution is symmetrical, with long tails corresponding to rare events, far from the mean. The dispersion about the mean is defined by the variance. Normal processes are obtained under some conditions: the main one is the summation of many continuous random variables, and a mild condition is the independence of these random variables. The normal distribution is a mathematical model of dispersion commonly applied in geochemistry, although many variables do not range from $-\infty$ to $+\infty$, e.g. percentages, ppm or other similar units. Therefore, in
constrained systems, the use of the normal distribution is not appropriate. Moreover, using the normal distribution, predictions outside the sample space can be obtained, leading in general to negative values or, less frequently, to values larger than the maximum possible value (e.g. 100% for observations). This fact not only invalidates the obviously wrong results, but also casts reasonable doubt on the correctness of the apparently good results, namely those which satisfy the constraints. The fact that the normal distribution does not respect the constrained character of the sample space when dealing with compositional data is not the only reason for this model to be inappropriate. In fact, geochemical distributions cannot be compared with the normal distribution because multiplicative sources of variability are the best way to interpret their behaviour. Also, in some cases, the presence of a positively skewed distribution is observed, showing a long tail to the right, towards high values of measurement. It indicates that observations with large values are not unusual. Frequently, this positive skewness disappears after applying the logarithmic transformation. In this situation the distribution of the physical property $X$ has frequently been modelled by the log-normal distribution, for which the sample space is the positive real line, $\mathbb{R}^+$. Typical examples are concentrations of accessory minerals and other rare components, such as trace elements, which may show a positively skewed frequency distribution with a mode at a low concentration. Compared with the conditions necessary to obtain the normal distribution, a log-normal process is one in which the random variable of interest results from the product of many independent random variables. This happens, for example, when the quantity present in each state is expressed as a random proportion of the quantity present in the immediately prior state.
If each successive proportion is independent of the previous one, and if many states occur between the initial state and the final one, then the final result can be expressed as a product of random variables and the variable of interest will show some similarity with a log-normal distribution. A general mechanism that can generate right-skewed concentration distributions in nature is explained under the 'Theory of Successive Random Dilutions'. It represents a special application of the 'Law of Proportionate Effects' originally proposed by Kapteyn (1903). It is considered to be especially appropriate for modelling substances released into environmental carrier media (air, water, soil) experiencing considerable physical movement and agitation. It considers a pollutant released at initially high concentrations into a carrier medium, which undergoes dilution in successive, independent stages. If a mixing stage
occurs between each independent dilution stage, the final concentration will be the product of the initial concentration and a series of independent dilution factors. When the number of successive random dilutions becomes large, the distribution of the final concentration has been approximated by a log-normal (Wayne 1990), although the constraint due to the volume is not taken into account in this model. Sometimes the normal model cannot fit the log-transformed data properly because of remaining skewness. To deal with it the skew-normal model can be used; it is given by a family of skewed distributions, including the normal one, defined on the whole real line, and characterized by an extra parameter which allows the density to have positive or negative skewness (Azzalini 1985). The notation $X \sim SN(\mu, \sigma, \lambda)$ is used. The parameter $\lambda$ controls the shape of the distribution; when $\lambda = 0$, $SN(\mu, \sigma, 0)$ is equal to the $N(\mu, \sigma)$, while when $\lambda \to \pm\infty$ a half-normal density is the result (Azzalini 1985). Consequently, besides the mean and the variance, the skew-normal distribution depends on a skewness index $\gamma$ that varies in the interval $(-0.995, +0.995)$, and when $\lambda$ tends to $\pm\infty$, $\gamma$ tends to $\pm 0.995$. Starting from this model, Mateu-Figueras (2003) investigated the properties of the logskew-normal distribution for a positive random variable $X$, with support $\mathbb{R}^+$, i.e. for $Y = \ln(X)$ following a $SN(\mu, \sigma, \lambda)$ distribution. The logskew-normal distribution is a generalization of the log-normal distribution and allows the density to incorporate positive or negative skewness. The use of the log-normal and the logskew-normal distribution as an approximate model for the concentration of the elements could be discussed. Nevertheless, in constrained systems the use of the log-normal or the logskew-normal distribution still has a major drawback, because they are defined on the whole positive real line.
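The role of the shape parameter and the bound on the skewness index can be checked numerically with a short sketch. It assumes the standard Azzalini form of the density, 2/σ φ(z) Φ(λz) with z = (x − μ)/σ, where μ and σ act as location and scale parameters, together with the closed-form skewness that follows from it. The function names are hypothetical.

```python
import math

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def skew_normal_pdf(x, mu, sigma, lam):
    """Density of SN(mu, sigma, lam); mu and sigma are location and scale."""
    z = (x - mu) / sigma
    return 2.0 / sigma * normal_pdf(z) * normal_cdf(lam * z)

def skewness_index(lam):
    """Skewness of the skew-normal as a function of the shape parameter."""
    delta = lam / math.sqrt(1.0 + lam * lam)
    m = delta * math.sqrt(2.0 / math.pi)
    return (4.0 - math.pi) / 2.0 * m ** 3 / (1.0 - m * m) ** 1.5

def log_skew_normal_pdf(x, mu, sigma, lam):
    """Density of X > 0 when ln(X) follows SN(mu, sigma, lam)."""
    return skew_normal_pdf(math.log(x), mu, sigma, lam) / x

# The skewness index grows with the shape parameter but stays bounded.
for lam in (0.0, 1.0, 5.0, 50.0):
    print(lam, round(skewness_index(lam), 4))
```

With a shape parameter of zero the density reduces to the normal one and the skewness index is zero, while for very large absolute values of the shape parameter the index approaches, but never exceeds, about 0.995 in absolute value, consistent with the interval quoted above.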
To avoid this problem, the proposal here is to work with models defined exclusively on the support space of the geochemical data, with a relative measure of variability, using the log-ratio approach (Aitchison 1986) and an appropriate representation. First, some single log-ratios, constructed from two components properly chosen from a geochemical point of view, will be investigated using the univariate normal and the univariate skew-normal models. To model a compositional vector X, multivariate models whose sample space is the simplex, and which use a relative measure of variability, have to be used. The logistic skew-normal distribution (Mateu-Figueras et al. 2005) on an appropriate basis is a reasonable choice. A D-part composition X = (X1, X2, ..., XD) is said to follow a logistic skew-normal distribution when its transform by an appropriate log-ratio transformation,
Y, follows a multivariate skew-normal distribution. It is denoted as Y ~ SN_{D-1}(μ, Σ, λ), where λ is a parameter that takes into account the presence of skewness (Azzalini & Dalla Valle 1996; Azzalini & Capitanio 1999). The logistic normal distribution (Aitchison 1982, 1986) is a particular case, obtained when Y follows a multivariate normal distribution or, equivalently, when λ is the null vector. Note that in the multivariate case bold notation will be used to indicate vectors and matrices.
Log-ratios and their univariate statistical modelling: results and discussion
Database and geochemistry of waters
For illustration, the log-ratio analysis is applied to water samples collected in the quiescent volcanic environment of Vulcano, an island belonging to the Aeolian Archipelago (Sicily, Southern Italy). The area, interpreted as a typical volcanic arc generated by subduction processes beneath the Tyrrhenian Sea, last erupted from 1888 to 1890. Since then, fumarolic activity of varying intensity has continued to the present day. Several years of geochemical investigations have produced some models to describe the evolution in time of the fluids (water and gases) related to the volcanic system (Martini 1980, 1989, 1996; Montalto 1996; Capasso et al. 1999, 2001; Di Liberto et al. 2002; Buccianti & Pawlowsky-Glahn 2005). These studies allowed the identification of at least two aquifers, a shallow one of meteoric origin and a deeper one influenced by thermal activity. From a hydrological point of view, the aquifers are probably not physically separated. A current hypothesis considers that the shallower, less saline aquifer floats over the more saline one of marine origin and is affected by the interaction with volcanic fluids of deeper origin. The differences observed in the wells are then attributable to lateral permeability variations, local alteration processes, and/or to the presence of areas of preferential upflow of volcanic fluids. From a general point of view, the aggressive character of the water, which is able to mobilize the elements in this environmental context, is due to the input of carbon dioxide from the deep rising gaseous flow into the aquifer. The presence of hydromagmatic deposits, which appear to have undergone early syn-depositional alteration processes, contributes elements from secondary minerals such as calcium sulphate, calcium fluoride, sodium chloride and others.
According to their solubility in aqueous solutions, chemical components can be leached away even when surface waters cause only weak alteration. Systematic studies have been in progress since 1977 by geochemists of the University of Florence. At present, 977 samples of groundwater, pertaining to 50 wells located in the northwest sector of the area surrounding the active crater, have been sampled and analysed at regular time intervals to look at the concentrations (ppm) of Ca²⁺, Na⁺, Mg²⁺, K⁺, HCO₃⁻, SO₄²⁻ and Cl⁻. Calcium is the most abundant of the alkaline-earth metals and is a major constituent of many common rock minerals. It can be derived from carbonate, gypsum, feldspar, pyroxene and amphibole; processes affecting its distribution are related to dissolution, while limits are due to the solubility of calcite. Consequently, the behaviour of calcium in natural aqueous systems is generally governed by the availability of the more soluble calcium-containing solids and by solution and gas phase equilibria that involve carbon dioxide species, as well as by the availability of sulphur in the form of sulphate. It also participates in cation-exchange equilibria of aluminosilicates and other mineral surfaces. The HCO₃⁻ species can be derived from carbonates and organic matter; processes affecting its concentration are in general due to soil-CO2 pressure, equilibria involving carbon dioxide species (particularly in a volcanic environment) and weathering, while limits are posed by organic matter decomposition. The SO₄²⁻ species is contributed by the atmosphere, gypsum and sulphides, but in a volcanic area the input of gaseous components (affecting the S cycle) from fumarolic activity is also important. Processes involved in its distribution are dissolution and oxidation, while concentration limits are attributable to removal by reduction. Sodium, the most abundant member of the alkali-metal group, tends to remain in solution once it has been liberated from silicate-mineral structures.
It derives from feldspar, rock-salts, zeolite and the atmosphere by dissolution and cation exchange, particularly in coastal aquifers, with concentration limits due to silicate weathering. There are no important precipitation reactions that can involve sodium, as carbonate precipitation controls calcium concentration. Potassium is slightly less common than sodium in igneous rocks, but it is more abundant in all the sedimentary rocks. It is liberated with greater difficulty from silicate minerals (feldspar, mica) and it is affected by processes of dissolution, adsorption (it exhibits a strong tendency to be reincorporated into solid weathering products such as clay minerals) and decomposition. Moreover, the element is involved in the biosphere processes, especially in vegetation and soil, and is essential for both plants and animals; limits on its concentration are due to solubility of clay minerals and vegetation uptake. Chloride, derived from rock-salts and atmosphere, is present in all natural waters, but mostly the concentration is low. Exceptions occur where the inflow of sea water can affect the chemical composition of groundwater, as in the case of Vulcano island. Chloride ions do not enter into oxidation or reduction reactions significantly, form no important solute complexes with other ions (unless the concentration is extremely high), do not form salts of low solubility and are not adsorbed significantly on mineral surfaces. Furthermore, they play a minor role in biogeochemical processes. On the whole, the circulation of chloride ions in the hydrological cycle is largely through physical processes, dominated by a conservative behaviour. The alkaline-earth metal Mg²⁺ shows only one oxidation state of significance in water chemistry and is a common element, essential in plant and animal nutrition. Processes that are important as sources in water are related to dissolution, while limits are posed by the solubility of clay minerals. Generally, it can derive from dolomite, serpentine, pyroxene, amphibole, olivine and mica, but a source of Mg²⁺ at Vulcano island can also be ascribed to the influence of deep-seated aquifers of marine-like composition, which occasionally inflow into the overlying water bodies. Even if magnesium has a behaviour similar to calcium (i.e. property of hardness), its ions are smaller and can be accommodated in the space at the centre of six octahedrally co-ordinated water molecules. This behaviour increases the tendency to precipitate crystalline compounds. Magnesium occurs in significant amounts in most limestones and the dissolution of this material can bring magnesium into solution. However, the process is not readily reversible, and the precipitate that forms from a solution may be nearly pure calcite. As a consequence, magnesium concentration would tend to increase along the flow path of a groundwater undergoing such processes, achieving a rather high Mg/Ca ratio.
Suitability of normal and skew-normal models
As the data are constrained and have a relative scale, log-ratios are taken. A first phase involves the investigation of the shape of the frequency distributions of the log-ratios (1/√2) ln(Ca²⁺/HCO₃⁻), (1/√2) ln(Na⁺/K⁺), (1/√2) ln(Na⁺/Cl⁻), (1/√2) ln(Ca²⁺/Mg²⁺), (1/√2) ln(Mg²⁺/SO₄²⁻) and (1/√2) ln(Ca²⁺/SO₄²⁻), considering cations and/or anions derived from similar sources as, for example, carbonates or sulphates, or showing a similar geochemical behaviour in natural processes. The coefficient (1/√2) is used to preserve the same scale as in the simplex, the sample space of the full composition. Note that the above log-ratios are not orthogonal in the geometry of the simplex, and
therefore they are not appropriate for a standard multivariate analysis that combines them (Egozcue & Pawlowsky-Glahn 2005). To investigate the shape of the frequency distributions of the mentioned log-ratios, whose terms have been chosen from a geochemical point of view (like anions and/or cations with the same charge, or species potentially derived from the same source), the normal and the skew-normal distributions are used. The parameters of the skew-normal distributions have been estimated using the maximum likelihood method, working with the routines of the Matlab version of the software library available at http://azzalini.stat.unipd.it/SN/index.html. Note that an underlying hypothesis of this approach is random sampling from a single population. Nevertheless, a common practice in space-time monitoring of volcanic systems is to set up plans of investigation in which the same variable is measured on each experimental unit on a number of different occasions, and observations that are made at different times on the same experimental unit will frequently show some correlation. In general, observations made close together in time will be more highly correlated than observations taken far apart in time. Furthermore, the presence of outliers and/or groups of samples might be interpreted as a mixture of populations. All these aspects would require a complex model to study the frequency distributions of the above-mentioned log-ratios.
The hypothesis of random sampling made in the present approach is based on the fact that the sampling time interval of the available data is about six months, that the analysis of the behaviour in time of the log-ratios considered did not show any evidence of time-dependence, and that the presence of outliers and/or groups in the sample is attributable to the already mentioned influence of deep-seated aquifers of marine-like composition, which occasionally inflow into the overlying water bodies, to lateral permeability variations, and/or to local alteration processes. To model the log-ratios (1/√2) ln(Ca²⁺/HCO₃⁻), (1/√2) ln(Na⁺/K⁺), (1/√2) ln(Na⁺/Cl⁻), (1/√2) ln(Ca²⁺/Mg²⁺), (1/√2) ln(Mg²⁺/SO₄²⁻) and (1/√2) ln(Ca²⁺/SO₄²⁻), a normal density has been used. This is equivalent to modelling the corresponding ratios using a log-normal model. The Kolmogorov-Smirnov goodness-of-fit test for normality only allows one to accept the Gaussian model for (1/√2) ln(Ca²⁺/HCO₃⁻) (μ = -0.74, σ = 0.72) and (1/√2) ln(Na⁺/K⁺) (μ = 0.67, σ = 0.31), with a p-value > 0.03. Their probability plots are reported in Figures 1 and 2. If (1/√2) ln(Ca²⁺/HCO₃⁻) and (1/√2) ln(Na⁺/K⁺) are well represented by the normal model, then the ratios (Ca²⁺/HCO₃⁻)^(1/√2) and (Na⁺/K⁺)^(1/√2) follow the log-normal one. The
Fig. 1. Probability plot of (1/√2) ln(Ca²⁺/HCO₃⁻). Continuous line represents a perfect normal distribution.
exponent affects only the scale of the ratios and is important to preserve consistency with the geometric properties of the full composition. Nevertheless, for interpretation, the scale can be changed, i.e. the exponent can be omitted, and the ratios
Ca²⁺/HCO₃⁻ and Na⁺/K⁺ will also follow a log-normal distribution. From a geochemical point of view, these ratios appear to be the product of many independent random processes, so that the quantity present in
Fig. 2. Probability plot of (1/√2) ln(Na⁺/K⁺). Continuous line represents a perfect normal distribution.
each state can be expressed as a random proportion of the quantity present in the immediately prior state. The chemical species involved in the ratios would have experienced considerable physical movement and agitation, as well as dilution in successive, independent stages. The resulting log-normal distributions would thus represent dominant and general phenomena affecting the investigated waters, where weathering of silicates and carbonates is important. The input of carbon dioxide from the deep rising gaseous flow is, in fact, able to give an aggressive character to the water. It produces a weak rock weathering, with the consequent presence in water of Na⁺, K⁺, Ca²⁺ and HCO₃⁻. Physical movement and agitation, as well as dilution in successive, independent stages, can lead to the log-normal distribution. The same simple modelling path cannot be accepted for the log-ratios (1/√2) ln(Na⁺/Cl⁻), (1/√2) ln(Ca²⁺/Mg²⁺), (1/√2) ln(Mg²⁺/SO₄²⁻) and (1/√2) ln(Ca²⁺/SO₄²⁻) (all p-values of the Kolmogorov-Smirnov test are less than 0.01). Their probability plots, with the reference line of the Gaussian model, are reported in Figures 3, 4, 5 and 6. As the log-ratios display a moderate skewness, the performance of the skew-normal model or, equivalently, of the logskew-normal model for the corresponding ratios, is explored. In Figures 7, 8, 9 and 10 the histograms and the estimated skew-normal curves obtained using
the maximum likelihood estimation procedure (Azzalini 1985) are reported. As can be seen, the skew-normal model appears to capture the skewness affecting the log-ratios; the log-likelihood function (a value similar to the sum of squared error in regression analysis) shows, in fact, a better value than that of the normal model for each log-ratio. Furthermore, the value of the likelihood ratio test statistic used to compare both models (i.e. the null hypothesis that the shape parameter λ is zero) always leads to a p-value < 0.01. Thus, the conclusion is that the skew-normal model is significantly better than the normal one in all cases. As can be seen in Table 1, the shape parameter λ corresponds to a skewness ranging from γ = -0.48 to 0.58. Consequently, omitting the exponent for interpretational purposes, low values can be found with higher frequency for the ratios Na⁺/Cl⁻ and Ca²⁺/Mg²⁺, and high values for Mg²⁺/SO₄²⁻ and Ca²⁺/SO₄²⁻, when a comparison with the log-normal model is performed. However, the application of some goodness-of-fit tests for the skew-normal model (Kolmogorov-Smirnov, Kuiper, Anderson-Darling, Cramér-von Mises and Watson; Mateu-Figueras 2003) indicates that only for the log-ratio (1/√2) ln(Na⁺/Cl⁻) can the skew-normal model be considered statistically acceptable, taking a significance level of 0.01 (p-value > 0.01). In the other cases, the bimodality
Fig. 3. Probability plot of (1/√2) ln(Na⁺/Cl⁻). Continuous line represents a perfect normal distribution.
Fig. 4. Probability plot of (1/√2) ln(Ca²⁺/Mg²⁺). Continuous line represents a perfect normal distribution.
affecting the data indicates a more complex structure that cannot be explained considering only a moderate skewness. The adoption of the skew-normal model for (1/√2) ln(Na⁺/Cl⁻) implies that Na⁺/Cl⁻
(omitting again the exponent) follows the logskew-normal one, so that a sort of mechanism able to generate a further skewness (compared with the log-normal) is present. This result indicates that the samples have not experienced dilution as
Fig. 5. Probability plot of (1/√2) ln(Mg²⁺/SO₄²⁻). Continuous line represents a perfect normal distribution.
Fig. 6. Probability plot of (1/√2) ln(Ca²⁺/SO₄²⁻). Continuous line represents a perfect normal distribution.
the dominant process, but that the influence of marine water (Na⁺/Cl⁻ ≈ 0.55) is not able to generate clearly different groups of observations. In the case of the log-ratios (1/√2) ln(Ca²⁺/Mg²⁺), (1/√2) ln(Mg²⁺/SO₄²⁻) and (1/√2) ln(Ca²⁺/SO₄²⁻), the dissolution of minerals typical of weathered hydromagmatic deposits not homogeneously present in the area (i.e. sulphates) might be the mechanism able to
Fig. 7. Histogram of (1/√2) ln(Na⁺/Cl⁻) values and fitted skew-normal densities.
Fig. 8. Histogram of (1/√2) ln(Ca²⁺/Mg²⁺) values and fitted skew-normal densities.
increase the ratio of Ca²⁺/SO₄²⁻ in part of the waters. Here, taking into account a moderate skewness is insufficient to describe reality, and the presence of groups of data appears to be the dominant feature. Summarizing, univariate frequency distributions in water geochemistry of Vulcano island can be characterized by three different models: log-normal, logskew-normal and multimodal. The goodness of fit of the skew-normal (logskew-normal) model compared with the normal (log-normal) one depends on the persistence and continuity of natural processes affecting a part of the population so that a moderate skewness is
Fig. 9. Histogram of (1/√2) ln(Mg²⁺/SO₄²⁻) values and fitted skew-normal densities.
Fig. 10. Histogram of (1/√2) ln(Ca²⁺/SO₄²⁻) values and fitted skew-normal densities.
generated. However, when the geochemical phenomena recurrently affect only a part of the population, groups tend to develop and a more complicated analysis is needed. The geochemical-statistical approach presented can be used to verify how far successive independent stages of dilution are able to describe the behaviour of the data, and how other processes are able to signal their presence and persistence in time and/or space until groups of cases are generated.
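The successive random dilution mechanism invoked in this section can be illustrated with a small simulation. The sketch below is only a toy under stated assumptions: the initial concentration, the number of stages and the uniform range of the dilution factors are hypothetical choices, not values from the Vulcano data. It compares the sample skewness of the final concentrations with that of their logarithms, which is what the log-normal argument predicts should be close to zero.

```python
import math
import random
import statistics

def sample_skewness(values):
    """Moment-based sample skewness (third standardized moment)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def diluted_concentration(c0, n_stages, rng):
    """Final concentration after n_stages independent random dilutions."""
    c = c0
    for _ in range(n_stages):
        c *= rng.uniform(0.3, 1.0)   # hypothetical dilution factor for one stage
    return c

rng = random.Random(42)              # fixed seed so the run is reproducible
final = [diluted_concentration(1000.0, 30, rng) for _ in range(5000)]
logs = [math.log(c) for c in final]

# The product of many independent factors is strongly right-skewed,
# while its logarithm (a sum of independent terms) is close to symmetric.
print("skewness of concentrations:    ", round(sample_skewness(final), 2))
print("skewness of log-concentrations:", round(sample_skewness(logs), 2))
```

A persistent additional process acting on part of the samples would show up in such an experiment as residual skewness, or eventually as a second mode, in the log-transformed values.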
Table 1. Value of the estimated shape parameter, λ, and the corresponding estimated skewness index, γ, of each log-ratio
Log-ratio                  λ        γ
(1/√2) ln(Na⁺/Cl⁻)         2.97     0.55
(1/√2) ln(Ca²⁺/Mg²⁺)       2.52     0.58
(1/√2) ln(Mg²⁺/SO₄²⁻)     -2.11    -0.48
(1/√2) ln(Ca²⁺/SO₄²⁻)     -1.74    -0.38

Log-ratios and their multivariate statistical modelling in water chemistry: results and discussion
In the previous part, log-ratios were analysed using a geochemical-statistical procedure to investigate their behaviour. The terms of the log-ratios were chosen following geochemical criteria.
However, water chemistry can also be considered as a whole, and to do so the shape of the frequency distribution of the composition X = (Na⁺, K⁺, Ca²⁺, Mg²⁺, HCO₃⁻, SO₄²⁻, Cl⁻) should be considered. In this case, one can verify whether X follows a logistic normal distribution or a logistic skew-normal one. The importance of processes able to introduce skewness in single log-ratios is considered when all the members of the composition are analysed together. The adequacy of the models was investigated by applying the isometric log-ratio transformation, ilr(X) = (ilr1, ilr2, ilr3, ilr4, ilr5, ilr6), given by Egozcue et al. (2003):
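\[
\begin{aligned}
\mathrm{ilr}_1 &= \frac{1}{\sqrt{2}}\,\ln\!\left(\frac{\mathrm{Na^{+}}}{\mathrm{K^{+}}}\right), \\
\mathrm{ilr}_2 &= \frac{1}{\sqrt{6}}\,\ln\!\left(\frac{\mathrm{Na^{+}\,K^{+}}}{(\mathrm{SO_4^{2-}})^{2}}\right), \\
\mathrm{ilr}_3 &= \frac{1}{\sqrt{12}}\,\ln\!\left(\frac{\mathrm{Na^{+}\,K^{+}\,SO_4^{2-}}}{(\mathrm{Cl^{-}})^{3}}\right), \\
\mathrm{ilr}_4 &= \frac{1}{\sqrt{20}}\,\ln\!\left(\frac{\mathrm{Na^{+}\,K^{+}\,SO_4^{2-}\,Cl^{-}}}{(\mathrm{HCO_3^{-}})^{4}}\right), \\
\mathrm{ilr}_5 &= \frac{1}{\sqrt{30}}\,\ln\!\left(\frac{\mathrm{Na^{+}\,K^{+}\,SO_4^{2-}\,Cl^{-}\,HCO_3^{-}}}{(\mathrm{Ca^{2+}})^{5}}\right), \\
\mathrm{ilr}_6 &= \frac{1}{\sqrt{42}}\,\ln\!\left(\frac{\mathrm{Na^{+}\,K^{+}\,SO_4^{2-}\,Cl^{-}\,HCO_3^{-}\,Ca^{2+}}}{(\mathrm{Mg^{2+}})^{6}}\right).
\end{aligned}
\tag{1}
\]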
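The coordinates of equation (1) can be computed with a few lines of Python. This is a minimal sketch, assuming the part order Na⁺, K⁺, SO₄²⁻, Cl⁻, HCO₃⁻, Ca²⁺, Mg²⁺ and the normalizing constant 1/√(i(i+1)) for the i-th coordinate, which is the usual choice for this kind of sequential orthonormal basis; the function name, the dictionary keys and the sample values are hypothetical and are not taken from the Vulcano data set.

```python
import math

# Part order assumed for equation (1); the labels are illustrative only.
PARTS = ("Na", "K", "SO4", "Cl", "HCO3", "Ca", "Mg")

def ilr(sample):
    """Return (ilr1, ..., ilr6) for a dict of strictly positive concentrations."""
    logs = [math.log(sample[p]) for p in PARTS]
    coords = []
    for i in range(1, len(PARTS)):
        # ln( x_1 * ... * x_i / x_{i+1}^i ), scaled by 1 / sqrt(i * (i + 1))
        log_ratio = sum(logs[:i]) - i * logs[i]
        coords.append(log_ratio / math.sqrt(i * (i + 1)))
    return coords

# Hypothetical water sample in ppm.
water = {"Na": 400.0, "K": 60.0, "SO4": 900.0, "Cl": 500.0,
         "HCO3": 350.0, "Ca": 250.0, "Mg": 80.0}
print([round(v, 3) for v in ilr(water)])
```

Because each coordinate is a log-ratio whose numerator and denominator have the same total degree, multiplying all concentrations by a constant (for example a change of units) leaves the result unchanged.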
Then, a multivariate normal model and a multivariate skew-normal model are fitted to the transformed samples. The maximum likelihood method is used to obtain estimates of the parameters. The two fitted models are compared by applying the likelihood ratio test (i.e. the null hypothesis that all the components of the shape parameter λ are 0). As the p-value is < 0.01 (the value of the associated statistic is approximately 233), the conclusion is that the multivariate skew-normal model is significantly better than the multivariate normal one. Thus, the composition of Vulcano water X is described better by the logistic skew-normal model than by the logistic normal one. But it is necessary to validate the logistic skew-normal model for the composition X or, equivalently, the skew-normal model for the ilr-transformed vector. The univariate skew-normality of each component is a necessary but not sufficient condition for the skew-normality of the whole vector. Application of goodness-of-fit tests on the univariate distributions (Watson, Anderson-Darling, Cramér-von Mises, Kolmogorov-Smirnov), as reported in Mateu-Figueras (2003), indicates that, except for the first log-ratio, the other single log-ratios cannot be statistically well described by a
Table 2. Value of the estimated shape parameter, λ, and the corresponding estimated skewness index, γ, of each log-ratio
Component    λ        γ
ilr1         1.94     0.44
ilr2        -2.96    -0.66
ilr3         1.72     0.37
ilr4        -1.17    -0.19
ilr5        -0.58    -0.04
ilr6         3.39     0.72
univariate skew-normal model (p-value < 0.01). As can be seen in Table 2, the log-ratios show a λ value ranging from -2.96 to 3.39, corresponding to skewness indices ranging from γ = -0.66 to 0.72. However, investigations on the univariate frequency distributions reveal a strong presence of multimodality, as well as of outliers (Figs 11 and 12), as expected from the univariate analysis presented in the previous section. The multivariate skew-normal model can also be validated considering the Mahalanobis distances d_i = (y_i - μ̂)' Σ̂⁻¹ (y_i - μ̂), where y_i represents each
Fig. 11. Histograms of the investigated log-ratios.
Fig. 12. Box-plots of the investigated log-ratios.
observation of the vector Y (i = 1, ..., 977) and μ̂ and Σ̂ are the estimates of the parameters μ and Σ of a skew-normal model (Azzalini & Capitanio 1999). If the log-ratio vector Y has a multivariate skew-normal distribution, then the Mahalanobis distances are sampled from a χ² distribution. However, application of graphical tests (P-P plot and Q-Q plot of the Mahalanobis distances versus χ² values, shown in Fig. 13) and numerical tests (based on the Anderson-Darling, Cramér-von Mises and Kolmogorov-Smirnov statistics) to validate the χ² distribution of the values d_i allows one to conclude that the multivariate skew-normal model is not yet able to describe the composition statistically in an exhaustive way, and a more complex investigation of the presence of groups and anomalous values is needed. The water composition of some wells at Vulcano island thus appears to be affected by persistent phenomena (i.e. influence of rising gas and/or presence of secondary minerals) able to generate isolated groups of observations with homogeneous behaviour.
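The distance-based check described above can be sketched in a simplified setting. The toy example below uses two coordinates instead of six and simulated data from the special case λ = 0, where the location and scale estimates reduce to the sample mean and covariance; for a genuinely skew-normal fit the estimates of the skew-normal model would be used instead. It assumes that the reference χ² distribution has as many degrees of freedom as there are coordinates, and all function names are hypothetical.

```python
import math
import random

def mahalanobis_sq(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    m0 = sum(p[0] for p in points) / n
    m1 = sum(p[1] for p in points) / n
    # Sample covariance matrix [[s00, s01], [s01, s11]].
    s00 = sum((p[0] - m0) ** 2 for p in points) / (n - 1)
    s11 = sum((p[1] - m1) ** 2 for p in points) / (n - 1)
    s01 = sum((p[0] - m0) * (p[1] - m1) for p in points) / (n - 1)
    det = s00 * s11 - s01 * s01
    dists = []
    for x, y in points:
        u, v = x - m0, y - m1
        # Quadratic form (u, v) * inverse covariance * (u, v)^T for a 2x2 matrix.
        dists.append((s11 * u * u - 2.0 * s01 * u * v + s00 * v * v) / det)
    return dists

def chi2_cdf_2df(x):
    """Chi-square CDF with 2 degrees of freedom (closed form)."""
    return 1.0 - math.exp(-0.5 * x)

def max_pp_deviation(dists):
    """Largest gap between empirical and chi-square probabilities (P-P plot)."""
    ordered = sorted(dists)
    n = len(ordered)
    return max(abs((i + 0.5) / n - chi2_cdf_2df(d)) for i, d in enumerate(ordered))

rng = random.Random(7)               # fixed seed for reproducibility
sample = []
for _ in range(2000):
    a, b = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    sample.append((a, 0.6 * a + 0.8 * b))   # correlated toy coordinates

dists = mahalanobis_sq(sample)
print("maximum P-P deviation:", round(max_pp_deviation(dists), 3))
```

A small deviation indicates that the points of the P-P plot lie close to the diagonal; groups of observations or anomalous values of the kind discussed in the text would produce systematic departures from it.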
Conclusions
In any aquifer, chemical relationships between the different species are affected by the development of acid-base and redox reactions, solution-precipitation processes and adsorption phenomena. Which process dominates at any time depends on the mineralogy of the aquifer, the hydrogeological environment and the history of the groundwater
movement (i.e. residence time). In the investigated volcanic environment the mobilization of chemical species as the result of fumarolic activity and of the alteration of volcanic products directly affects the unconfined aquifer feeding the wells of Vulcano. Several investigations in this area indicate that the weak acidity of the circulating solutions, able to leach chemical species, is attributable to the volcanic CO2 provided at a discontinuous rate to the shallow-water body. Consequently, the mobilization depends on the CO2 input, on the rate of neutralization by rock weathering and on the intensity of rainfall acting as a diluting factor. In this general framework, if solutions circulate in volcanic products involved in syn-depositional changes, significant quantities of calcium sulphate, sodium chloride, fluorine and other trace elements can be provided to the groundwater and no natural mechanism can remove them to a significant extent. The only limiting factor is the saturation of the solution with respect to the mineral. Further occasional contributions have to be ascribed to marine-like solutions that inflow into the overlying water bodies. In this situation, investigation of the shape of the frequency distribution of single log-ratios for the Vulcano waters (properly chosen from a geochemical point of view) has been used to verify: (1) whether one general process, dominated by dilution, is able to describe the behaviour of the data; or (2) whether further processes are overlapping dilution, so that a moderate negative or positive skewness is present; or (3) whether these processes are important enough to generate a
Fig. 13. P-P plot and Q-Q plot for multivariate skew-normal distribution.
multimodal complex distribution. The analysis can be extended to the multivariate case, choosing an appropriate transformation. It has indicated that the multivariate distribution is described better by the skew-normal model than by the normal one. Thus, the processes affecting the composition would tend to follow a logskew-normal law, and dilution is not the only present and dominant phenomenon. However, the multivariate skew-normal model is not yet completely adequate to describe the simultaneous relationships among the variables. The presence of bimodality in some marginals, as well as of anomalous values, may be the underlying reason. Summarizing the results, in water chemistry univariate frequency distributions of log-ratios can show three fundamental features: log-normality, logskew-normality, and overlapping processes leading to bimodality. In this context, the skew-normal distribution family appears to have an important intermediate role in a better description of natural phenomena where dilution is not the
only phenomenon, but the persistence and continuity of other processes has not yet clearly generated different groups of observations. When marginal distributions of all the three previous types are considered together in a multivariate framework, the reciprocal relationships are complex. In such a case, the multivariate skew-normal model may be better than the normal one, but it is not yet adequate to describe the whole composition. Other statistical procedures are required to identify samples pertaining to different potential groups.
This research has been financially supported by Italian MIUR (Ministero dell'Istruzione, dell'Università e della Ricerca Scientifica e Tecnologica), PRIN 2004, through the GEOBASI project (prot. 2004048813002) and by the Dirección General de Enseñanza Superior (DGES) of the Spanish Ministry for Education and Culture through the project BFM2003-05640.
References
AHRENS, L. H. 1953. A fundamental law of geochemistry. Nature, 172, 1148.
AHRENS, L. H. 1954a. The lognormal distribution of the elements (a fundamental law of geochemistry and its subsidiary). Geochimica et Cosmochimica Acta, 6, 49-74.
AHRENS, L. H. 1954b. The lognormal distribution of the elements, ii. Geochimica et Cosmochimica Acta, 6, 121-132.
AHRENS, L. H. 1957. Lognormal type distribution, iii. Geochimica et Cosmochimica Acta, 11, 205-213.
AITCHISON, J. 1982. The statistical analysis of compositional data (with discussion). Journal of the Royal Statistical Society, Series B, 44 (2), 139-177.
AITCHISON, J. 1986. The Statistical Analysis of Compositional Data. Chapman & Hall, London.
ALLÈGRE, C. J. & LEWIN, E. 1995. Scaling laws and geochemical distributions. Earth and Planetary Science Letters, 132, 1-13.
AUBREY, K. V. 1954. Frequency distribution of the concentrations of the elements in rocks. Nature, 174, 141-142.
AUBREY, K. V. 1956. Frequency distributions of elements in igneous rocks. Geochimica et Cosmochimica Acta, 9, 83-90.
AZZALINI, A. 1985. A class of distribution which includes the normal ones. Scandinavian Journal of Statistics, 12, 171-178.
AZZALINI, A. & CAPITANIO, A. 1999. Statistical applications of the multivariate skew-normal distribution. Journal of the Royal Statistical Society, Series B, 61 (3), 579-602.
AZZALINI, A. & DALLA VALLE, A. 1996. The multivariate skew-normal distribution. Biometrika, 83 (4), 715-726.
BUCCIANTI, A. & PAWLOWSKY-GLAHN, V. 2005. New perspectives on water chemistry and compositional data analysis. Mathematical Geology, 37 (7), 703-727.
CAPASSO, G., FAVARA, R., FRANCOFONTE, S. & INGUAGGIATO, S. 1999. Chemical and isotopic variations in fumarolic discharge and thermal waters at Vulcano island (Aeolian islands, Italy) during 1996: evidence of resumed volcanic activity. Journal of Volcanology and Geothermal Research, 88, 167-175.
CAPASSO, G., D'ALESSANDRO, W., FAVARA, R., INGUAGGIATO, S. & PARELLO, F. 2001. Interaction between the deep fluids and the shallow groundwaters on Vulcano island (Italy). Journal of Volcanology and Geothermal Research, 108, 187-198.
CHAYES, F. 1954. The lognormal distribution of elements: a discussion. Geochimica et Cosmochimica Acta, 6, 119-121.
DAVIES, S. N. & DEWIEST, R. C. M. 1996. Hydrogeology. Wiley and Sons, New York.
DI LIBERTO, V., NUCCIO, P. M. & PAONITA, A. 2002. Genesis of chlorine and sulphur in fumarolic emissions at Vulcano island (Italy): assessment of pH and redox conditions in the hydrothermal system. Journal of Volcanology and Geothermal Research, 116, 137-150.
EGOZCUE, J. J. & PAWLOWSKY-GLAHN, V. 2005. Groups of parts and their balances in compositional data analysis. Mathematical Geology, 37 (7), 795-828.
EGOZCUE, J. J., PAWLOWSKY-GLAHN, V., MATEU-FIGUERAS, G. & BARCELÓ-VIDAL, C. 2003. Isometric logratio transformations for compositional data analysis. Mathematical Geology, 35 (3), 279-300.
KAPTEYN, J. C. 1903. Skew Frequency Curves in Biology and Statistics. Astronomical Laboratory, Groningen, Noordhoff.
MARTINI, M. 1980. Geochemical survey on the phreatic waters of Vulcano (Aeolian Islands, Italy). Bulletin of Volcanology, 43 (1), 265-274.
MARTINI, M. 1989. The forecasting significance of chemical indicators in areas of quiescent volcanism: examples from Vulcano and Phlegrean Fields (Italy). In: LATTER, J. H. (ed.) Volcanic Hazard, IAVCEI Proceedings in Volcanology 1. Springer-Verlag, Berlin Heidelberg, Germany, 372-383.
MARTINI, M. 1996. Chemical characters of the gaseous phase in different stages of volcanism: precursors and volcanic activity. In: SCARPA, R. & TILLING, R. I. (eds) Monitoring and Mitigation of Volcanic Hazard. Springer-Verlag, Berlin Heidelberg, Germany, 200-219.
MATEU-FIGUERAS, G. 2003. Models de distribució sobre el símplex. PhD thesis, Universitat Politècnica de Catalunya, Barcelona, Spain.
MATEU-FIGUERAS, G., PAWLOWSKY-GLAHN, V. & BARCELÓ-VIDAL, C. 2005. The additive logistic skew-normal distribution on the simplex. Stochastic Environmental Research and Risk Assessment (SERRA), 19 (3), 205-214.
MCGRATH, S. P. & LOVELAND, P. J. 1992. The Soil Geochemical Atlas of England and Wales. Blackie Academic, London.
MILLER, R. L. & GOLDBERG, E. D. 1955. The normal distribution in geochemistry. Geochimica et Cosmochimica Acta, 8, 53-62.
MONTALTO, A. 1996. Signs of potential renewal of eruptive activity at La Fossa (Vulcano, Aeolian Islands). Bulletin of Volcanology, 57, 483-492.
REIMANN, C. & FILZMOSER, P. 1999. Normal and lognormal data distribution in geochemistry: death of a myth. Consequences for the statistical treatment of geochemical and environmental data. Environmental Geology, 39 (9), 1001-1014.
VISTELIUS, A. B. 1960. The skew frequency distributions and the fundamental law of the geochemical processes. Journal of Geology, 68, 1-22.
WAYNE, R. O. 1990. A physical explanation of the lognormality of pollutant concentrations. Journal of the Air & Waste Management Association, 40 (10), 1378-1383.
Rounded zeros: some practical aspects for compositional data
J. A. MARTÍN-FERNÁNDEZ & S. THIÓ-HENESTROSA
Departament Informàtica i Matemàtica Aplicada, Universitat de Girona, Campus Montilivi, Edifici P-IV, E-17071, Girona, Spain (e-mail: josepantoni.martin@udg.es)
Abstract: It is very important to realize that the well-known 'zeros problem' in compositional data is inherent in the nature of the data rather than in the log-ratio methodology. In a strict sense, any null value is informative itself and needs specific treatment before a multivariate method is applied. In the rounded zero case specific techniques of missing data should be applied as a previous step, taking into account that any of these techniques must respect the compositional nature of the data. In practice, when an imputation method is applied, then it is necessary to make a sensitivity analysis of the results from multivariate analysis. These methodological aspects are applied and illustrated using compositional data from geological samples.
Martín-Fernández et al. (2004a) dealt with the zeros in the database of Cenozoic volcanic rocks of Hungary (Ó.Kovács & Kovács 2001). In that study the authors were interested in log-ratio analysis (Aitchison 1986) of subcompositional patterns in order to contribute to the understanding of the petrogenetic processes (Martín-Fernández et al. 2004b) that occurred in the Carpatho-Pannonian region. The dataset consists of 959 unaltered rock samples and nine major oxides from that database: [SiO2; TiO2; Al2O3; Fe2O3total; MgO; CaO; Na2O; K2O; P2O5]. Since some of these observations have null values, the authors concluded that they faced the well-known 'zeros problem'. Nobody disagrees with this assertion. Nevertheless, one should think about the reasons for this 'logical' conclusion. The first temptation is to reason as follows: since multivariate statistical methods based on log-ratio methodology are applied, one cannot work with null values and thus zeros are a problem; therefore it is preferable to apply classical multivariate methods based on Euclidean distance, because this methodology does not have the zeros problem. Certainly, if some sample has null values neither ratios nor logarithms can be formed. Nevertheless, that kind of reasoning is too simple and incomplete because it does not take into account the nature of the null value.
The first question that one must answer in relation to a null value is about its nature: one must decide whether the zero value is a true value or not. If it is considered a true value then it is informative by itself. In that case the null value means the absolute absence of the part in the observation, i.e. the null value is an essential or structural zero. On the other hand, if the null value indicates the presence of a component, but below the detection limit, then this zero represents a missing small value, i.e. null values are rounded zeros. Since the nature of the two kinds of zeros is different the treatment should be different. Note that this different treatment is a consequence of the nature of the zero rather than of the statistical methodology (Euclidean, log-ratio, ...).
Two kinds of zeros, two different treatments
In the structural zeros case, two initial questions must be answered: (1) is the number of parts too large for the goals of the study?; (2) is the presence in a part of an essential zero an indication that the composition belongs to a different group or population? An affirmative answer to the first question is related to the sampling or measuring step and suggests amalgamating some related parts (Aitchison 1986, p. 36). This amalgamation procedure reduces the dimensionality and probably the number of null values. The answer to the second question is related to the true information of the null value. For example, an affirmative answer could indicate that it is sensible to divide the sample, and then a statistical analysis of any kind would be applied to each sub-sample separately. After both questions have been resolved, and after the data have been closed, the statistical analysis can be applied. But now the question is: which one? Nowadays, in the context of compositional studies, most scientists apply either Euclidean or log-ratio statistical methods. Obviously, when a scientist chooses the Euclidean option there are no problems with the remaining structural zeros. Nevertheless, note that a scientist who selects this option is assuming, for example, that the difference or similarity between two three-compositions (in percentages) such as (0, 10, 90) and
From: BUCCIANTI,A., MATEU-FIGUERAS,G. & PAWLOWSKY-GLAHN,V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 191-201. 0305-8719/06/$15.00
© The Geological Society of London 2006.
(10, 0, 90) is exactly the same as the difference between (30, 40, 30) and (40, 30, 30):

$$
\begin{aligned}
d_e((0, 10, 90), (10, 0, 90)) &= \sqrt{(0 - 10)^2 + (10 - 0)^2 + (90 - 90)^2} = 10\sqrt{2},\\
d_e((30, 40, 30), (40, 30, 30)) &= \sqrt{(30 - 40)^2 + (40 - 30)^2 + (30 - 30)^2} = 10\sqrt{2},
\end{aligned}
\tag{1}
$$

where d_e denotes the Euclidean distance. This assumption seems to be unsuitable when, for example, one is working with geological data such as percentages of {clay, silt, sand} in sediments. This weakness is serious because all the classical multivariate methods are based on Euclidean concepts. Indeed, all these multivariate methods contain in their formulation the classical variance/covariance matrix and, in a strict sense, this matrix is a Euclidean measure of variability. On the other hand, selection of the log-ratio methodology involves a weakness with the null values. Fortunately, recent works (Aitchison & Kay 2003; Bacon-Shone 2003) offer new strategies for the application of the log-ratio methodology combined with conditional models and subcompositional analysis.
When the null values denote that no quantifiable proportion could be recorded to the accuracy of the measurement process, this kind of zero is usually understood as 'a trace too small to measure'. Note that in reality these null values are missing values rather than zeros (Martín-Fernández et al. 2003a). Consequently, it seems reasonable to apply specific techniques for the statistical analysis of incomplete multivariate data (Little & Rubin 2002). By analogy with the structural zeros case, it is very important to emphasize that the decision to apply specific techniques for missing values to deal with the rounded zeros is independent of the statistical methodology (Euclidean, log-ratio, ...) that one has selected. In other words, even when a scientist selects the Euclidean option the 'rounded zeros problem' appears because the missing values should be treated. Moreover, as in the structural zeros case, in the Euclidean option the scientist is assuming, for example, that the difference between two three-compositions such as (0.1, 10, 89.9) and (10, 0.1, 89.9) is exactly the same as the difference between (30.1, 40, 29.9) and (40, 30.1, 29.9).
Here, for example, the value 0.1 comes from some imputation procedure which has replaced the null value by 0.1. The calculations (1) can be repeated and, in this case, for both pairs the Euclidean distance is equal to 9.9√2. Observe that in both cases the third part does not contribute to the Euclidean distance between the compositions. For example, if one is working with geological data such as percentages of {clay, silt, sand} in sediments, the information that in one case the sample has 89.9% of sand and in the other case it has 29.9% is absent from the calculations. This unsuitable effect is clearer if the subcompositions in terms of clay and silt are formed. The first two three-compositions have respectively the {clay, silt} subcompositions (0.99, 99.01) and (99.01, 0.99). The other pair transforms respectively into (42.94, 57.06) and (57.06, 42.94). The difference between the first pair seems to be more important than the difference between the second pair. Figure 1 allows the difference between the first pair of compositions to be compared with the difference between the second pair. It seems reasonable to accept that the difference between the first pair must be greater than the difference between the second pair. In this sense, from the log-ratio methodology, the Aitchison distances between the two pairs are

$$
\begin{aligned}
d_a((0.1, 10, 89.9), (10, 0.1, 89.9)) &= d_e(\mathrm{clr}(0.1, 10, 89.9), \mathrm{clr}(10, 0.1, 89.9)) = 6.5127,\\
d_a((30.1, 40, 29.9), (40, 30.1, 29.9)) &= d_e(\mathrm{clr}(30.1, 40, 29.9), \mathrm{clr}(40, 30.1, 29.9)) = 0.4021.
\end{aligned}
\tag{2}
$$

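The distances in (1) and (2), and the {clay, silt} subcompositions quoted above, can be reproduced with a few lines of code. The following is a minimal sketch using only the Python standard library; the function names (`clr`, `euclidean_distance`, `aitchison_distance`, `subcomposition`) are illustrative choices made here, not part of any package referred to in the text.

```python
import math

def clr(x):
    """Centred log-ratio transform of a composition with strictly positive parts."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def euclidean_distance(x, y):
    """Ordinary Euclidean distance d_e between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def aitchison_distance(x, y):
    """Aitchison distance d_a: Euclidean distance between clr-transformed compositions."""
    return euclidean_distance(clr(x), clr(y))

def subcomposition(x, parts, total=100.0):
    """Select the given parts (by index) and re-close them to `total`."""
    selected = [x[i] for i in parts]
    s = sum(selected)
    return [total * v / s for v in selected]

# The two pairs of {clay, silt, sand} compositions discussed in the text.
first_pair = ((0.1, 10, 89.9), (10, 0.1, 89.9))
second_pair = ((30.1, 40, 29.9), (40, 30.1, 29.9))

for x, y in (first_pair, second_pair):
    print("compositions:", x, y)
    print("  Euclidean distance:", round(euclidean_distance(x, y), 4))
    print("  Aitchison distance:", round(aitchison_distance(x, y), 4))
    print("  {clay, silt} subcompositions:",
          [round(v, 2) for v in subcomposition(x, (0, 1))],
          [round(v, 2) for v in subcomposition(y, (0, 1))])
```

Both pairs give the same Euclidean distance, whereas the Aitchison distance separates them, which is the contrast the argument relies on.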
The Aitchison distance behaves better than the Euclidean distance here because it shows more similarity between the second pair of compositions than between the first. For more details about the definition and properties of this distance see Egozcue & Pawlowsky-Glahn (2006). The well-known 'zeros problem' in compositional data is inherent in the nature of the data. In either case (structural or rounded) a null value is informative and calls for a specific treatment before any multivariate method is applied. This paper focuses on the rounded zeros case since it is the most frequent case in geological studies.
Treatment of rounded zeros
The assumption that a rounded zero is a small missing value indicates that some specific technique to deal with these missing values must be applied. After the missing data procedure is applied one can then apply whichever log-ratio method is desired. Many missing data techniques have been suggested in the literature (Little & Rubin 2002); these can be classified into parametric and non-parametric techniques. Among parametric techniques to treat the inference problem in the presence of missing data in real space several methods have been developed,
[Figure 1 graphic not reproduced. Labelled compositions: (0.1, 10, 89.9), (10, 0.1, 89.9), (30.1, 40, 29.9), (40, 30.1, 29.9); axis: Percentages, 0 to 100; legend: Clay, Silt, Sand.]
Fig. 1. Four three-compositions of {clay, silt, sand}.
but the EM algorithm and its extensions, and the multiple imputation method, are the most used approaches. These techniques rely on fully parametric models for multivariate data, usually the normal distribution, and contain in their formulation the variance/covariance matrix. In view of the above arguments, unreasonable results are likely to be obtained when this classical methodology is applied to compositional data. Consequently, it seems reasonable to combine the parametric techniques for missing data with the log-ratio methodology. Buccianti & Rosso (1999), following the method proposed in Sandford et al. (1993), made a first approach to such a combination by examining, from an empirical point of view, the performance of the EM algorithm together with the log-ratio methodology. In Martín-Fernández et al. (2003b) the authors showed a first approach to combining the log-ratio methodology with multiple imputation techniques via Markov Chain Monte Carlo (MCMC) simulation algorithms. These research lines are unfinished and require more effort.
On the other hand, the group of non-parametric techniques for missing data in real space (Little & Rubin 2002) consists essentially of a family of imputation strategies: cold deck imputation, composite methods, hot deck imputation and mean imputation. The expression imputation is equivalent to a replacement strategy which completes the incomplete dataset in some way by inserting a quantity for each missing value. Sandford et al. (1993) indicated that if the missing values are reported as 'less than' a given threshold value for some variables, a replacement can be considered. After this, any multivariate method can be applied to the completed dataset. Nevertheless, all the papers and books related to imputation techniques recommend care in using a replacement strategy because the general structure of the data could be seriously distorted. In particular, the covariance structure and the metric properties of the dataset should be preserved in order to avoid further analysis on sub-populations being misleading. Note that the previous sentence provides the clue to replacement techniques for rounded zeros in compositional data. The specific nature of compositional data forces a decision in advance as to which kind of covariance structure and metric
properties one wishes to preserve. According to the existing possibilities, the choice is between preserving either the classical (Euclidean) covariance and metric or the covariance and metric induced by the log-ratio methodology. On the one hand, the above examples (1) and (2) show that the Euclidean distance may fail to produce reasonable results. On the other hand, compositional data are formed by continuous variables whose scale of measurement is a ratio scale (Jobson 1992, p. 7) and their main operations are perturbation and subcomposition. Consequently, at least for compositional data, the replacement strategies must be coherent with all these basic aspects. Following this point of view, Martín-Fernández et al. (2003a) analysed a multiplicative replacement and, following Sandford et al. (1993), suggested that when the proportion of these null values is not large (less than 10% of the values in the data matrix) a simple-replacement method which uses an imputation value equal to 65% of the threshold value can be used.
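The rule of thumb just quoted can be written down in a few lines. This is only a sketch: the helper names `zero_proportion` and `imputation_value`, the toy data matrix and the threshold of 0.01 are hypothetical and serve purely as illustration; only the 10% limit and the 65% factor come from the text.

```python
def zero_proportion(data):
    """Fraction of the entries of a data matrix (list of rows) recorded as zero."""
    values = [v for row in data for v in row]
    return sum(1 for v in values if v == 0) / len(values)

def imputation_value(threshold, fraction=0.65):
    """Imputation value as a fraction (65% by default) of the threshold."""
    return fraction * threshold

# Hypothetical toy data matrix in percentages; one rounded zero out of nine values.
toy_data = [
    [75.0, 0.0, 25.0],
    [60.0, 0.5, 39.5],
    [70.0, 1.0, 29.0],
]

proportion = zero_proportion(toy_data)
delta = imputation_value(threshold=0.01)  # hypothetical threshold of 0.01%

print("proportion of zeros:", round(proportion, 4))
print("simple replacement considered reasonable (< 10% zeros):", proportion < 0.10)
print("imputation value:", delta)
```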
Three non-parametric methods of imputation for rounded zeros
Replacements additive, simple and multiplicative
Suppose that a D-composition x = (x_1, x_2, ..., x_D) contains Z rounded zeros and a scientist wants to replace x by a new composition r = (r_1, r_2, ..., r_D) without zeros according to the above arguments. In the literature the scientist finds three different formulae:

$$
r_j =
\begin{cases}
\dfrac{\delta_j (Z + 1)(D - Z)}{D^2} & \text{if } x_j = 0,\\[2ex]
x_j - \dfrac{Z + 1}{D^2} \displaystyle\sum_{k \mid x_k = 0} \delta_k & \text{if } x_j > 0;
\end{cases}
\tag{3}
$$

$$
r_j =
\begin{cases}
\dfrac{c}{c + \sum_{k \mid x_k = 0} \delta_k}\, \delta_j & \text{if } x_j = 0,\\[2ex]
\dfrac{c}{c + \sum_{k \mid x_k = 0} \delta_k}\, x_j & \text{if } x_j > 0;
\end{cases}
\tag{4}
$$

$$
r_j =
\begin{cases}
\delta_j & \text{if } x_j = 0,\\[1ex]
\left(1 - \dfrac{\sum_{k \mid x_k = 0} \delta_k}{c}\right) x_j & \text{if } x_j > 0;
\end{cases}
\tag{5}
$$

where δ_j is a small value, less than the given threshold of part x_j, and c is the constant of the sum-constraint; for example, c = 100 where data are percentages. This constraint forces the replacement formulae to contain a modification of the non-zero values. In (3) this modification is additive (Aitchison 1986). Formula (3) is referred to as the additive replacement. The term simple replacement is assigned to formula (4) because its procedure is very simple: each rounded zero in the composition is replaced by an appropriate δ_j and the vector is then closed. In (5) the modification of the non-zero values is multiplicative. This kind of modification was suggested independently by Fry et al. (2000) and Martín-Fernández et al. (2000). In Martín-Fernández et al. (2003a) the multiplicative replacement (5) was analysed in depth. There, from a theoretical point of view, the authors compared its properties with those of the additive and simple replacements. This work focuses on showing, from a practical point of view, how, why and when the multiplicative replacement may provide better results than the others.
For this practical goal the dataset from the database of Cenozoic volcanic rocks of Hungary referred to above is used. Tables 1 and 2 summarize the pattern and the location of the rounded zeros in the dataset. Observe that three components have no zeros: SiO2, Al2O3 and K2O. In the remaining components, 101 compositions have at least one null value and the zeros are concentrated mainly in components P2O5, TiO2 and MgO. Note (Table 1) that the minimum measured value in these components is 0.01%. The proportion of null values is small (1.8%) because, out of the 959 x 9 values in the data matrix, 153 are zeros. Therefore, it seems reasonable to consider a simple-substitution strategy (Martín-Fernández et al. 2003a). In advance of the application of any of the three replacement formulae, the value of the threshold of each component must be selected. For these purposes four samples (Table 3) from the dataset were selected. These samples represent different possibilities in relation to the number of null values. For these samples a running number has been included (first column) in order to reference them in calculations and examples. The sample numbered as 1* in the last row is an artificial sample obtained from sample number 1 by forcing parts TiO2, MgO and P2O5 to take the value zero, and then closing this constructed sample to obtain again a sum equal to 100. The reason for recording all these values using four decimal digits is to illustrate the effect of each replacement formula more clearly.

Table 1. Descriptive measures for each component of the dataset

Component    Observations     Minimum observed   Zeros      Zeros
             without zeros    value (%)          (number)   (%)
SiO2         959              42.30               0         0.0
TiO2         914               0.01              45         4.7
Al2O3        959               8.21               0         0.0
Fe2O3_tot    958               0.26               1         0.1
MgO          935               0.01              24         2.5
CaO          957               0.07               2         0.2
Na2O         956               0.10               3         0.3
K2O          959               0.62               0         0.0
P2O5         881               0.01              78         8.1

Table 2. The pattern of null values in the dataset

Columns: Number of observations; pattern of zeros in SiO2, TiO2, Al2O3, Fe2O3_tot, MgO, CaO, Na2O, K2O, P2O5; Observations without zeros^a
Number of observations: 858, 39, 3, 12, 3, 8, 25, 5, 3, 1, 1, 1
Observations without zeros^a: 858, 897, 912, 870, 881, 866, 930, 953, 933, 931, 898, 914
[The row-by-row alignment of the two series and the positions of the '0' marks within the component columns are not recoverable from the source layout.]
Note: '0' symbolizes that the component contains a null value. ^a Number of observations without zeros if the corresponding variables with zero value are not considered.
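The three replacement formulae (3), (4) and (5) translate directly into code. The sketch below is one possible transcription in plain Python; the function names are chosen here for illustration, `delta` is a list holding one imputation value per part (only the entries of null parts are used), and the value 0.1 is an illustrative imputation value. The example applies the formulae to sample 2 of Table 3 and prints the ratio Al2O3/SiO2 to show the difference between an additive and a multiplicative modification of the non-zero values.

```python
def _zero_parts(x):
    """Indices of the parts recorded as zero."""
    return [j for j, value in enumerate(x) if value == 0]

def additive_replacement(x, delta):
    """Formula (3): zeros are imputed and non-zero parts are shifted additively."""
    D = len(x)
    zeros = _zero_parts(x)
    Z = len(zeros)
    shift = (Z + 1) * sum(delta[j] for j in zeros) / D ** 2
    return [delta[j] * (Z + 1) * (D - Z) / D ** 2 if x[j] == 0 else x[j] - shift
            for j in range(D)]

def simple_replacement(x, delta, c=100.0):
    """Formula (4): zeros are replaced by delta_j and the vector is closed to c."""
    zeros = _zero_parts(x)
    scale = c / (c + sum(delta[j] for j in zeros))
    return [scale * (delta[j] if x[j] == 0 else x[j]) for j in range(len(x))]

def multiplicative_replacement(x, delta, c=100.0):
    """Formula (5): zeros become delta_j, non-zero parts are rescaled multiplicatively."""
    zeros = _zero_parts(x)
    scale = 1 - sum(delta[j] for j in zeros) / c
    return [delta[j] if x[j] == 0 else scale * x[j] for j in range(len(x))]

# Sample 2 of Table 3 (SiO2, TiO2, Al2O3, Fe2O3_tot, MgO, CaO, Na2O, K2O, P2O5).
sample_2 = [76.2700, 0.0000, 14.1100, 1.1600, 0.0300, 0.9800, 3.3100, 4.1200, 0.0200]
delta = [0.1] * len(sample_2)  # illustrative imputation value for every part

print("initial Al2O3/SiO2 =", round(sample_2[2] / sample_2[0], 6))
for name, rule in (("additive", additive_replacement),
                   ("simple", simple_replacement),
                   ("multiplicative", multiplicative_replacement)):
    r = rule(sample_2, delta)
    print(name, [round(v, 4) for v in r], "Al2O3/SiO2 =", round(r[2] / r[0], 6))
```

All three functions return a composition that still sums to c; only the simple and multiplicative versions leave the ratios between non-zero parts unchanged.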
Restoration of the 'true' composition: natural replacement
It seems logical to expect that a replacement strategy will restore the 'true' sample when the null values are replaced by the 'true' values. Consider the sample numbered as 1* in Table 3. Suppose that a scientist decides that the appropriate δ_j for parts TiO2, MgO and P2O5 are equal to 0.14, 0.13 and 0.03 respectively, which are the true values of sample number 1 (Table 3). These δ_j values are used in the formulae (3), (4) and (5). Table 4 shows the result produced by the application of additive, simple and multiplicative replacement to this composition. Observe that the multiplicative replacement is the only one which restores exactly the true composition (number 1 in Table 3) by making this direct imputation.
It is clear from the structure of formulae (3) and (4) that they can also restore the true values of the parts recorded as zero, provided that the relationship between the specific value δ_j and the final imputed value is analysed. For example, consider the situation for the part TiO2 in the additive replacement case. If one wishes the replaced composition r to restore the true value 0.140 then the value of δ_2 must satisfy the following relationship:

$$
\frac{\delta_2 (3 + 1)(9 - 3)}{9^2} = 0.140.
\tag{6}
$$

From this relationship it is easy to calculate that δ_2 = 0.4724. Making this calculation for parts MgO and P2O5, the values δ_5 = 0.4386 and δ_9 = 0.1012 are obtained. In this way the final values imputed by the additive replacement (3) are the true ones. However, if the modification of the non-null values is calculated, the corresponding true values are not produced. This distortion happens because the non-null values in (3) are modified in an additive way using the arithmetic mean of the imputed values. For example, in the additive replacement (3) the SiO2 part is forced to take the value

$$
r_1 = 75.3033 - \frac{3 + 1}{9^2}\,(0.4724 + 0.4386 + 0.1012) = 75.2533,
\tag{7}
$$

which is different from the true value 75.0775 in sample number 1. Following the same example, in the simple replacement (4) for the part TiO2 the relationship considered is

$$
\frac{100}{100 + \delta_2 + \delta_5 + \delta_9}\, \delta_2 = 0.140.
\tag{8}
$$

It is obvious that in order to know the value of δ_2 it is necessary to solve this relationship simultaneously for the part TiO2 and the parts MgO and P2O5. At the end, when these δ_j values are used in (4), all true values of sample number 1, zeros and non-zeros, can be restored. In relation to the possibility of restoring the true values in one sample it is concluded that the multiplicative replacement (5) is more natural than the additive and simple replacements because it allows restoration of the true values in an easier and faster way.

Table 3. Arbitrarily selected samples from the dataset

No.   SiO2      TiO2     Al2O3     Fe2O3_tot   MgO      CaO      Na2O     K2O      P2O5
1     75.0775   0.1400   14.3957   1.6495      0.1300   0.9397   2.9091   4.7286   0.0300
2     76.2700   0.0000   14.1100   1.1600      0.0300   0.9800   3.3100   4.1200   0.0200
3     75.6819   0.0000   13.3080   2.5377      0.0699   1.1290   3.1671   4.1063   0.0000
4     76.3489   0.0000   13.6091   1.7486      0.0000   1.1791   3.4173   3.6970   0.0000
1*    75.3033   0.0000   14.4390   1.6545      0.0000   0.9425   2.9179   4.7428   0.0000

Table 4. Result from the application of replacement to the composition numbered as 1* in Table 3

Replacement      SiO2      TiO2     Al2O3     Fe2O3_tot   MgO      CaO      Na2O     K2O      P2O5
None (1*)        75.3033   0.0000   14.4390   1.6545      0.0000   0.9425   2.9179   4.7428   0.0000
Additive         75.2885   0.0415   14.4242   1.6397      0.0385   0.9277   2.9031   4.7280   0.0089
Simple           75.0782   0.1395   14.3958   1.6495      0.1296   0.9397   2.9092   4.7286   0.0299
Multiplicative   75.0775   0.1400   14.3957   1.6495      0.1300   0.9397   2.9091   4.7286   0.0300

Relationship between final imputed value and threshold: clear replacement
Dealing with rounded zeros is essentially the same as dealing with NMAR missing values, where NMAR (Little & Rubin 2002) means Not Missing At Random. For NMAR missing values one considers that the probability that a component is missing may depend on the unobserved component of the data. That is, the mechanism of 'missingness' is nonignorable. Essentially that is the case of rounded zeros in compositional data because the 'missingness' is strictly related to the nature of the variable rather than to the sample itself. Consequently, in non-parametric imputation methods, it seems logical to expect that the final imputed value in a specific part depends on the nature of the part rather than on the other values in the sample. For ease of readability δ_j is considered to take the value 0.1% for all those parts which have null values, but it is important to remark that it is recommended (Martín-Fernández et al. 2003a) to use δ_j equal to 65% of the threshold value. Using those values (δ_j = 0.1%), the replacement formulae (3), (4) and (5) are applied to the samples numbered 2 to 4, and 1* (Table 3). Table 5 shows the replaced compositions for each replacement. Let attention be concentrated on part TiO2. For the additive (3) and the simple (4) replacements the final imputed value is different for the three observations because the final imputed value depends on the number of null values in the sample. It is clear that for the additive replacement the final imputed value increases when the number of zeros increases, and for the simple replacement the effect is the contrary. These effects do not seem reasonable since for the same part, i.e. the same threshold, the final imputed value in the part depends on the presence or absence of null values in other parts of the sample. In contrast, the multiplicative replacement imputes exactly the same value in all the null values of the part. This effect is reasonable, taking into account the nature of the rounded zeros.
On the other hand, the multiplicative replacement introduces artificial correlation between parts which have null values in the same samples. This effect is unsuitable and can distort the results of subsequent multivariate analysis when the number of null values in the dataset is large, more than 10% (Sandford et al. 1993). Nevertheless, this artificial correlation is an inherent effect of the non-parametric methods of imputation rather than an effect of the specific formula (5). Note that this effect is present in this kind of technique for data in real space and it is well known in the missing data approach (Little & Rubin 2002). When the number of zeros in the dataset is quite large, parametric methods of imputation are recommended. These methods incorporate the information included in the covariance structure and impute values taking this information into account. As a consequence, the imputed value would be different for each sample, and the specific mechanism of imputation must be the formula of multiplicative replacement (5), since the procedure of this multiplicative replacement is clearer than that of the additive and simple formulae.

Table 5. Results from the application of replacement to the samples numbered 2, 3, 4 and 1* in Table 3

Replacement      No.   SiO2      TiO2     Al2O3     Fe2O3_tot   MgO      CaO      Na2O     K2O      P2O5
Additive         2     76.2675   0.0198   14.1075   1.1575      0.0275   0.9775   3.3075   4.1175   0.0175
                 3     75.6745   0.0259   13.3006   2.5303      0.0625   1.1216   3.1597   4.0989   0.0259
                 4     76.3341   0.0296   13.5943   1.7338      0.0296   1.1642   3.4025   3.6822   0.0296
                 1*    75.2885   0.0296   14.4242   1.6397      0.0296   0.9277   2.9031   4.7280   0.0296
Simple           2     76.1938   0.0999   14.0959   1.1588      0.0300   0.9790   3.3067   4.1159   0.0200
                 3     75.5308   0.0998   13.2815   2.5327      0.0698   1.1267   3.1608   4.0981   0.0998
                 4     76.1206   0.0997   13.5684   1.7434      0.0997   1.1755   3.4070   3.6860   0.0997
                 1*    75.0781   0.0997   14.3958   1.6495      0.0997   0.9397   2.9092   4.7286   0.0997
Multiplicative   2     76.1937   0.1000   14.0959   1.1588      0.0300   0.9790   3.3067   4.1159   0.0200
                 3     75.5305   0.1000   13.2814   2.5326      0.0698   1.1267   3.1608   4.0981   0.1000
                 4     76.1199   0.1000   13.5683   1.7434      0.1000   1.1755   3.4070   3.6860   0.1000
                 1*    75.0774   0.1000   14.3957   1.6495      0.1000   0.9397   2.9091   4.7286   0.1000

Table 6. Some ratios from the application of replacement (δ_j = 0.1%) to the samples numbered 2, 3, 4 and 1* in Table 3

Ratio (in %)                       Initial    Additive   Simple     Multiplicative
Al2O3/SiO2 of sample 4             17.8249    17.8089    17.8249    17.8249
Between 1* and 4 in part Na2O      85.3863    85.3227    85.3863    85.3863
Between 3 and 2 in part Na2O       95.6843    95.5318    95.5888    95.5885
Between 4 and 2 in part Na2O       103.2407   102.8698   103.0348   103.0340
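As a worked check of the TiO2 column of Table 5, the final imputed value can be written as a function of the number of zeros Z in the sample. This is a sketch under the assumptions used in that table: the same value δ = 0.1 for every null part, D = 9 and c = 100. The superscripts add, sim and mult are notation introduced here only for illustration. From formulae (3), (4) and (5),

$$
r_j^{\mathrm{add}}(Z) = \frac{\delta\,(Z + 1)(D - Z)}{D^2}, \qquad
r_j^{\mathrm{sim}}(Z) = \frac{c\,\delta}{c + Z\delta}, \qquad
r_j^{\mathrm{mult}}(Z) = \delta .
$$

For the samples with Z = 1, 2 and 3 null values this gives

$$
r_j^{\mathrm{add}} = \frac{0.1 \cdot 2 \cdot 8}{81} = 0.0198, \quad \frac{0.1 \cdot 3 \cdot 7}{81} = 0.0259, \quad \frac{0.1 \cdot 4 \cdot 6}{81} = 0.0296,
$$

$$
r_j^{\mathrm{sim}} = \frac{10}{100.1} = 0.0999, \quad \frac{10}{100.2} = 0.0998, \quad \frac{10}{100.3} = 0.0997,
$$

while the multiplicative value stays at 0.1000 whatever the value of Z, in agreement with the entries of Table 5.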
Avoiding the distortion of covariance structure: preservation of ratios
In order to evaluate the distortion of the covariance structure the effect of each replacement on the ratios of the data is analysed in this section. It is important to analyse the ratios between two parts in one sample and, also, the ratios between two samples in one specific part. Naturally, as the log-ratio covariance is based on the ratios, those replacements that preserve the ratios will be more reliable in preserving the covariance structure. In Table 6 the ratio between the parts Al2O3 and SiO2 for sample number 4 is shown. Initially, this ratio is equal to 17.8249. After each replacement is applied (δ_j = 0.1%), the simple and multiplicative replacements preserve the ratio and the additive replacement distorts it. Samples 4 and 1* have the same number (3) of null values, located in the same parts. For the part Na2O, initially the ratio between samples 1* and 4 is equal to 85.3863. After the replacements are applied the additive method distorts the ratio and the simple and multiplicative methods preserve it. This preservation of ratios by the simple and multiplicative replacements is a consequence of the multiplicative modification of the non-null values in formulae (4) and (5). In other words, the additive modification of the non-zero values in formula (3) has the effect of distorting the ratios. The ratios between samples 3, 4 and 2 are distorted by all the replacements since these samples have different numbers of null values. Nevertheless, note that in these cases the behaviour of the simple and the multiplicative replacements is extremely similar. In Martín-Fernández et al. (2003a) an extended analysis of the theoretical properties of these replacements is presented. There the authors review in depth the properties of these replacements in relation to the basic operations (subcomposition, perturbation and power transformation) and in relation to basic elements of log-ratio methodology (Aitchison distance, compositional geometric mean, variance matrix and total variance).
Sensitivity analysis: the obligatory step in non-parametric replacements
In any non-parametric replacement strategy the imputed value is selected in advance. After the replacement is made some multivariate method will be applied and some results expressed in terms of some indices will be obtained. For example, after PCA is applied one has the proportion of explained variance; in cluster analysis the number of groups is obtained; in discriminant analysis the linear discriminant function (LDF) misclassification rate is calculated; in linear regression the R² index is obtained. There are numerous examples of results from a multivariate analysis. The question that naturally arises is how robust the results are in relation to the imputed values in the replacement method. Therefore, a sensitivity analysis of the results in relation to these values must be made. It is very important to emphasize that this step is not inherent in log-ratio methodology. This kind of analysis is obligatory for any kind of dataset and any methodology applied. In Tauber (1999) a descriptive example is presented in order to illustrate the strong influence of the selection of the δ_j value in cluster analysis studies using log-ratio methodology for compositional data. There the main argument was that when the imputed value tends to zero spurious clusters appear. Note that this effect is logical and not inherent in the log-ratio methodology since in the Euclidean context exactly the same effect would appear. If there are missing values in some Euclidean dataset and these are replaced by increasingly large or decreasingly small imputed values, spurious clusters are obtained when these imputed values tend to infinity or minus infinity.
For the dataset of Cenozoic volcanic rocks, a detailed revision of the data and geological knowledge (Ó.Kovács & Kovács 2001) of the sampling process suggest a common threshold for all parts equal to 0.01%, and hence the maximum rounding-off error δ_mro = 0.005% is considered. Martín-Fernández et al. (2003a) suggested 65% of the threshold (δ_j = 0.0065%) as a suitable imputed value. Using these values the multiplicative replacement (5) was applied. In Martín-Fernández et al. (2004b) the authors were interested in linear discriminant analysis using log-ratio methodology on the dataset from the database of Cenozoic volcanic rocks of Hungary referred to above. In this dataset two groups of samples exist: alkaline basalts and the calc-alkaline series. The main goal of this study was to contribute to the understanding of petrogenetic processes that occurred in the Carpatho-Pannonian region. After a first descriptive step based on the compositional biplot (Aitchison & Greenacre 2002) of the replaced dataset (Fig. 2), it was noted that the rays of clr(SiO2), clr(K2O), clr(Na2O) and clr(Al2O3) are the closest rays to the first axis of the biplot, the direction along which the projections of the alkaline basalts and the calc-alkaline series are best separated.
[Figure 2: biplot with axes labelled axis1 and axis2; labelled rays include clr(SiO2), clr(Al2O3), clr(K2O), clr(TiO2), clr(P2O5), clr(MgO) and clr(CaO).]
Fig. 2. Biplot in the clr-transformed space. Clr components and samples: circles represent calc-alkaline series; dots, alkaline basalts.
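The replacement and transformation steps that precede the biplot can be sketched in a few lines. This is only an illustration under stated assumptions: it assumes that the multiplicative replacement referred to as (5) has the form given in Martín-Fernández et al. (2003a), in which each rounded zero is set to $\delta_j$ and every non-zero part is multiplied by $1 - \sum_k \delta_k / c$, with $c$ the closure constant. The function names and the four-part composition are hypothetical and are not taken from the Hungarian database.

```python
import math


def multiplicative_replacement(x, delta, c=1.0):
    """Replace zeros of a composition x (closed to c) by delta.

    Non-zero parts are rescaled multiplicatively so that the result
    still sums to c and the ratios between non-zero parts are kept.
    `delta` may be a single value or one value per part (assumed API).
    """
    if isinstance(delta, (int, float)):
        delta = [delta] * len(x)
    zero_total = sum(d for xi, d in zip(x, delta) if xi == 0)
    return [d if xi == 0 else xi * (1.0 - zero_total / c)
            for xi, d in zip(x, delta)]


def clr(x):
    """Centred log-ratio: log of each part over the geometric mean."""
    logs = [math.log(xi) for xi in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical composition in percent with one rounded zero.
x = [60.0, 25.0, 15.0, 0.0]
r = multiplicative_replacement(x, delta=0.0065, c=100.0)  # 65% of 0.01%
z = clr(r)
```

The replaced vector still sums to the closure constant and the ratios among the originally non-zero parts are unchanged, which is the property that makes this formula attractive for subsequent log-ratio analysis.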
ROUNDED ZEROS: SOME PRACTICAL ASPECTS
[Figure 3: ternary diagram of the centred subcomposition; recoverable vertex labels MgO^c and (Na2O+K2O)^c.]
Fig. 3. Centred [SiO2; MgO; Na2O + K2O] subcomposition. Circles represent calc-alkaline series; dots, alkaline basalts. Superindex c indicates centred parts.
The good separation of the alkaline basalts and the calc-alkaline series in the biplot is also numerically confirmed by a linear discriminant analysis of the two groups applied to the clr-transformed dataset: only 3.96% of the observations are incorrectly classified (misclassification rate) by the LDF. Subcompositional linear patterns and, simultaneously, reasonable separation of the two groups are obtained with ternary diagrams including SiO2 and K2O, or SiO2 and Na2O, and a third component, e.g. MgO, whose vertex lies further apart in the biplot (Fig. 2). Further, in order to allow comparison of the log-ratio analysis with the results from traditional methods (Martín-Fernández et al. 2004b), the amalgamated subcomposition [SiO2; MgO; Na2O + K2O] is considered. Clearly, the separation of the two groups is better visually (Fig. 3; LDF misclassification rate: 3.96%) with the amalgamated subcomposition. Note that the data have been centred (Martín-Fernández et al. 1999). For more details of the centring operation see Pawlowsky-Glahn & Egozcue (2006).
A sensitivity analysis must now be performed. In Aitchison (1986), the range $\delta_{mro}/5 < \delta < 2\delta_{mro}$, where $\delta_{mro}$ is the maximum rounding-off error, is suggested as reasonable for a sensitivity analysis. The imputed value (Martín-Fernández et al. 2003a) is 65% of the threshold and so the range seems to be appropriate. Figure 4 shows the pattern of variation of the LDF misclassification rate when the imputed value, initially $\delta_j = 0.000065$, is varied simultaneously for all parts over the range $0.00001 < \delta_j < 0.0001$. Observe that for values around the imputed value the LDF rate is reasonably stable and that, when $\delta$ tends to zero, the LDF rate increases, showing that the two groups become more mixed. The reason for this behaviour lies in the null values in the part MgO of the samples which belong to the calc-alkaline series group. These values are responsible for the calc-alkaline group increasing its variability.
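The centring operation mentioned above can be illustrated with a short sketch. It assumes the usual definition: the centre of a dataset is the closed vector of column-wise geometric means, and centring perturbs every composition by the inverse of that centre, so that the centred dataset has the barycentre as its centre. The function names and the small dataset are hypothetical, and all parts are assumed strictly positive (that is, zeros have already been replaced).

```python
import math


def closure(x, c=1.0):
    """Rescale the parts of x so that they sum to c."""
    total = sum(x)
    return [c * xi / total for xi in x]


def centre(data):
    """Closed vector of column-wise geometric means."""
    n, d = len(data), len(data[0])
    gms = [math.exp(sum(math.log(row[j]) for row in data) / n)
           for j in range(d)]
    return closure(gms)


def centred(data):
    """Perturb each composition by the inverse of the centre."""
    g = centre(data)
    return [closure([xi / gi for xi, gi in zip(row, g)]) for row in data]


# Hypothetical three-part dataset crowded near one vertex.
data = [
    [0.70, 0.20, 0.10],
    [0.65, 0.25, 0.10],
    [0.75, 0.15, 0.10],
]
moved = centred(data)
```

After centring, the data cloud sits around the middle of the ternary diagram, which is why patterns that are squeezed against a side or vertex become easier to see.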
[Figure 4: LDF misclassification rate (%) plotted against the imputed value δ for 0.00001 < δ < 0.0001 (horizontal axis in units of 10^-5, with 6.5 marked).]
Fig. 4. Sensitivity analysis: variation of the LDF misclassification rate.
Therefore, those calc-alkaline samples which are close to the alkaline basalts group, with high MgO, are misclassified. This sensitivity analysis could be made more sophisticated by considering different combinations of variation of the $\delta_j$ among the parts. For example, one could fix the value $\delta_j$ in some parts and vary it in other parts in order to detect the contribution of each combination of parts to the sensitivity. Another possibility is to decrease the value $\delta_j$ for some parts and simultaneously increase it for other parts. Naturally, the results produced are different in each practical study performed. From the authors' experience, the most interesting and interpretable results are obtained when a global sensitivity analysis is performed in the way that produced the LDF misclassification rate in Figure 4.
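A global sensitivity analysis of this kind can be sketched as a loop over imputed values. The sketch below is not the authors' computation: to stay self-contained it uses the total clr variance of the replaced dataset as a stand-in index instead of the LDF misclassification rate, and it assumes the multiplicative replacement form of Martín-Fernández et al. (2003a). All function names and the toy dataset are hypothetical; any other index (for example a misclassification rate) could be passed in through the `index` argument.

```python
import math


def multiplicative_replacement(x, delta, c=1.0):
    """Set zeros to delta and rescale non-zero parts (same delta for all parts)."""
    zero_total = delta * sum(1 for xi in x if xi == 0)
    return [delta if xi == 0 else xi * (1.0 - zero_total / c) for xi in x]


def clr(x):
    logs = [math.log(xi) for xi in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def total_clr_variance(data):
    """Sum of the sample variances of the clr components (stand-in index)."""
    rows = [clr(row) for row in data]
    n = len(rows)
    total = 0.0
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        mean = sum(column) / n
        total += sum((v - mean) ** 2 for v in column) / (n - 1)
    return total


def sensitivity(data, deltas, index=total_clr_variance, c=1.0):
    """Return (delta, index value) pairs, varying delta for all parts at once."""
    results = []
    for delta in deltas:
        replaced = [multiplicative_replacement(row, delta, c) for row in data]
        results.append((delta, index(replaced)))
    return results


delta_mro = 0.00005                      # 0.005% in parts per one
low, high = delta_mro / 5, 2 * delta_mro  # range suggested in Aitchison (1986)
steps = 10
deltas = [low + k * (high - low) / (steps - 1) for k in range(steps)]

# Hypothetical data: one sample has a rounded zero in the third part.
data = [
    [0.60, 0.25, 0.15],
    [0.55, 0.30, 0.15],
    [0.70, 0.30, 0.00],
    [0.50, 0.35, 0.15],
]
curve = sensitivity(data, deltas)
```

In this toy case the index grows as the imputed value shrinks, which mirrors the increase in variability described for the group containing the zeros.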
Concluding remarks
The well-known 'problem of zeros' is inherent in the nature of compositional data rather than in the log-ratio methodology. In particular, rounded zeros should be considered as small missing values. In datasets where null values are less than 10% of the data, three different formulae of non-parametric replacement can be applied. The multiplicative replacement appears to be the easiest, fastest, most coherent and most natural formula for substitution of the rounded zeros by appropriate small values. Whatever the replacement method employed, a sensitivity analysis of the results is obligatory in order to analyse the variability of the results in relation to the variation of the imputed value. For datasets with a large number of rounded zeros, parametric methods for missing data should be applied. These methods are not developed here, but it seems reasonable to imagine that they will consist of a combination of either the EM algorithm or MCMC methods with the appropriate log-ratio transformation: additive log-ratio (alr), isometric log-ratio (ilr) or centred log-ratio (clr). Future research will focus on this strategy.
This work has received financial support from the Dirección General de Investigación of the Spanish Ministry for Science and Technology through the project BFM2003-05640/MATE. The data set from the database of Cenozoic volcanic rocks of Hungary has been kindly provided by Drs L. Ó.Kovács and G. P. Kovács from the Hungarian Geological Survey.
References
AITCHISON, J. 1986. The statistical analysis of compositional data. Chapman & Hall, London. Reprinted (2003) by The Blackburn Press, Caldwell, NJ.
AITCHISON, J. & GREENACRE, M. 2002. Biplots of compositional data. Applied Statistics, 51, 375-392.
AITCHISON, J. & KAY, J. W. 2003. Possible solutions of some essential zero problems in compositional data analysis. In: THIÓ-HENESTROSA, S. & MARTÍN-FERNÁNDEZ, J. A. (eds) Proceedings of CoDaWork'03, The First Compositional Data Analysis Workshop, October 15-17, University of Girona (Spain). CD-ROM (World Wide Web: http://ima.udg.es/Activitats/CoDaWork03/index.html#session2).
BACON-SHONE, J. 2003. Modelling structural zeros in compositional data. In: THIÓ-HENESTROSA, S. & MARTÍN-FERNÁNDEZ, J. A. (eds) Proceedings of CoDaWork'03, The First Compositional Data Analysis Workshop, October 15-17, University of Girona (Spain). CD-ROM (World Wide Web: http://ima.udg.es/Activitats/CoDaWork03/index.html#session2).
BUCCIANTI, A. & ROSSO, F. 1999. A new approach to the statistical analysis of compositional (closed) data with observations below the 'detection limit'. Geoinformatica, 3, 17-31.
EGOZCUE, J. J. & PAWLOWSKY-GLAHN, V. 2006. Simplicial geometry for compositional data. In: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 1-10.
FRY, J. M., FRY, T. R. L. & MCLAREN, K. R. 2000. Compositional data analysis and zeros in micro data. Applied Economics, 32, 953-959.
LITTLE, R. J. A. & RUBIN, D. B. 2002. Statistical Analysis with Missing Data (2nd edn). John Wiley and Sons, New York.
JOBSON, J. D. 1992. Applied multivariate data analysis, Vol. II: Categorical and multivariate data analysis. Springer texts in statistics, Springer, New York.
MARTÍN-FERNÁNDEZ, J. A., BREN, M., BARCELÓ-VIDAL, C. & PAWLOWSKY-GLAHN, V. 1999. A measure of difference for compositional data based on measures of divergence. In: Proceedings of IAMG'99. Trondheim, Norway, 1, 211-216.
MARTÍN-FERNÁNDEZ, J. A., BARCELÓ-VIDAL, C. & PAWLOWSKY-GLAHN, V. 2000. Zero replacement in compositional data sets. In: KIERS, H. A. L., RASSON, J.-P., GROENEN, P. J. F. & SCHADER, M. (eds) Proceedings of the 7th Conference of the International Federation of Classification Societies, University of Namur (Belgium). Springer-Verlag, Berlin, Germany, 155-160.
MARTÍN-FERNÁNDEZ, J. A., BARCELÓ-VIDAL, C. & PAWLOWSKY-GLAHN, V. 2003a. Dealing with zeros and missing values in compositional data sets. Mathematical Geology, 35 (3), 253-278.
MARTÍN-FERNÁNDEZ, J. A., PALAREA-ALBALADEJO, J. & GÓMEZ-GARCÍA, J. 2003b. Markov Chain Monte Carlo Method Applied to Rounding Zeros of Compositional Data: First Approach. In: THIÓ-HENESTROSA, S. & MARTÍN-FERNÁNDEZ, J. A. (eds) Proceedings of CoDaWork'03, The First Compositional Data Analysis Workshop, October 15-17, University of Girona (Spain). CD-ROM (World Wide Web: http://ima.udg.es/Activitats/CoDaWork03/index.html#session2).
MARTÍN-FERNÁNDEZ, J. A., Ó.KOVÁCS, L., KOVÁCS, G. P. & PAWLOWSKY-GLAHN, V. 2004a. The treatment of zeros in compositional data analysis: the database of cenozoic volcanites of Hungary. 32nd International Geological Conference, Florence (I), Abstracts Volume, part 1, abstract 41-12, p. 213.
MARTÍN-FERNÁNDEZ, J. A., PAWLOWSKY-GLAHN, V., Ó.KOVÁCS, L. & KOVÁCS, G. P. 2004b. Subcompositional exploration in the database of cenozoic volcanites of Hungary. 32nd International Geological Conference, Florence (I), Abstracts Volume, part 1, abstract 41-16, p. 214.
Ó.KOVÁCS, L. & KOVÁCS, G. P. 2001. Petrochemical database of the Cenozoic volcanites in Hungary: structure and statistics. Acta Geologica Hungarica, 44 (4), 381-417.
PAWLOWSKY-GLAHN, V. & EGOZCUE, J. J. 2006. Compositional data and their analysis: an introduction. In: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 1-10.
SANDFORD, R. F., PIERSON, C. T. & CROVELLI, R. A. 1993. An objective replacement method for censored geochemical data. Mathematical Geology, 25 (1), 59-80.
TAUBER, F. 1999. Spurious clusters in granulometric data caused by logratio transformation. Mathematical Geology, 31 (5), 491-504.
Is the simplex open or closed? (some topological concepts)
E. BARRABÉS & G. MATEU-FIGUERAS
Department Informàtica i Matemàtica Aplicada, Universitat de Girona, Campus Montilivi, P4, E-17071 Girona, Spain (e-mail: [email protected])
Abstract: The simplex is the natural space to work with when compositional data are considered. Sometimes, the concepts of open simplex and closed simplex are used, although most of the time they are not well defined. The objective of this contribution is to set out some of the mathematical concepts related to the simplex and its structure in order to make clear when the terms open and closed are mathematically appropriate. Moreover, these concepts sometimes generate discussion about the proper representation of the simplex. It will be shown that this discussion makes no sense when considering the simplex as a Euclidean vector space.
From: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 203-206. 0305-8719/06/$15.00 © The Geological Society of London 2006.
Since Aitchison (1982) introduced the log-ratio approach to compositional data, there has been some occasional discussion about the proper representation of this type of observation and the terminology to be used. Since this discussion arises time and again, it seems appropriate to clarify the concepts involved. The discussion arises from two ways of defining the sample space of compositional data. On one side, there is the traditional sample space in geology and other fields of science, which is
$$ R = \{x = (x_1, x_2, \ldots, x_D);\ x_1 + x_2 + \cdots + x_D = 1,\ x_i \ge 0,\ i = 1, \ldots, D\}. \tag{1} $$
Observe that one can consider the set $R$ embedded in the D-dimensional space $\mathbb{R}^D$ and that it includes data with zero values. On the other side, the sample space in the log-ratio approach excludes zero values and is given by
$$ S = \{x = (x_1, x_2, \ldots, x_D);\ x_1 + x_2 + \cdots + x_D = 1,\ x_i > 0,\ i = 1, \ldots, D\}. \tag{2} $$
The restriction in $S$ to strictly positive values for all the components is necessary for modelling within the log-ratio approach, as division by zero and the log-transformation of zero are not defined. Both sets, $R$ and $S$, are embedded in the D-dimensional real space $\mathbb{R}^D$ and the essential difference between them is the inclusion, respectively exclusion, of n-tuples with zero components. The discussion is whether it is right or wrong to use the terminology closed and open simplex for the sets $R$ and $S$, respectively. In Figure 1, both sets are represented for D = 3.
Fig. 1. Representation of the sets R and S (left and right respectively) in $\mathbb{R}^3$. In the first case, the segments of the triangle (its boundary) on the coordinate planes are included, while in the second case they are not. The vertices of the triangle correspond to data with exactly two components equal to zero and the sides (excluding the vertices) correspond to data with one component equal to zero.
General concepts
To analyse the concepts open simplex and closed simplex, it is necessary to recall the definitions of open and closed ball, open and closed set, and boundary (or frontier) of a set in a metric space (see, for example, Schechter (1997) or Rudin (1976)). Let $(\mathbb{R}^D, d)$ be the D-dimensional real space with the usual Euclidean metric, which is given by the function
$$ d(x, x^*) = \|x - x^*\| = \sqrt{(x_1 - x_1^*)^2 + \cdots + (x_D - x_D^*)^2}, \tag{3} $$
where $d(x, x^*)$ stands for the distance between $x$ and $x^*$ and $\|x - x^*\|$ for the norm of the vector $x - x^*$. For any $z \in \mathbb{R}^D$ and $r > 0$, the open ball centred at $z$ with radius $r$ is the set
$$ B_d(z, r) = \{x \in \mathbb{R}^D;\ d(x, z) < r\}, \tag{4} $$
that is, the set of all points of $\mathbb{R}^D$ located at a distance less than $r$ from $z$. The closed ball with centre $z$ and radius $r$ is the set
$$ K_d(z, r) = \{x \in \mathbb{R}^D;\ d(x, z) \le r\}. \tag{5} $$
The difference between the two sets is the inclusion or not of an equality (recall that the same situation holds for the sets $R$ and $S$): closed balls admit all points at distance from the centre equal to $r$, while open balls do not. For example, for D = 3, closed and open balls are solid spheres with and without their bounding surface, respectively. A set $T \subseteq \mathbb{R}^D$ is said to be open if for each point $z \in T$ there exists $r > 0$ such that $B_d(z, r) \subseteq T$. That is, open sets are those in which an open ball contained in the set can be placed around each of their points, no matter how small the radius of the ball has to be taken. A set $T \subseteq \mathbb{R}^D$ is said to be a closed set if its complementary set, that is, the set of all points of $\mathbb{R}^D$ that do not belong to $T$, is an open set. The complementary set of a set $T$ is written as $T^c = \mathbb{R}^D \setminus T$. Observe that a set can be neither open nor closed. For example, in the real line, the interval (0, 1) is an open set, [0, 1] is a closed set, but (0, 1] is neither open nor closed. Defining the boundary or frontier of a set requires the concepts of interior and closure of a set. The interior of a set is the largest open set contained in it, and shall be denoted by int(T). The closure of a set is the smallest closed set which contains it, and shall be denoted by cls(T). Observe that an open ball is an open set, and that the closure of an open ball is the closed one with the same centre and radius. Furthermore, a closed ball is a closed set, and the interior of a closed ball is the open one with the same centre and radius, that is,
$$ \operatorname{cls}(B_d(z, r)) = K_d(z, r) \quad \text{and} \quad \operatorname{int}(K_d(z, r)) = B_d(z, r). \tag{6} $$
Now, the boundary or frontier $\partial T$ of a set $T$ is defined as the intersection of cls(T) with cls($T^c$), and it can be shown that for any $T \subseteq \mathbb{R}^D$, the space $\mathbb{R}^D$ can be partitioned into the three disjoint sets int(T), int($T^c$) and $\partial T$, which are open, open and closed, respectively. That is,
$$ \mathbb{R}^D = \operatorname{int}(T) \cup \operatorname{int}(T^c) \cup \partial T. \tag{7} $$
Furthermore, $T$ is a closed set if and only if it is equal to its closure, that is, if and only if cls(T) = T. An immediate consequence of the definition of an open set is that in $\mathbb{R}^3$ open sets have volume (in general, in $\mathbb{R}^D$, open sets contain open balls); thus, neither $R$ nor $S$ is an open set, given that they are completely flat (see Fig. 1). This can be generalized to any dimension: $R$ and $S$ are not open sets in $\mathbb{R}^D$ with the Euclidean distance because $\operatorname{int}(R) = \operatorname{int}(S) = \emptyset$ and, therefore, they cannot contain any open ball. On the other hand, $R$ is a closed set of $\mathbb{R}^D$, as its complementary set is open, but $S$ is not a closed set because $\operatorname{cls}(S) = R \ne S$. Furthermore
$$ \partial R = \partial S = \{x = (x_1, x_2, \ldots, x_D);\ x_i = 0 \text{ for some } i = 1, \ldots, D\}, \tag{8} $$
that is, in both cases the boundary is exactly the same: the set of the compositions with one or more components equal to 0. In conclusion, if one considers $R, S \subset \mathbb{R}^D$ with the Euclidean distance $d$, the set $R$ is closed but the set $S$ is neither open nor closed.
The simplex as a subset of $\mathbb{R}^{D-1}$
When dealing with the sets $R$ and $S$, a different approach can be used. The condition $x_1 + x_2 + \cdots + x_D = 1$ allows one of the components, say $x_D$, to be expressed in terms of the other variables as
$$ x_D = 1 - x_1 - \cdots - x_{D-1}. \tag{9} $$
Therefore, the following sets can be defined
$$ R' = \{x = (x_1, x_2, \ldots, x_{D-1});\ x_1 + x_2 + \cdots + x_{D-1} \le 1,\ x_i \ge 0,\ i = 1, \ldots, D-1\}, \tag{10} $$
$$ S' = \{x = (x_1, x_2, \ldots, x_{D-1});\ x_1 + x_2 + \cdots + x_{D-1} < 1,\ x_i > 0,\ i = 1, \ldots, D-1\}, \tag{11} $$
and $x \in R$ if and only if $x = (x', x_D)$, where $x' \in R'$ and $x_D$ satisfies (9) (analogously for the sets $S$ and $S'$). It is observed that whereas $R$ and $S$ are embedded in the D-dimensional real space, $R'$ and $S'$ are embedded in the (D - 1)-dimensional real space. In Figure 2 the sets $R'$ and $S'$ are represented for the case D = 3.
Fig. 2. The sets R' and S' for D = 3 (left and right respectively) in two-dimensional space. In the first case the sides of the triangle are included while in the second case they are not.
Then $R'$ and $S'$ are regarded as subsets of $\mathbb{R}^{D-1}$ and here it is found that $R'$ is a closed set whereas $S'$ is an open set. Furthermore, $R' = \operatorname{cls}(S')$ and $S' = \operatorname{int}(R')$ (recall the difference with $R$ and $S$). This justifies the common terminology closed simplex for $R'$ and open simplex for $S'$. But it is necessary to emphasize the fact that in this case only D - 1 variables are involved, so the representation would be different from the representation of the sets $R$ and $S$.
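The four sets and the open-ball criterion can be made concrete with a small sketch. It is only an illustration: the function names are hypothetical, the tolerance used for the unit-sum check is an assumption needed for floating-point input, and `inner_ball_radius` simply returns the smallest Euclidean distance from a point of the (D - 1)-dimensional set to the hyperplanes bounding it, so a strictly positive value means that an open ball around the point fits inside S'.

```python
import math


def in_R(x, tol=1e-12):
    """Unit sum and non-negative parts (zeros allowed)."""
    return abs(sum(x) - 1.0) <= tol and all(xi >= 0 for xi in x)


def in_S(x, tol=1e-12):
    """Unit sum and strictly positive parts."""
    return abs(sum(x) - 1.0) <= tol and all(xi > 0 for xi in x)


def project(x):
    """Drop the last part; it is recovered as 1 minus the sum of the others."""
    return list(x[:-1])


def in_R_prime(xp):
    return sum(xp) <= 1.0 and all(xi >= 0 for xi in xp)


def in_S_prime(xp):
    return sum(xp) < 1.0 and all(xi > 0 for xi in xp)


def inner_ball_radius(xp):
    """Largest r such that the open ball of radius r around xp lies in S'.

    Distances to the coordinate hyperplanes are the parts themselves;
    the distance to the hyperplane sum(xp) = 1 is (1 - sum) / sqrt(D - 1).
    """
    gaps = list(xp) + [(1.0 - sum(xp)) / math.sqrt(len(xp))]
    return max(0.0, min(gaps))


interior = [0.2, 0.3, 0.5]   # no zero parts
edge = [0.5, 0.5, 0.0]       # one zero part
radius_interior = inner_ball_radius(project(interior))
radius_edge = inner_ball_radius(project(edge))
```

An interior composition yields a positive radius, whereas a composition with a zero part yields radius zero: no open ball around it stays inside S', which is the reason R' is closed but not open.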
The simplex as a Euclidean vector space
Aitchison (1986) showed that the standard operations in real space may make no sense from a compositional point of view. Aitchison defined two fundamental operations, perturbation and powering, as well as a distance in the simplex, known as the Aitchison distance, which is given by
$$ d_a(x, x^*) = \sqrt{\frac{1}{D} \sum_{i<j} \left( \ln\frac{x_i}{x_j} - \ln\frac{x_i^*}{x_j^*} \right)^2}. \tag{12} $$
The distance $d_a$ cannot be defined on any composition $x$ with one or more components equal to zero because it involves division by zero and/or the logarithm of zero. In consequence, the distance $d_a$ cannot be defined on the set $R$, whereas there is no problem in defining it on the set $S$. Later, Billheimer et al. (2001) and Pawlowsky-Glahn & Egozcue (2001) proved that the set $S$ with the operations defined by Aitchison has a Euclidean vector space structure and that it is not adequate to consider it as a subset of the real space $\mathbb{R}^D$. A summary of the state of the art can be found in this volume in the article 'Simplicial geometry for compositional data' by Egozcue & Pawlowsky-Glahn (2006). Thus, if one wants to be coherent from a compositional point of view, the simplex must be considered as a Euclidean vector space in itself, that is, $(S, d_a)$ must be considered as a metric space. In this context, the discussion about closed and open simplex has no meaning whatsoever, as $S$ is the whole space and therefore open and closed at the same time.
Do the compositions with zero parts play any role in this context? It is clear that they do not belong to the space $S$, and that they are at infinite distance ($d_a$) from any point of the set $S$. So, any point $x \in R$ with $x_i = 0$ for some $i = 1, \ldots, D$ has the same behaviour as a point with an infinite component in the real space. But what happens if a dataset contains zeros? There will be problems using the log-ratio approach. At this point it will be necessary to apply a replacement strategy to the zeros (see, for example, Martín-Fernández et al. (2003)) in order to use the Aitchison distance without numerical problems.
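The behaviour of the Aitchison distance near a zero part can be checked numerically. The sketch below assumes the pairwise log-ratio form of the distance (the sum over pairs i < j divided by D) and compares it with the equivalent Euclidean distance between clr-transformed vectors; function names and example compositions are hypothetical.

```python
import math


def aitchison_distance(x, y):
    """Pairwise log-ratio form: sqrt((1/D) * sum over i<j of squared differences)."""
    d = len(x)
    total = 0.0
    for i in range(d):
        for j in range(i + 1, d):
            diff = math.log(x[i] / x[j]) - math.log(y[i] / y[j])
            total += diff * diff
    return math.sqrt(total / d)


def clr(x):
    logs = [math.log(xi) for xi in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def aitchison_distance_clr(x, y):
    """Equivalent form: Euclidean distance between clr vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))


x = [0.2, 0.3, 0.5]
y = [0.1, 0.6, 0.3]

# Distance from x to compositions whose first part shrinks towards zero.
shrinking = [1e-2, 1e-4, 1e-6]
distances = [aitchison_distance(x, [e, (1 - e) / 2, (1 - e) / 2])
             for e in shrinking]
```

The distances grow without bound as the first part approaches zero, which is the numerical counterpart of saying that compositions with a zero part lie at infinite distance from every point of S.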
Representation: the ternary diagram
Now the discussion is which is the suitable representation for the simplex $(S, d_a)$. The argument put forward by supporters of different representations is that $R$ is a closed set, whereas $S$ is an open set. As has been seen before, this discussion makes no sense when $S$ is considered as a vector space in itself, which means that it is open and closed at the same time. Let attention be centred on the case D = 3. All the ideas developed here can be generalized easily to an arbitrary number of parts (although there is no suitable representation for the simplex for D > 4). It is widely known that a convenient way to represent a three-part composition is the ternary diagram. As can be seen in Figure 3, there is a one-to-one correspondence between a three-part composition and a point in the triangle. Note that all the parts $(x_1, x_2, x_3)$ are involved and appear in the diagram, so the representation is different from the one shown in Figure 2, where the set $S'$ is considered (and only two components are represented).
[Figure 3: ternary diagram with vertices labelled x1, x2 and x3.]
Fig. 3. Representation of a three-part composition, x = (x1, x2, x3), in the ternary diagram.
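The one-to-one correspondence between a three-part composition and a point of the triangle can be written out explicitly as a barycentric combination of the vertices. The sketch below is an assumption-laden illustration: the placement of the vertices (x1 at the top, x2 at the lower left, x3 at the lower right of an equilateral triangle with unit side) and the function name `ternary_xy` are choices made here, not taken from the figure.

```python
import math

# Assumed vertex positions of an equilateral triangle with unit side.
V1 = (0.5, math.sqrt(3) / 2)   # vertex of part x1 (top)
V2 = (0.0, 0.0)                # vertex of part x2 (lower left)
V3 = (1.0, 0.0)                # vertex of part x3 (lower right)


def ternary_xy(x):
    """Map a three-part composition to plane coordinates.

    The parts are first closed to unit sum, so data in parts per one,
    per hundred or ppm give the same point.
    """
    total = sum(x)
    x1, x2, x3 = (xi / total for xi in x)
    px = x1 * V1[0] + x2 * V2[0] + x3 * V3[0]
    py = x1 * V1[1] + x2 * V2[1] + x3 * V3[1]
    return (px, py)


inside = ternary_xy([20.0, 30.0, 50.0])   # no zero parts: interior point
on_side = ternary_xy([0.0, 40.0, 60.0])   # x1 = 0: side opposite vertex x1
at_vertex = ternary_xy([1.0, 0.0, 0.0])   # only x1 non-zero: vertex x1
```

With this layout a composition with x1 = 0 has zero height and therefore falls on the base of the triangle, the side opposite the x1 vertex, while a composition with a single non-zero part lands exactly on a vertex.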
Observe that the borderlines of the ternary diagram are associated with scales of variation of the different parts involved in the composition. In fact, when a ternary diagram is read, it is understood that every side of the triangle represents a regular scale between 0 and c, where e.g. c = 1 if observations are in parts per one, c = 100 if observations are in parts per hundred, or c = 1 000 000 if they are in ppm. A composition without zero parts is represented by a point inside the triangle (Fig. 3). A composition with exactly one component $x_i$ equal to zero is represented by a point on the side opposite to the vertex $x_i$. Each vertex represents a composition with exactly one component different from zero. Therefore, in this context the sides of the ternary diagram can be viewed as axes in a coordinate system instead of parts of the whole space. This representation has been used for a long time. Only recently has the approach presented in the previous section, using the Aitchison distance, turned out to be more suitable than the classical methods. This implies that only the set $S$ has to be considered, but the ternary diagram representation is still valid, as the sides of the triangle are reference axes. Furthermore, the points on the sides are at infinite distance $d_a$ from any point inside the triangle. Representing points at infinity is not a contradiction. For example, the stereographic projection is a well-known case where there is a correspondence between each point on the plane and a point on a sphere. In this case, the whole sphere is always represented, although the north pole represents the points at infinity.
Conclusions
The concepts of open simplex and closed simplex have been used commonly to refer to the sets $S$ and $R$ respectively. From a mathematical point of view these expressions are not correct: as subsets of the Euclidean space $\mathbb{R}^D$ only the terminology for the set $R$ is correct (it is a closed set), whereas $S$ is neither closed nor open. Nevertheless, the terminology is correct when it is applied to the sets $R'$ and $S'$, which are embedded in the (D - 1)-dimensional Euclidean space but, in this case, only D - 1 components are taken into account. Finally, if one considers the simplex as a Euclidean vector space with the Aitchison distance $d_a$, only the subset $S$ makes sense. Then, the whole space is open and closed at the same time and the terminology of open and closed simplex should be ruled out. With respect to their representation, the authors propose the use of the classical ternary diagram, where the sides of the triangle represent the points at infinity of the simplex space $(S, d_a)$.
This research has been supported by the Dirección General de Enseñanza Superior (DGES) of the Spanish Ministry for Education and Culture through the projects BFM2000-0540 and BFM2003-05640/MATE.
References
AITCHISON, J. 1982. The statistical analysis of compositional data (with discussion). Journal of the Royal Statistical Society, Series B (Statistical Methodology), 44 (2), 139-177.
AITCHISON, J. 1986. The Statistical Analysis of Compositional Data. Monographs on Statistics and Applied Probability. Chapman & Hall Ltd, London. Reprinted (2003) with additional material by The Blackburn Press, Caldwell, NJ.
BILLHEIMER, D., GUTTORP, P. & FAGAN, W. 2001. Statistical interpretation of species composition. Journal of the American Statistical Association, 96 (456), 1205-1214.
EGOZCUE, J. J. & PAWLOWSKY-GLAHN, V. 2006. Simplicial geometry for compositional data. In: BUCCIANTI, A., MATEU-FIGUERAS, G. & PAWLOWSKY-GLAHN, V. (eds) Compositional Data Analysis in the Geosciences: From Theory to Practice. Geological Society, London, Special Publications, 264, 145-159.
MARTÍN-FERNÁNDEZ, J. A., BARCELÓ-VIDAL, C. & PAWLOWSKY-GLAHN, V. 2003. Dealing with zeros and missing values in compositional data sets using nonparametric imputation. Mathematical Geology, 35 (3), 253-278.
PAWLOWSKY-GLAHN, V. & EGOZCUE, J. J. 2001. Geometric approach to statistical analysis on the simplex. Stochastic Environmental Research and Risk Assessment (SERRA), 15 (5), 384-398.
RUDIN, W. 1976. Principles of mathematical analysis (3rd edn). International Series in Pure and Applied Mathematics. McGraw-Hill Book Co, NY.
SCHECHTER, E. 1997. Handbook of Analysis and its Foundations. Academic Press, San Diego.
Index Note: Page numbers in italics refer to figures while those in bold refer to entries in tables.
accretion system 43 acid-base reactions 187 active margin 43, 50, 53 additive log-ratio 3-5, 8, 8, 82, 157-8 drawbacks 81 transformed 20, 79, 80-1, 87, 106 additive replacement 194-7, 196, 197 adsorption 187 Afrobolivina afra foraminifera 59 Aitchison composition 120-1 Aitchison distance 4, 5, 6-7, 119, 121, 122-3, 124, 130, 137, 150-1,153, 156-7, 192, 197, 205, 206 Aitchison geometry 7, 120, 121, 145, 161 Aitchison inner product 6-7 Aitchison norm 6-7, 151, 152 alkalis-total Fe-Mg (AFM) plot 19, 22 alteration processes 178, 179 alternative hypothesis 37 amino-acid analysis 60, 63 analysis of variance (ANOVA) 67, 70, 119 andesite 18 Appin Group 26, 35-6, 35 aquifers 178, 179, 187 aragonite 30 Argyll Group 26 Arsen'evskoe deposit, Russia 45, 47, 55 balances 4-5, 52, 52, 145, 154-7, 155, 158 Ballachulish Subgroup 33, 38 barplot 121 barycentre 7, 149, 162 basalt 18, 156, 166-72 alkaline 11, 13, 14-17, 101, 103, 106, 198, 198, 199, 200 bases 148-9 Basic Compositional Data Analysis functions from S + / R 130 Be/V ratio 52, 53 between-groups matrix 60-1, 62 Bhattacharyya distance 130 bimodality 188 binary logistic discriminant analysis 47-9, 50 biplot 4, 13, 14, 30, 32, 35, 43, 119, 123, 161, 170, 198 analysis 168 CoDaPack 103, 104, 114 coincident vertices 165 collinear vertices 165 construction of 163-4 cosines of angles 164 interpretation of 164-5 links and rays 164 major oxides 169 subcompositional analysis 164-5 Blair Atholl Dark Schist & Limestone Formation 35-6, 36 Blair Atholl Subgroup 28, 33, 37-8 box plot 70, 71, 119, 121,123, 187 Box-Muller algorithm 88
brachiopods 60, 61-4, 63 British Geological Survey 27 calc-alkaline series 11, 13, 14-17, 18, 19, 101, 103, 106, 198, 198, 199 calc-silicate phases 30 calcite 30, 178, 179, 187 carbon, isotopes 39 carbon dioxide 178, 181,187 carbonates 181 Carpatho-Pannonian Region 11, 12, 198 casserite 43-55, 49, 51 cation-exchange equilibria 178 Cenozoic vulcanites, Hungary 11-23, 12, 101, 191, 194, 198 centered data 7-8, 8, 15-17, 104, 111, 116, 130, 131,135, 161-3, 163, 166, 167, 169, 170, 199, 199 centered log-ratio 3-5, 8, 103, 112-15, 157-8, 167, 170 coefficients 45-7, 148, 152, 156, 157 covariance matrix 60, 64, 72, 162-3, 166 transformed 19-20, 106, 148, 162, 163,198, 199 chemo-stratigraphical study 61 chloride 178-9 classification 1, 13 probability 87-8 sediment 80, 82-4, 83 clay minerals 166-72, 178-9 clear replacement 196-7 closed data 1, 32, 79, 146 closed geometric mean 7 closed simplex 203, 205 closure constant 146, 147, 161 cluster analysis 43, 119, 122-3, 198 see also Ward cluster analysis CoDaPack 101-18, 129-30, 133 ALN confidence region 114, 116 ALN predictive region 114, 115 ALR plot 112, 114 amalgamation 111 analysis 107, 108, 117 atypicality indices 117, 117 biplot 103, 104, 114 centering 111 centre 116 CLR plot 112-14 CLR variance 114-15 descriptive statistics 102-3, 103, 107, 108, 114-17, 116, 117 graphs 103-7, 107, 108, 112-14, 113-16 logistic normality test 117 menu features 107-18 operations 102, 107, 110-12 perturbation 110, 111 power transformation 110, 112 preferences 107, 108, 117 principal components 114, 115 raw ALR 109, 110
raw CLR 109 raw ILR 110 real data example 101-7 rounded zero replacement 101-2, 102, 111-12 standardisation 111 subcomposition/closure 111 sum constraint 117, 118 summary 102-3, 103, 114-16, 116 ternary diagram 105-6, 112, 113 total variance 116 transformations 107-10, 108 unconstrain/bias 107-9, 109 variation array 114, 116-17 websites 101, 130 coefficients 3, 8-9, 45-7, 60, 148-9, 152, 155-7, 156, 157 collision margin 68 composition 11, 18 exploratory methodology 161-73 non-negativity 80 principal component analysis 2, 72, 74, 104-6, 106-7, 119, 165-6 software 101-18 three part 80-1, 82, 205, 206 see also subcomposition compositional fields 87, 162 compositional lines 6, 6, 9, 13, 104, 149-50, 149 principle axis 166, 166 compositions software package 119-27, 131 four classes 131, 143 websites 119 confidence region 81-2, 82, 87, 88, 93-6 additive logistic normal 114 constant ratios 136, 137 constant sum 2, 25, 38, 61, 67, 80, 94-5, 117, 118, 135, 145, 194 continental block margin 79, 84 coordinates see coefficients Coptothyris grayi, brachiopod 60 correlation matrix 1, 4, 25 covariance structure 1, 2, 3, 4, 25-6, 61, 79, 81, 97, 151, 162-3, 165, 170, 192, 193-4, 197 see also variance Dalradian limestone 25-39, 27, 29 FMC subcomposition 34-9 lithological characteristics 31 data 30, 32 3D visualisation 138, 142 detection limit 191-2 errors 3, 161 full-space data 59 glacial dataset 129, 130 missing data techniques 191-200 NMAR missing values 196 number of zeros 197 properties 1-2 random sampling 179 Researcher's daily activities 129, 130 residual 130 sampling error 176 visualisation 13, 18, 46, 162
see also centered data; closed data; open data; transformed data decay, rates of 150 deciles values 136 deformation 25, 67 degrees of freedom 87, 146 dendrogram 2, 124, 137, 138, 141, 142 depositional process 5 descriptive statistics 102-3, 103, 107, 108, 114-17, 116, 117, 161, 194 diagenetic fluid 30 Dickinson model 79-98 database 84-7, 86, 94 discriminatory power 80 as exploration tool 97-8 methods 87-8 spatial resolution 96 temporal resolution 96 differentiation 19 mixing operators 176 dilution 185, 187, 188 dimensional real space 204-5, 205, 206 dimensionality reduction 13 discriminant analysis 1, 2, 50, 61, 79-80, 125-6, 175, 198 discriminant coordinates 59-65 discriminant power 59, 61, 64, 65, 94 Dickinson model 80 dispersion see variance distance 4, 6, 145, 150-1, 155, 205 see also Aitchison distance; Bhattacharyya distance; Euclidean distance dolomite 28 Dufftown Limestone Formation 33-9
earthquake 67, 75 eigenvalue 163, 166 eigenvector 163-4, 166 EM algorithm 193,200 equilibrium constant 70-3 erosional process 5, 175 essential zero see structural zero Euclidean distance 5, 130, 157, 191, 192, 193, 194 Euclidean geometry 161 Euclidean metric 203 Euclidean space 3, 7, 145, 151, 192, 198, 203, 205, 206 Euclidean statistical methods, assumptions 191-2 factor analysis 2, 175 feldspar 170, 172 plagioclase 30, 167, 169 Festival'noe deposit, Russia 45, 47, 54 fixed percentage line 82, 84 fluid source 43, 44, 53 fluorine 187 fractal distribution 176 frequency distribution 19, 175-88 positively skewed 177 shape of 36, 175, 179, 181, 185, 185-6, 187 theoretical 176-8 uni & multivariate 175, 177-85, 186-8
fumarole field 67-75 inner and outer 68 fumarolic gases 67-75, 71, 178, 187 fluid circulation 74 spatial variation 67, 68, 70, 72-5 subcompositional analysis 70-3 temperature 68, 69-70, 69, 72-5 temporal variation 67, 68, 69, 70, 71, 72-5 gabbro 166-72 gas phase equilibria 178 Gaussian distribution 69, 70, 175-7, 180-3, 193 Gaussian model 179, 181 geochemistry, natural laws 175-88 geometric mean 131, 133, 134, 161-2, 197 glacial till 129 goodness of fit 179, 181, 186 grain size 28 Grampian Group 26 Grampian Orogeny 26 Hilbert space 151 histogram 18, 19, 183, 186 Hungarian Geological Survey 101 Hungary 11-23, 191, 194 hydromagmatic deposits 178, 183 immunoassay 60 imputation strategies 193-4 non-parametric 194-7 imputation value 196, 197 Inchrory Limestone Formation 33-9 inferential modelling 176 inferential success ratio 93-4 inner product 4, 145, 151, 152, 157 see also Aitchison inner product iron 11-12, 167 iso-density partitioning 87, 88-93, 94-6, 97 iso-density point 87, 88 isometric log-ratio 3-5, 8, 20, 81, 82, 157-8 coefficients 166 transformed 106, 125, 185-7 isotopes 11, 53, 54, 149-50 Italy 67-75, 161, 172, 175, 178, 184 soil dataset 166-7 Kavalerovo region, Russia 45, 47, 48, 54-5 Khingan-Okhotsk metallogenic belt, Russia 45 kin file 137 Kincraig and Ord Ban limestone 28, 35-6, 36, 37 kinemage 138 KiNG viewer 129, 137, 138, 142 Kinlochlaggan limestones 36, 37-8 Kinlochlaggan Syncline 28 Kolmogorov-Smirnov test 69, 70, 179 Komsomol'sk region, Russia 45, 47, 48, 53-4 Kullback-Leibler divergence 130 Laqueus rubellus, brachiopod 60 latent root 60-1, 63, 65 lattice of hypotheses 33, 34 Law of Mass Action 74-5
Law of Proportionate Effects 177 leaching 178, 187 Leny Limestone 26 Levene statistic 70 likelihood ratio test 181, 186 limestone 25, 27-30 linear dependence 148 linear discriminant analysis (LDA) 19-21, 51, 52, 106, 199 linear discriminant function (LDF) 19, 20, 198, 199-200, 199 see also misclassification rate linear regression 49, 198 lithophile elements 43, 45, 53 lithostratigraphy 25, 26-8, 29, 39 Lithuania, Silurian sediments 59-60, 61, 62 Lochaber Subgroup 38 log-contrasts 33, 70-3, 72, 73, 74-5 log-likelihood function 181 log-normal model 177-8, 179, 184 log-ratio 1, 2-9, 11, 13-18, 25, 26, 35, 38, 45-7, 67, 70, 81-4, 82, 101, 155, 157-8, 162, 176, 177, 178-87, 186-7, 192, 193, 198, 203 covariance matrix 60, 62, 63 linear discriminant analysis 21 space 83, 84, 87 time independent 70, 70, 71 transformation 59, 64, 131, 145, 176, 200 see also additive log-ratio; centered log-ratio; isometric log-ratio logskew-normal model 176, 177, 181, 184, 188 loss on ignition 12, 30, 34 MAGE viewer 129, 137 magma 11, 21, 167 chamber system 43-5, 44 differentiation 170 magmatic arc 79, 84, 94-5, 178 magnesium 179 Mahalanobis distances 186-7 Mann-Whitney U-test 25, 37-8, 37 marker horizon 28 Markov Chain Monte Carlo (MCMC) simulation algorithms 193, 200 MATLAB version software library 179 matrices see between groups matrix; correlation matrix; covariance matrix; variance matrix; within-groups matrix metamorphism 25, 53 metasomatic alteration 18 metric space 203, 205 Mg/Ca ratio 179 mica 30 mineral paragenesis 25, 45 misclassification rate 49, 50, 52, 88, 198, 199, 199 MixeR 129-43 Activity researchers dataset 138 data file format 129, 141 data input 131 glacial dataset 142 library 131 routines 131-2 websites 129, 131, 143
Moine Supergroup 28 monitoring tools 67, 75, 98 monograph 2 multi element variation see spider diagram multimodal model 184, 186, 188 multiplicative replacement 140-1, 194-7, 196, 197, 198, 200 multivariate analysis 81, 192, 198 distribution 188 incomplete data 192 Multivariate Analysis of Variance (manova) 125 redundant information 59 statistical modelling 185-7 multivariate normal model 186-7 multivariate skew-normal model 186-7, 188, 188 natural replacement 195-6 non-parametric techniques 36-7, 192-3, 194-200 see also parametric techniques norm 5, 145, 155, 157 see also Aitchison norm normal distribution see Gaussian distribution null hypothesis 37 null value 191, 192, 195 OCR software 84 olivine 170, 172 open data 25, 26 ophiolite 161, 166-72 orogen, recycled 79, 84 orthogonal system 4-5, 60, 151-7, 179 orthonormal basis 145, 149, 152-3, 153, 155-6, 157, 164 oxides hydrous 166, 168, 173 major 11-23, 101, 167, 167, 168, 169, 171, 191 oxyhydroxides 166, 168, 173 P-P plot 187, 188 Pagetides trilobite 26 palaeontology 59 Panarea, Italy 67, 75 parametric techniques 192-3, 200 see also non-parametric techniques percentiles 18, 82, 84, 131, 133 border lines 133, 134-6, 134, 135 Pereval'noe deposit, Russia 45, 47, 54 peridotite 167 permeability variations 178, 179 perturbation 5-6, 7, 110, 111, 123, 130, 145, 147-9, 151, 156, 157, 158, 162, 163, 194, 197, 205 pH conditions 167-8, 173 Pictathyris picta, brachiopod 60 pie-chart 121 Pitlurg limestones 35, 36, 37-8 plagioclase see feldspar Port Arisaig Formation (PAF) 26 positive vectors 146 potassium 178 power transformation 5-6, 110, 112, 123-4, 130, 145, 147-9, 151, 157, 158, 197, 205 power-perturbation equation 8-9
precipitation reactions 175, 178 predictive distribution 81, 82, 87, 88, 89-96 additive logistic normal 114 principal component analysis 2, 72, 74, 104-6, 106-7, 119, 165-6 probability 33, 36, 87-8, 125, 175 classification 87-8 density 2, 87, 88, 157, 176 plot 69, 72, 73, 179, 180-3 ternary sandstone 97 provenance association 79, 87, 88, 89-96, 97 provenance field 79, 84, 85, 87, 88, 94 pyroxene 167, 170, 172 Q-Q plot 187, 188 QFL diagram 84, 85, 88, 89, 93 QmFLt diagram 84, 85, 88, 90 QmPK diagram 88, 91, 95 QpLvLs diagram 88, 92, 96 Quadrays formulas 137 Quartz 30, 84-96 Quick Basic 130 R statistical package 119-27, 129-143 Aitchison geometry 120 classification routines 143 closure constant 121 composition visualization 129-43 compositional computation 123-4 compositional data analysis 120-2 data command 120 data subsets 125 defining a vector 132 discrimination analysis 125-6 download and installation 120, 126 Fahrmeir package 131 GNU S 130 grouped data 124-6 help window 120, 121 importing data 126 library command 120 mixture procedures 131-2 Mixtures with see MixeR multiple parts 122, 123, 124 Multivariate Analysis of Variance (manova) 125 percentile lines 133, 134-6, 134 plot command 121 principal component analysis 121-2 programming interface 126 ratio lines 136-7 specifying components 122 starting up 119 statistical sub-routines 126 subcomposition 132-3 technical help 126-7 ternary diagram 121, 123, 133-4 tetrahedron 137-9 variance 121-2, 121 version number 120 websites 119, 126 zero replacement value 120, 139-41 rare earth element diagram 32 ratio lines 131, 136-7
redox reactions 167-8, 187 replacement formulae 194-7, 195-6 replacement strategy 193-4, 205 see also additive replacement; clear replacement; multiplicative replacement; natural replacement; simple replacement; zero replacement value representations see transformed data rhyolite 18 ridge-type constant 59, 60-1 rounded zero 101-2, 102, 111-12, 191-200 Russia 43-55, 44 sample space 175, 176-7, 203 sandstone 79-98, 80, 95-8, 97 scatter plot 52 Scotland, Dalradian limestone 25-39 scrubbing 73, 74, 75 sediment 1, 59, 79 classification 4, 80, 80, 82-4, 83 sedimentary basin 79 seismic data 67 sequential binary partition 152-3, 153, 154, 154, 155-6, 155, 156 order of partition 155, 155 serpentinite 166-72 set boundary and frontier 203-4 interior 204 open and closed 203-4, 205 set R 203-6, 204, 205, 205, 206 set S 203-6, 204, 205, 205, 206 shear zone 28 shrinkage 59, 60-1, 62, 63, 64-5 siderophile elements 43, 45, 53 Sikhote 'Alin, Russia 43-55, 45 silicates 18, 34, 166, 181 Silurian sediments, Lithuania 59-60, 61, 62 simple replacement 139-40, 194-7, 196, 197 simplex 26, 32, 38, 68-9, 107, 119, 162, 175, 176, 179, 203-6 Aitchison 123 closed 203-5 D-part 3, 6, 145-58 open 203, 205 as subset 204-5 vector space 147-50 skewness index, estimated 185-6 skew-normal model 177, 181, 183-5, 188 slab-window zone 43 sodium 178, 187 software tools 129-31 soil 166-72 chemistry and land use 166 dataset 166-7 interstitial water 172 solution-precipitation processes 175, 187 Southern Highland Group 26 spatial averaging 96 spectral analysis, quantitative 45 spider diagram 28, 30, 34 standard correlation analysis 175 standard deviation see variance Stromboli, Italy 67, 75
strontium 39 structural zero 191-2 Student t-test 69 subcomposition 15-17, 21, 147, 153, 154, 165, 194, 197 coherence 2, 25-6 covariance 2 discrimination 25, 32-4, 34 fumarolic gases 70-3 projection of 153, 153, 154 ternary sandstone 85 three part 13, 129-43, 134-7, 161, 169, 169 subduction margin 43, 54, 178 substitution constant 102 t-test see Student t-test tectogenetic switch hypothesis 43-6, 54 tectonics 11, 25, 45, 79 Terebratalia coreanica, brachiopod 60 ternary diagram 6, 8, 15-17, 104, 119, 129, 130-1, 146, 146, 149, 161, 162, 169, 199, 205-6, 206 CoDaPack 105-6, 112, 113 Dickinson model 79 statistical analysis 80-1, 82 visualisation in R 133-4, 135, 136 tetrahedral diagram 129, 131, 141, 142, 146 visualisation in R 137-9 Theory of Successive Random Dilutions 177 thin section point count 79, 85 tholeiitic series 19 tin deposits 43-55, 50 Tindari-Letojanni lithospheric fault 68 Torulian Limestone 28, 35, 36, 38 total alkali-silica (TAS) plot 18, 21 total variance 161-3, 167, 170, 197 tourmaline 53 trace elements 11, 30, 45, 169-70, 170, 171, 171, 173 see also cassiterite transform margin 43, 50, 53, 54 transformed data 19, 20, 25, 30, 38, 45-7, 81, 107-19, 108, 119, 157 Tyrrhenian Sea 67, 68, 178 U-Pb zircon ages 26 univariate normal model 177-85 univariate skew-normal model 176, 177-85 variance 59-65, 70, 103-4, 164, 165, 167, 168, 170, 176, 177, 192, 198 matrix 161-3, 167, 168, 171, 197 see also covariance structure; total variance vectors 63, 205 collinear 168 column markers 164 row markers 164 stability analysis 59-65 Visual Basic routines 101 volcanic arc see magmatic arc volcanoes, space-time monitoring 179 Vulcano Island, Italy 67-75, 175, 178, 184 Vysokogorskoe deposit, Russia 45, 47, 54-5
Wald-Wolfowitz Runs Test 69 Ward cluster analysis 46-9, 50, 50, 51 water chemistry 39, 175-88, 180, 182, 183, 184, 185 weathering 166, 170, 172, 175, 178, 181, 183, 187 weight percentage 3, 129 weighted method 87 within-groups matrix 60-1, 62
X-Ray Fluorescence 30 zero component 3, 84, 161, 191-200, 203, 205 see also rounded zero; structural zero zero replacement value 86 in R 139-41 rounded 101-2, 102, 111-12 zircon, U-Pb ages 26
Compositional Data Analysis in the Geosciences From Theory to Practice Edited by
A. Buccianti, G. Mateu-Figueras and V. Pawlowsky-Glahn
Since Karl Pearson wrote his paper on spurious correlation in 1897, a lot has been said about the statistical analysis of compositional data, mainly by geologists such as Felix Chayes. The solution appeared in the 1980s, when John Aitchison proposed to use logratios. Since then, the approach has seen a great expansion, mainly building on the idea of the 'natural geometry' of the sample space. Statistics is expected to give sense to our perception of the natural scale of the data, and this is made possible for compositional data using logratios. This publication will be a milestone in this process. This book will be of interest to geologists using statistical methods. It offers an intuitive justification of the methodology, convincing case studies and user-friendly software, and it includes a section for those who need to see the proof of the mathematical consistency of the methods used.
Visit our online bookshop: http://www.geolsoc.org.uk/bookshop Geological Society web site: http://www.geolsoc.org.uk
ISBN -86239-205-6
Cover illustration: Volcan Licancabur (22°50'S 67°50'W, 5900 m) is a stratovolcano which lies on the border of Chile and Bolivia (the peak proper being located in Chile), 30 km east of the village of San Pedro de Atacama. The 70 x 90 m crater lake at the summit is believed to be the highest lake in the world, and despite air temperatures of -30°C it contains numerous living creatures. Photograph courtesy of Prof. Piermaria Luigi Rossi, University of Bologna (I). The equation represents the set of the d-dimensional simplex embedded in D-dimensional real space. When D = 3 the simplex is represented by a triangle. (Aitchison, 1986, The Statistical Analysis of Compositional Data, Chapman & Hall).