Chemometrics in Spectroscopy

Howard Mark
Mark Electronics, Suffern, New York, USA

Jerry Workman Jr.
Thermo Fisher Scientific Inc., Molecular Spectroscopy & Microanalysis, Madison, WI, USA

Amsterdam • Boston • Heidelberg • London • New York • Oxford

Paris • San Diego • San Francisco • Singapore • Sydney • Tokyo

Academic Press is an imprint of Elsevier
84 Theobald’s Road, London WC1X 8RR, UK
Radarweg 29, PO Box 211, 1000 AE Amsterdam, The Netherlands
Linacre House, Jordan Hill, Oxford OX2 8DP, UK
30 Corporate Drive, Suite 400, Burlington, MA 01803, USA
525 B Street, Suite 1900, San Diego, CA 92101-4495, USA

First edition 2007

Copyright © 2007 Elsevier Inc. All rights reserved

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means electronic, mechanical, photocopying, recording or otherwise without the prior written permission of the publisher.

Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford, UK: phone (+44) (0) 1865 843830; fax (+44) (0) 1865 853333; email: [email protected]. Alternatively you can submit your request online by visiting the Elsevier web site at http://elsevier.com/locate/permissions, and selecting Obtaining permission to use Elsevier material.

Notice
No responsibility is assumed by the publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made.

ISBN: 978-0-12-374024-3

For information on all Academic Press publications visit our website at books.elsevier.com

Printed and bound in USA 07 08 09 10 11 10 9 8 7 6 5 4 3 2 1

Working together to grow libraries in developing countries www.elsevier.com | www.bookaid.org | www.sabre.org

Dedication

To our families and to our readers ...
– Howard Mark and Jerry Workman


Contents

Preface
Note to Readers
1. A New Beginning ...
2. Elementary Matrix Algebra: Part 1
3. Elementary Matrix Algebra: Part 2
4. Matrix Algebra and Multiple Linear Regression: Part 1
5. Matrix Algebra and Multiple Linear Regression: Part 2
6. Matrix Algebra and Multiple Linear Regression: Part 3 – The Concept of Determinants
7. Matrix Algebra and Multiple Linear Regression: Part 4 – Concluding Remarks
8. Experimental Designs: Part 1
9. Experimental Designs: Part 2
10. Experimental Designs: Part 3
11. Analytic Geometry: Part 1 – The Basics in Two and Three Dimensions
12. Analytic Geometry: Part 2 – Geometric Representation of Vectors and Algebraic Operations
13. Analytic Geometry: Part 3 – Reducing Dimensionality
14. Analytic Geometry: Part 4 – The Geometry of Vectors and Matrices
15. Experimental Designs: Part 4 – Varying Parameters to Expand the Design
16. Experimental Designs: Part 5 – One-at-a-time Designs
17. Experimental Designs: Part 6 – Sequential Designs
18. Experimental Designs: Part 7 – β, the Power of a Test
19. Experimental Designs: Part 8 – β, the Power of a Test (Continued)
20. Experimental Designs: Part 9 – Sequential Designs Concluded
21. Calculating the Solution for Regression Techniques: Part 1 – Multivariate Regression Made Simple
22. Calculating the Solution for Regression Techniques: Part 2 – Principal Component(s) Regression Made Simple
23. Calculating the Solution for Regression Techniques: Part 3 – Partial Least Squares Regression Made Simple
24. Looking Behind and Ahead: Interlude
25. A Simple Question: The Meaning of Chemometrics Pondered
26. Calculating the Solution for Regression Techniques: Part 4 – Singular Value Decomposition
27. Linearity in Calibration
28. Challenges: Unsolved Problems in Chemometrics
29. Linearity in Calibration: Act II Scene I
30. Linearity in Calibration: Act II Scene II – Reader’s Comments ...
31. Linearity in Calibration: Act II Scene III
32. Linearity in Calibration: Act II Scene IV
33. Linearity in Calibration: Act II Scene V
34. Collaborative Laboratory Studies: Part 1 – A Blueprint
35. Collaborative Laboratory Studies: Part 2 – using ANOVA
36. Collaborative Laboratory Studies: Part 3 – Testing for Systematic Error
37. Collaborative Laboratory Studies: Part 4 – Ranking Test
38. Collaborative Laboratory Studies: Part 5 – Efficient Comparison of Two Methods
39. Collaborative Laboratory Studies: Part 6 – MathCad Worksheet Text
40. Is Noise Brought by the Stork? Analysis of Noise: Part 1
41. Analysis of Noise: Part 2
42. Analysis of Noise: Part 3
43. Analysis of Noise: Part 4
44. Analysis of Noise: Part 5
45. Analysis of Noise: Part 6
46. Analysis of Noise: Part 7
47. Analysis of Noise: Part 8
48. Analysis of Noise: Part 9
49. Analysis of Noise: Part 10
50. Analysis of Noise: Part 11
51. Analysis of Noise: Part 12
52. Analysis of Noise: Part 13
53. Analysis of Noise: Part 14
54. Derivatives in Spectroscopy: Part 1 – The Behavior of the Derivative
55. Derivatives in Spectroscopy: Part 2 – The “True” Derivative
56. Derivatives in Spectroscopy: Part 3 – Computing the Derivative
57. Derivatives in Spectroscopy: Part 4 – Calibrating with Derivatives
58. Comparison of Goodness of Fit Statistics for Linear Regression: Part 1 – Introduction
59. Comparison of Goodness of Fit Statistics for Linear Regression: Part 2 – The Correlation Coefficient
60. Comparison of Goodness of Fit Statistics for Linear Regression: Part 3 – Computing Confidence Limits for the Correlation Coefficient
61. Comparison of Goodness of Fit Statistics for Linear Regression: Part 4 – Confidence Limits for Slope and Intercept
62. Correction and Discussion Regarding Derivatives
63. Linearity in Calibration: Act III Scene I – Importance of Nonlinearity
64. Linearity in Calibration: Act III Scene II – A Discussion of the Durbin-Watson Statistic, a Step in the Right Direction
65. Linearity in Calibration: Act III Scene III – Other Tests for Nonlinearity
66. Linearity in Calibration: Act III Scene IV – How to Test for Nonlinearity
67. Linearity in Calibration: Act III Scene V – Quantifying Nonlinearity
68. Linearity in Calibration: Act III Scene VI – Quantifying Nonlinearity, Part II, and a News Flash
69. Connecting Chemometrics to Statistics: Part 1 – The Chemometrics Side
70. Connecting Chemometrics to Statistics: Part 2 – The Statistics Side
71. Limitations in Analytical Accuracy: Part 1 – Horwitz’s Trumpet
72. Limitations in Analytical Accuracy: Part 2 – Theories to Describe the Limits in Analytical Accuracy
73. Limitations in Analytical Accuracy: Part 3 – Comparing Test Results for Analytical Uncertainty
74. The Statistics of Spectral Searches
75. The Chemometrics of Imaging Spectroscopy
Glossary of Terms
Index
Colour Plate Section


Preface

This large single volume fulfils the need for chemometric-based tutorials on topics of interest to analytical chemists and other scientists who apply modern mathematical and statistical operations to analytical measurements. The book covers a very broad range of chemometric topics, as indicated in the extensive table of contents. It is a collection of the series of columns first published in Spectroscopy, providing detailed mathematical and philosophical discussions on the use of chemometrics and statistical methods for scientific measurements and analytical methods. In addition, the new revolution in biotechnology and the use of spectroscopic techniques therein provide an opportunity for those scientists to strengthen their use of mathematics and calibration through the use of this book.

Subjects covered include those of interest to many groups of scientists, mathematicians, and practicing analysts for daily problem solving, as well as detailed insights into subjects that are difficult for the non-specialist to grasp thoroughly. The coverage relies more on concept delineation than on rigorous mathematics, but the descriptive mathematics and derivations are included for the more rigorously minded. The book includes sections on matrix algebra, analytic geometry, experimental design, instrument and system calibration, noise, derivatives and their use in data analysis, and linearity and nonlinearity. Collaborative laboratory studies, using ANOVA, testing for systematic error, ranking tests for collaborative studies, and efficient comparison of two analytical methods are included. Also included are discussions of the limitations in analytical accuracy and brief introductions to the statistics of spectral searches and the chemometrics of imaging spectroscopy.

The popularity of the Chemometrics in Spectroscopy series (ongoing since the early 1990s) as well as the Statistics in Spectroscopy series and books has been overwhelming, and we sincerely thank our readership over the years. We have received e-mails from many people; one memorable one thanked us for a career change, made largely because of the renewed and stimulated interest in statistics and chemometrics that our thought-provoking columns provided. We hope you find this collection useful and will continue to read the columns and write to us with your thoughts, comments, and questions regarding this stimulating topic.

Howard Mark
Suffern, NY

Jerry Workman
Madison, WI


Note to Readers

In some cases there were errors, both trivial and significant, in the original column from which a given chapter was taken. Sometimes we found the error ourselves (unfortunately after the column was printed) and sometimes, more embarrassingly, the error was brought to our attention by one of our ever-vigilant readers. For all significant errors, the necessary corrections were made in a subsequent column; in all cases, the corrected version is what is in this book. Sometimes, for the more serious errors, we note that the corresponding column was erroneous, so that any reader who wants to go back to the original will be aware that a comparison with what is presented here will fail.


1

A New Beginning ...

Why do we title this chapter “A New Beginning ...”? Well, there are a lot of reasons. First of all, of course, is the simple fact that that is just the way we do things. Secondly, is the fact that we developed this book in much the same way we developed our previous book Statistics in Spectroscopy (this is too long to write out each time, so from here on we will abbreviate it SiS). Those of you out there who have followed the series of articles published in Spectroscopy magazine since 1986 know that, for the most part, each column in the series was pretty much self-contained and could stand alone, yet also fit into that series in the appropriate place and contributed to the flow of information in that series as a whole. We hope to be able to reproduce that on a larger scale. Just as SiS was self-contained and stood alone, so too will we try to make this new series stand alone, and at the same time be a worthy successor to SiS, and also continue to develop the concepts we began there.

Thirdly is the fact that we are finally starting to write again. To you, our readership, it may seem like we have been writing continuously since we began SiS, but in fact we have been running on backlog for a longer time than you would believe. That was advantageous in that it allowed us time to pursue our personal and professional lives, including such other projects as arranging for SiS to be published as a book [1]. The downside of our getting ahead of ourselves, on the other hand, is that we were not able to keep you abreast of the latest developments related to our favorite topic. However, since the last time we actually wrote something, there have been a number of noteworthy developments. Our last series dealt only with the elementary concepts of statistics related to the general practice of calibration used for UV-VIS-NIR and occasionally for IR spectroscopy.
Our purpose in writing SiS was to help provide a small footbridge across the gap between the specialized chemometrics literature written at the expert level and those general statistics articles and texts dealing with examples and questions far removed from chemistry or spectroscopic practice. Since the beginning of the “Statistics” series in 1986, several reviews, tutorials, and textbooks have been published to begin the construction of a major highway bridging this gap. Most notable, at least in our minds, have been the tutorial articles on classical least squares (CLS), principal components regression (PCR), and partial least squares regression (PLSR) by Haaland and Thomas [2, 3]. Other important work includes textbooks on calibration and chemometrics by Naes and Martens [4], and Mark [5]. Chemometric reviews discussing the progress of tutorial and textbook literature appear regularly in the Critical Review issues of Analytical Chemistry. Another recent series of articles on chemometric concepts, termed “The Chemometric Space”, by Naes and Isaksson has appeared [6]. In addition, there is a North American chapter of the International Chemometrics Society (NAmICS), which we are told has over 300 members. Those interested in joining or obtaining further information may contact Professor Thomas O’Haver at the Department of Chemistry and Biochemistry, University of Maryland, College Park, Maryland 20742 (Donald B. Dahlberg, 1993, personal communication).

All the foregoing was true as of when the Chemometrics column began in 1993. Now in 2006, when we are preparing this for book publication, there are many more sources of information about Chemometrics. However, since this is not a review of the field, we forbear to list them all, but will correct one item that has changed since then: to obtain information about NAmICS, or to join the discussion group, contact David Duewer at NIST ([email protected]) or send a message to the discussion group ([email protected]).

Finally, since imitation is the sincerest form of flattery (or so they tell us), we are pleased to see that others have also taken the route of printing longer tutorial discussions in the form of a series of related articles on a given topic. Two series that we have no qualms recommending, on topics related to ours, have appeared in some of the sister publications of Spectroscopy [7–15]. (Note: there have been recent indications that the series in Spectroscopy International has continued beyond the ones we have listed. If we can obtain more information we will keep you posted; Spectroscopy International has also undergone some transformations and it is not always easy to get copies.)

So, overall, the chemometrics bridge between the lands of the overly simplistic and the severely complex is well under construction; one may find at least a single lane open by which to pass. So why another series? Well, it is still our labor of love to deal with specific issues that plague ourselves and our colleagues involved in the practice of multivariate qualitative and quantitative spectroscopic calibration.
Having collectively worked with hundreds of instrument users over 25 combined years of calibration problems, we are compelled, like bees loaded with pollen, to disseminate the problems, answers, and questions brought about by these experiences. Then what would a series named “Chemometrics in Spectroscopy” hope to cover which is of interest to the readers of “Spectroscopy”?

We have been taken to task (with perhaps some justice) for using the broader title label “Chemometrics in Spectroscopy” for what we have claimed will be discussions of the somewhat narrower range of topics included in the field of multivariate statistical algorithms applied to chemical problems, when the term “Chemometrics” actually applies to a much wider range of topics. Nevertheless, we will use this title, for a number of reasons. First, that is what we said we were going to do, and we hate to not follow through, even on such a minor point. Secondly, we have said before (with all due arrogance) that this is our column, and we have been pretty fortunate that the editors of Spectroscopy have always pretty much let us do as we please. Finally, at this point we consider the possibility that we may very well eventually extend our range to include some of these other topics that the broader term will cover.

As of right now, some of the topics we foresee being able to expand upon over the series will include, but not be limited to

• The multivariate normal distribution
• Defining the bounds for a data set
• The concept of Mahalanobis distance
• Discriminant analysis and its subtopics of
  – Sample selection
  – Spectral matching (Qualitative analysis)
• Finding the maximum variance in the multivariate distribution
• Matrix algebra refresher
• Analytic geometry refresher
• Principal components analysis (PCA)
• Principal components regression (PCR)
• More on Multiple linear least squares regression (MLLSR), also known as Multiple linear regression (MLR) and P-matrix, and its sibling, K-matrix
• More on Simple linear least squares regression (SLLSR), also known as Simple least squares regression (SLSR) or univariate least squares regression
• Partial least squares regression (PLSR)
• Validation of calibration models
• Laboratory data and assessing error
• Diagnosis of data problems
• An attempt to standardize statistical/chemometric terms
• Special calibration problems (and solutions)
• The concept of outliers: theory and practice
• Standardization concepts and methods for transfer of calibrations
• Collaborative study problems related to methods and instruments.

We also plan to include in the discussions the important statistical concepts, such as correlation, bias, slope, and associated errors and confidence limits. Beyond this, it is also our hope that readers will write to us with their comments or suggestions for chemometric challenges which confront them. If time and energy permit, we may be able to discuss such issues as neural networks, general factor analysis, clustering techniques, maximizing graphical presentation of data, and signal processing.

THE MULTIVARIATE NORMAL DISTRIBUTION

We will begin with the concept of the multivariate normal distribution. Think of a cigar, suspended in space. If you cannot think of a cigar suspended in space, look at Figure 1-1a. Now imagine the cigar filled with little flecks of stuff, as in Figure 1-1b (it does not really matter what the stuff is; mathematics never concerned itself with such unimportant details). Imagine the flecks being more densely packed toward the middle of the cigar. Now imagine a swarm of gnats surrounding the cigar; if they are attracted to the cigar, then naturally there will be fewer of them far away from the cigar than close to it (Figure 1-1c). Next take away the cigar, and just leave the flecks and the gnats. By this time, of course, you should realize that the flecks and the gnats are really the same thing, and are neither flecks nor gnats but simply abstract representations of points in space. What is left looks like Figure 1-1d.

Figure 1-1 (panels a–d) Development of the concept of the Multivariate Normal Distribution (this one shown having three dimensions) – see text for details. The density of points along a cross-section of the distribution in any direction is also an MND, of lower dimension.

Figure 1-1d, of course, is simply a pictorial/graphical representation of what a Multivariate Normal Distribution (MND) would look like, if you could see it. Furthermore, it is a representation of only one particular MND. First of all, this particular MND is a three-dimensional MND. A two-dimensional MND will be represented by points in a plane, and a one-dimensional MND is simply the ordinary Normal distribution that we have come to know and love [16]. An MND can have any number of dimensions; unfortunately we humans cannot visualize anything with more than three dimensions, so for our examples we are limited to such pictures. Also, the MND depicted has a particular shape and orientation. In general, an MND can have a variety of shapes and orientations, depending upon the dispersion of the data along the different axes. Thus, for example, it would not be uncommon for the dispersion along two of the axes to be equal and independent. In this case, which represents one limiting situation, an appropriate cross-section of the MND would be circular rather than elliptical. Another limiting situation, by the way, is for two or more of the variables to be perfectly correlated, in which case the data would lie along a straight line (or plane, or hyperplane, as the corresponding higher-dimensional figure is called).

Each point in the MND can be projected onto the planes defined by each pair of the axes of the coordinate system. For example, Figure 1-2 shows the projection of the data onto the plane at the “bottom” of the coordinate system. There it forms a two-dimensional MND, which is characterized by several parameters. The two-dimensional MND is the prototype for all MNDs of higher dimension, and its properties are the key defining characteristics of any MND.

Figure 1-2 Projecting each point of the three-dimensional MND onto any of the planes defined by two axes of the coordinate system (or, more generally, any plane passing through the coordinate system) results in the projected points being represented by a two-dimensional MND. The correlation coefficients for the projections in all planes are needed to fully describe the original MND.

First of all, the data contributing to an MND itself has a Normal distribution along any of the axes of the MND. We have discussed the Normal distribution previously [16], and have seen that it is described by the expression:

$$ f(x) = a\,e^{-\left(\frac{x-\bar{x}}{\sigma}\right)^{2}} \qquad \text{(1-1)} $$

The MND can be mathematically described by an expression that is similar in form, but has the characteristic that each of the individual parts of the expression represents the multivariate analog of the corresponding part of equation 1-1. Thus, for example, where $\bar{x}$ represents the mean of the data whose distribution equation 1-1 describes, there is a corresponding quantity $\bar{\mathbf{X}}$ that expresses, in matrix notation, the fact that each datum has a value along each of the axes shown in Figure 1-1, and therefore the collection of data has a mean value along each dimension. This quantity, a list of the means along all the different dimensions, is called a vector, and is written $\bar{\mathbf{X}}$ (as opposed to $\bar{x}$, an individual mean).

If we project the MND onto each axis of the coordinate system containing the MND, then, as stated above, these projections of the data will be distributed as an ordinary Normal distribution, as shown in Figure 1-3. This distribution will itself then have a standard deviation, so that another defining characteristic of the MND is the standard deviation of the projection of the MND along each axis. This must also then be represented by a vector.

Figure 1-3 Projecting the points onto a line results in a point density that is our familiar Normal Distribution.
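The mean vector and the vector of standard deviations described above can be illustrated numerically. The following is only a minimal sketch, assuming NumPy is available; the parameter values (`true_mean`, `true_cov`) and the function names are made up for illustration and do not come from the text.

```python
import numpy as np

# Hypothetical parameters for a three-dimensional MND (illustrative values only).
true_mean = np.array([1.0, -2.0, 0.5])
true_cov = np.array([
    [4.0, 1.2, 0.0],
    [1.2, 1.0, 0.3],
    [0.0, 0.3, 0.25],
])


def sample_mnd(n_points, seed=0):
    """Draw n_points from the MND above; rows are points, columns are axes."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(true_mean, true_cov, size=n_points)


def axis_summaries(points):
    """Return the mean vector and the standard-deviation vector (one entry per axis)."""
    mean_vector = points.mean(axis=0)        # projection onto each axis, then average
    std_vector = points.std(axis=0, ddof=1)  # spread of each projection
    return mean_vector, std_vector


points = sample_mnd(20000)
mean_vector, std_vector = axis_summaries(points)
print("mean vector:", mean_vector)
print("std vector: ", std_vector)
```

Each column of `points` is the projection of the cloud onto one axis, so a histogram of any single column should approximate the ordinary Normal distribution of Figure 1-3.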


The final key point to note about the MND, which can also be seen from Figure 1-2, is the fact that when the MND is projected onto the plane defined by any two axes of the coordinate system, the data may show some correlation (as does the data in Figure 1-2). In fact, the projection onto any of the planes defined by two of the axes will have some value for the correlation coefficient between the corresponding pair of variables. The amount of correlation between projections along any pair of axes can vary from zero, in which case the data would lie in a circular blob, to unity, in which case the data would all lie exactly on a straight line.

Since each pair of axes defines another plane, many such projections may be possible, depending on the number of dimensions in which the MND exists. Indeed, every possible pair of axes in the coordinate system defines such a plane. As we have noted, we mere mortals cannot visualize more than three dimensions, and so our examples and diagrams will be limited to showing data in three or fewer dimensions, but the mathematical descriptions can be extended, with all generality, to as high a dimensionality as might be needed.

Thus, the full description of the MND must include all the correlations of the data between every pair of axes. This is conventionally done by creating what is known as the correlation matrix. This matrix is a square matrix, in which any given row or column corresponds to a variable, and the individual positions (i.e., the m, n position, for example, where m and n represent indices of the variables) in the matrix represent the correlation between the variable represented by the row it lies in and the variable represented by the column it lies in. In actuality, for mathematical reasons, the correlation itself is not used; rather, the related quantity, the covariance, replaces the correlation coefficient in the matrix. The elements of the matrix that lie along what is called the main diagonal (i.e., where the column and row numbers are the same) are then the variances (the square of the standard deviation – this shows that there is a rather close relationship between the standard deviation and the correlation) of the data. This matrix is thus called the variance-covariance matrix, and sometimes just the covariance matrix for simplicity.

Since it is necessary to represent the various quantities by vectors and matrices, the operations for the MND that correspond to operations using the univariate (simple) Normal distribution must be matrix operations. Discussion of matrix operations is beyond the scope of this column, but for now it suffices to note that the simple arithmetic operations of addition, subtraction, multiplication, and division all have their matrix counterparts. In addition, certain matrix operations exist which do not have counterparts in simple arithmetic. The beauty of the scheme is that many manipulations of data using matrix operations can be done using the same formalism as for simple arithmetic, since when they are expressed in matrix notation, they follow corresponding rules. However, there is one major exception to this: the commutative rule, whereby for simple arithmetic

A (operation) B = B (operation) A

e.g.:

A + B = B + A
A × B = B × A

does not, in general, hold true for matrix multiplication:

A × B ≠ B × A


That is because of the way matrix multiplication is defined. Thus, for this case the order of appearance of the two matrices to be multiplied may provide different matrices as the answer. Thus, instead of f(x) and the expression for it in equation 1-1 describing the simple Normal distribution, the MND is described by the corresponding multivariate expression (1-2):

$$ f(\mathbf{X}) = K\,e^{-(\mathbf{X}-\bar{\mathbf{X}})^{T}\mathbf{A}\,(\mathbf{X}-\bar{\mathbf{X}})} \qquad \text{(1-2)} $$

where now the capital letter X represents a vector, K is a normalizing constant, and the capital letter A represents a matrix derived from the covariance matrix (in the standard form of the MND it is the inverse of the covariance matrix, apart from a constant factor). This is, by the way, a somewhat straightforward extension of the definition (although it may not seem so at first glance), because for the simple univariate case the matrix A degenerates into a single number (the number 1 when x is expressed in units of its standard deviation), X becomes x, and thus the exponent becomes simply the square of $x - \bar{x}$.

Most texts dealing with multivariate statistics have a section on the MND, but a particularly good one, if a bit heavy on the math, is the discussion by Anderson [17]. To help with this a bit, our next few chapters will include a review of some of the elementary concepts of matrix algebra.

Another very useful series of chemometric-related articles has been written by David Coleman and Lynn Vanatta. Their series is on the subject of regression analysis. It has appeared in American Laboratory in a set of over twenty-five articles. Copies of the back articles are available on the web at the URL address found in reference [18].
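To make the description of the variance-covariance matrix, the exponent of equation 1-2, and the failure of the commutative rule more concrete, here is a minimal sketch assuming NumPy. The data values and the names `covariance_and_correlation` and `quadratic_form` are hypothetical, chosen only for illustration.

```python
import numpy as np


def covariance_and_correlation(points):
    """points: rows are observations, columns are variables (axes)."""
    cov = np.cov(points, rowvar=False)   # variance-covariance matrix
    std = np.sqrt(np.diag(cov))          # main diagonal holds the variances
    corr = cov / np.outer(std, std)      # rescale covariances to correlations
    return cov, corr


def quadratic_form(x, mean_vector, a_matrix):
    """Exponent of equation 1-2 without the minus sign: (X - Xbar)^T A (X - Xbar)."""
    d = np.asarray(x, dtype=float) - np.asarray(mean_vector, dtype=float)
    return float(d @ a_matrix @ d)


# Small made-up data set: 5 observations of 2 variables.
data = np.array([[1.0, 2.0], [2.0, 2.9], [3.0, 4.2], [4.0, 4.8], [5.0, 6.1]])
cov, corr = covariance_and_correlation(data)
print("covariance matrix:\n", cov)
print("correlation matrix:\n", corr)

# Order matters for matrix multiplication, but not for matrix addition.
P = np.array([[1, 2], [3, 4]])
Q = np.array([[0, 1], [1, 0]])
print("P x Q:\n", P @ Q)
print("Q x P:\n", Q @ P)
print("P + Q equals Q + P:", np.array_equal(P + Q, Q + P))
```

Dividing each covariance by the product of the two corresponding standard deviations recovers the correlation coefficient, which is the close relationship between the two matrices that the text alludes to.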

REFERENCES

1. Mark, H. and Workman, J., Statistics in Spectroscopy (Academic Press, Boston, 1991).
2. Haaland, D. and Thomas, E., Analytical Chemistry 60, 1193–1202 (1988).
3. Haaland, D. and Thomas, E., Analytical Chemistry 60, 1202–1208 (1988).
4. Naes, T. and Martens, H., Multivariate Calibration (John Wiley & Sons, New York, 1989).
5. Mark, H., Principles and Practice of Spectroscopic Calibration (John Wiley & Sons, New York, 1991).
6. Naes, T. and Isaksson, T., “The Chemometric Space”, NIR News (PO Box 10, Selsey, Chichester, West Sussex, PO20 9HR, UK, 1992).
7. Bonate, P.L., “Concepts in Calibration Theory”, LC/GC, 10(4), 310–314 (1992).
8. Bonate, P.L., “Concepts in Calibration Theory”, LC/GC, 10(5), 378–379 (1992).
9. Bonate, P.L., “Concepts in Calibration Theory”, LC/GC, 10(6), 448–450 (1992).
10. Bonate, P.L., “Concepts in Calibration Theory”, LC/GC, 10(7), 531–532 (1992).
11. Miller, J.N., “Calibration Methods in Spectroscopy”, Spectroscopy International 3(2), 42–44 (1991).
12. Miller, J.N., “Calibration Methods in Spectroscopy”, Spectroscopy International 3(4), 41–43 (1991).
13. Miller, J.N., “Calibration Methods in Spectroscopy”, Spectroscopy International 3(5), 43–46 (1991).
14. Miller, J.N., “Calibration Methods in Spectroscopy”, Spectroscopy International 3(6), 45–47 (1991).
15. Miller, J.N., “Calibration Methods in Spectroscopy”, Spectroscopy International 4(1), 41–43 (1992).
16. Mark, H. and Workman, J., “Statistics in Spectroscopy – Part 6 – The Normal Distribution”, Spectroscopy 2(9), 37–44 (1987).
17. Anderson, T.W., An Introduction to Multivariate Statistical Analysis (Wiley, New York, 1958).
18. Coleman, D. and Vanatta, L., Statistics in Analytical Chemistry, International Scientific Communications, Inc., found at http://www.iscpubs.com/articles/index.php?2.

2

Elementary Matrix Algebra: Part 1

You may recall that in the first chapter we promised that a review of elementary matrix algebra would be forthcoming; so the next several chapters will cover this topic all the way from the very basics to the more advanced spectroscopic subjects. You may already have discovered that the term “matrix” is a fanciful name for a table or list. If you have recently made a grocery list, you have created an n × 1 matrix or, in more correct nomenclature, an $\mathbf{X}_{n\times 1}$ matrix, where n is the number of items you would like to buy (rows) and 1 is the number of columns. If you have become a highly sophisticated shopper and have made lists consisting of one column for Store A and a second one for Store B, you have ascended into the world of the $\mathbf{X}_{n\times 2}$ matrix. If you include the price of each item and put brackets around the entire column(s) of prices, you will have created a numerical matrix.

By definition, a numerical matrix is a rectangular array of numbers (termed “elements”) enclosed by square brackets [ ]. Matrices can be used to organize information such as size versus cost in a grocery department, or they may be used to simplify the problems associated with systems or groups of linear equations. Later in this chapter we will introduce the operations involved for linear equations (see Table 2-1 for common symbols used).

Table 2-1 Common symbols used in matrix notation

Name                               Symbol
Matrix*                            [X] or $\mathbf{X}$
Determinant*                       |X|
Vectors*                           $\mathbf{x}$
Scalars*                           $x$
Parameters or matrix names         A, B, C, G, H, P, Q, R, S, U, V
Errors and residuals               D, E, F
Addition                           +
Subtraction                        −
Multiplication                     × or •
Division                           ÷ or /
Empty or null set                  ∅
Inverse of a matrix                $[X]^{-1}$
Transpose of a matrix              $[X]'$ or $[X]^{T}$
Generalized inverse of a matrix    $[X]^{-}$
Identity matrix                    [I] or [1]

* Where X or x are represented by any letter, generally those listed under “Parameters or matrix names” in this table.


The symbols below represent a matrix:

$$\begin{bmatrix} a_1 & b_1 \\ a_2 & b_2 \end{bmatrix}$$

Note that $a_1$ and $a_2$ are in column 1, $b_1$ and $b_2$ are in column 2, $a_1$ and $b_1$ are in row 1, and $a_2$ and $b_2$ are in row 2. The above matrix is a 2 × 2 (rows × columns) matrix. The first number indicates the number of rows, and the second indicates the number of columns. Matrices can be denoted as $\mathbf{X}_{2\times 2}$ using a capital, boldface letter with the row and column subscript.

MATRIX OPERATIONS

The following illustrations are useful to describe very basic matrix operations. Discussions covering more advanced matrix operations will be included in later chapters, but for now, just review these elementary operations.

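Matrix addition

To add two matrices, the following operation is performed:

$$\begin{bmatrix} a_1 & b_1 \\ a_2 & b_2 \end{bmatrix} + \begin{bmatrix} c_1 & d_1 \\ c_2 & d_2 \end{bmatrix} = \begin{bmatrix} a_1 + c_1 & b_1 + d_1 \\ a_2 + c_2 & b_2 + d_2 \end{bmatrix}$$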

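To add larger matrices, the following operation applies:

$$\begin{bmatrix} a_1 & b_1 & c_1 & d_1 \\ a_2 & b_2 & c_2 & d_2 \end{bmatrix} + \begin{bmatrix} e_1 & f_1 & g_1 & h_1 \\ e_2 & f_2 & g_2 & h_2 \end{bmatrix} = \begin{bmatrix} a_1 + e_1 & b_1 + f_1 & c_1 + g_1 & d_1 + h_1 \\ a_2 + e_2 & b_2 + f_2 & c_2 + g_2 & d_2 + h_2 \end{bmatrix}$$

Subtraction

For subtraction, use the following operations:

$$\begin{bmatrix} a_1 & b_1 \\ a_2 & b_2 \end{bmatrix} - \begin{bmatrix} c_1 & d_1 \\ c_2 & d_2 \end{bmatrix} = \begin{bmatrix} a_1 - c_1 & b_1 - d_1 \\ a_2 - c_2 & b_2 - d_2 \end{bmatrix}$$

The same operation holds true for larger matrices such as

$$\begin{bmatrix} a_1 & b_1 & c_1 & d_1 \\ a_2 & b_2 & c_2 & d_2 \end{bmatrix} - \begin{bmatrix} e_1 & f_1 & g_1 & h_1 \\ e_2 & f_2 & g_2 & h_2 \end{bmatrix} = \begin{bmatrix} a_1 - e_1 & b_1 - f_1 & c_1 - g_1 & d_1 - h_1 \\ a_2 - e_2 & b_2 - f_2 & c_2 - g_2 & d_2 - h_2 \end{bmatrix}$$

and so on.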
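For readers who like to check such rules numerically, the element-by-element addition and subtraction described above can be sketched in a few lines of Python. This is only an illustration: the function names `mat_add` and `mat_sub` and the numeric values are our own assumptions, not part of any particular software package.

```python
def mat_add(x, y):
    """Element-by-element sum of two matrices with identical dimensions."""
    if len(x) != len(y) or any(len(r) != len(s) for r, s in zip(x, y)):
        raise ValueError("matrices must have the same dimensions")
    return [[a + b for a, b in zip(r, s)] for r, s in zip(x, y)]


def mat_sub(x, y):
    """Element-by-element difference of two matrices with identical dimensions."""
    if len(x) != len(y) or any(len(r) != len(s) for r, s in zip(x, y)):
        raise ValueError("matrices must have the same dimensions")
    return [[a - b for a, b in zip(r, s)] for r, s in zip(x, y)]


# Hypothetical 2 x 2 matrices, stored as lists of rows
X = [[1, 2], [3, 4]]
Y = [[10, 20], [30, 40]]

total = mat_add(X, Y)
difference = mat_sub(Y, X)
```

The same two functions work unchanged for the 2 × 4 case, since they only require that both matrices have the same number of rows and columns.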


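Matrix multiplication

To multiply a scalar by a matrix (or a vector) we use

$$A \times \begin{bmatrix} a_1 & b_1 \\ a_2 & b_2 \end{bmatrix} = \begin{bmatrix} A \times a_1 & A \times b_1 \\ A \times a_2 & A \times b_2 \end{bmatrix}$$

where A is a scalar value.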

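The product of two matrices (or vectors) is given by

$$\begin{bmatrix} a_1 & b_1 \\ a_2 & b_2 \end{bmatrix} \times \begin{bmatrix} c_1 & d_1 \\ c_2 & d_2 \end{bmatrix} = \begin{bmatrix} a_1 c_1 + b_1 c_2 & a_1 d_1 + b_1 d_2 \\ a_2 c_1 + b_2 c_2 & a_2 d_1 + b_2 d_2 \end{bmatrix}$$

In another example, in which an $\mathbf{X}_{1\times 2}$ matrix is multiplied by an $\mathbf{X}_{2\times 1}$ matrix, each element of the row is multiplied by the corresponding element of the column and the products are summed:

$$\begin{bmatrix} a_1 & b_1 \end{bmatrix} \times \begin{bmatrix} a_2 \\ b_2 \end{bmatrix} = \begin{bmatrix} a_1 a_2 + b_1 b_2 \end{bmatrix}$$

denoted by $\mathbf{X}_{1\times 2} \times \mathbf{X}_{2\times 1}$ in matrix notation.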
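The multiplication rules above can also be tried out numerically. The following is a minimal sketch in plain Python, assuming matrices are stored as lists of rows; the names `scalar_times` and `mat_mul` and the numbers used are hypothetical choices made for illustration.

```python
def scalar_times(k, x):
    """Multiply every element of matrix x by the scalar k."""
    return [[k * v for v in row] for row in x]


def mat_mul(x, y):
    """Row-by-column product; each row of x needs as many elements as y has rows."""
    if any(len(row) != len(y) for row in x):
        raise ValueError("columns of the first matrix must match rows of the second")
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


X = [[1, 2], [3, 4]]
Y = [[5, 6], [7, 8]]

doubled = scalar_times(2, X)                 # scalar times a 2 x 2 matrix
product = mat_mul(X, Y)                      # 2 x 2 times 2 x 2
row_by_col = mat_mul([[1, 2]], [[3], [4]])   # 1 x 2 times 2 x 1 gives a 1 x 1 matrix
```

Swapping the arguments of `mat_mul` generally gives a different answer, which is one way to see that the order of the matrices matters.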

Matrix division

Division of a matrix by a scalar is accomplished by:

$$\begin{bmatrix} a_1 & b_1 \\ a_2 & b_2 \end{bmatrix} \div A = \begin{bmatrix} a_1 / A & b_1 / A \\ a_2 / A & b_2 / A \end{bmatrix}$$

where A is a scalar value.

Inverse of a matrix

The inverse of a matrix is the conceptual equivalent of its reciprocal. Therefore, if we denote our matrix by $\mathbf{X}$, then the inverse of $\mathbf{X}$ is denoted as $\mathbf{X}^{-1}$ and the following relationship holds:

$$\mathbf{X} \times \mathbf{X}^{-1} = [1] = \mathbf{X}^{-1} \times \mathbf{X}$$

where [1] is an identity matrix. Only square matrices, which have an equal number of rows and columns (for example, 2 × 2, 3 × 3, 4 × 4, etc.), have inverses. Several computer packages provide the algorithms for calculating the inverse of square matrices. The identity matrix for a 2 × 2 matrix is

$$[1]_{2\times 2} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

and for a 3 × 3 matrix, the identity matrix is

$$[1]_{3\times 3} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

and so on. Note that the diagonal is always composed of ones for the identity matrix, and all other values are zero. To summarize, by definition:

$$\mathbf{X}_{2\times 2} \times \mathbf{X}^{-1}_{2\times 2} = [1]_{2\times 2}$$

The basic methods for calculating $\mathbf{X}^{-1}$ will be addressed in the next chapter.

Transpose of a matrix

The transpose of a matrix is denoted by $\mathbf{X}'$ (or, alternatively, by $\mathbf{X}^{T}$). For example, for the matrix:

$$[X] = \begin{bmatrix} a_1 & b_1 & c_1 \\ a_2 & b_2 & c_2 \end{bmatrix}$$

then

$$[X]' = \begin{bmatrix} a_1 & a_2 \\ b_1 & b_2 \\ c_1 & c_2 \end{bmatrix}$$

The first column of [X] becomes the first row of $[X]'$; the second column of [X] becomes the second row of $[X]'$; the third column of [X] becomes the third row of $[X]'$; and so on.
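As a small illustration of the column-to-row rule just described, the transpose can be sketched in Python as follows. The function name `transpose` and the sample numbers are assumptions made only for this example.

```python
def transpose(x):
    """Return the transpose: column k of x becomes row k of the result."""
    return [list(column) for column in zip(*x)]


# A hypothetical 2 x 3 matrix, stored as a list of rows
X = [[1, 2, 3],
     [4, 5, 6]]

Xt = transpose(X)   # a 3 x 2 matrix
```

Transposing twice returns the original matrix, which makes a convenient sanity check.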

ELEMENTARY OPERATIONS FOR LINEAR EQUATIONS

To solve problems involving calibration equations using multivariate linear models, we need to be able to perform elementary operations on sets or systems of linear equations. So before using our newly discovered powers of matrix algebra, let us solve a problem using the algebra many of us learned very early in life. The elementary operations used for manipulating linear equations include three simple rules [1, 2]:

• Equations can be listed in any order for convenience and organizational purposes.
• Any equation may be multiplied by any real number other than zero.
• Any equation in a series of equations can be replaced by the sum of itself and any other equation in the system.

As an example, we can illustrate these operations using


the three equations below as part of what is termed an “equation system” or simply a “system” (equations 2-1 through 2-3):

$$1a_1 + 1b_1 = -2 \tag{2-1}$$

$$4a_1 + 2b_1 + c_1 = 6 \tag{2-2}$$

$$6a_1 - 2b_1 - 4c_1 = 14 \tag{2-3}$$

To solve this system of three equations, we begin by following the three elementary operations rules above:

• We can rearrange the equations in any order. In our case the equations happen to be in a useful order.
• We decide to multiply equation 2-1 by a factor such that the coefficients of $a_1$ are of opposite sign and of the same absolute value for equations 2-1 and 2-2. Therefore, we multiply equation 2-1 by −4 to yield

$$-4a_1 - 4b_1 = 8 \tag{2-4}$$

• We can eliminate $a_1$ from the second equation by adding equations 2-4 and 2-2 to give equation 2-5:

$$(-4a_1 - 4b_1 = 8) + (4a_1 + 2b_1 + c_1 = 6) \;\Rightarrow\; -2b_1 + c_1 = 14 \tag{2-5}$$

and we bring equation 2-1 back into the system by dividing equation 2-4 by −4 to get

$$a_1 + b_1 = -2 \tag{2-6}$$

$$-2b_1 + c_1 = 14 \tag{2-7}$$

$$6a_1 - 2b_1 - 4c_1 = 14 \tag{2-8}$$

Now to eliminate the $a_1$ term from equation 2-8, we multiply equation 2-6 by −6 to yield

$$-6a_1 - 6b_1 = 12 \tag{2-9}$$

Then we add equation 2-9 to equation 2-8:

$$(-6a_1 - 6b_1 = 12) + (6a_1 - 2b_1 - 4c_1 = 14) \;\Rightarrow\; -8b_1 - 4c_1 = 26 \tag{2-10}$$


Now we bring back equation 2-6 in its original form by dividing equation 2-9 by −6, and our system of equations looks like this:

$$a_1 + b_1 = -2 \tag{2-11}$$

$$-2b_1 + c_1 = 14 \tag{2-12}$$

$$-8b_1 - 4c_1 = 26 \tag{2-13}$$

We can eliminate the $b_1$ term from equations 2-12 and 2-13 by multiplying equation 2-12 by −8 and equation 2-13 by 2 to obtain

$$16b_1 - 8c_1 = -112 \tag{2-14}$$

$$-16b_1 - 8c_1 = 52 \tag{2-15}$$

Adding these equations, we find

$$-16c_1 = -60 \tag{2-16}$$

Restore equation 2-7 by dividing equation 2-14 by −8 to yield

$$a_1 + b_1 = -2 \tag{2-17}$$

$$-2b_1 + c_1 = 14 \tag{2-18}$$

$$-16c_1 = -60 \tag{2-19}$$

The solution

Solving for $c_1$, we find $c_1 = (-60/-16) = 3.75$. Substituting $c_1$ into equation 2-18, we obtain $-2b_1 + 3.75 = 14$. Solving this for $b_1$, we find $b_1 = -5.13$. Substituting $b_1$ into equation 2-17, we find $a_1 + (-5.13) = -2$. Solving this for $a_1$, we find $a_1 = 3.13$. Finally,

$$a_1 = 3.13 \qquad b_1 = -5.13 \qquad c_1 = 3.75$$

(the values of $a_1$ and $b_1$ are rounded from 3.125 and −5.125).

A system of equations in which the first unknown is missing from every equation after the first, and the second unknown is missing from every equation after the second, is said to be in echelon form. Every equation system composed of linear equations can be brought into echelon form by using elementary algebraic operations. The use of augmented matrices can accomplish the task of solving the equation system just illustrated.
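The elimination just carried out by hand can be sketched as a short Python routine. This is a generic sketch under stated assumptions, not a transcription of the exact sequence of steps above: it uses the three rules to reach an echelon form, assumes the system has a unique solution, and uses exact fractions so that no rounding occurs. The function name `solve_by_elimination` is our own.

```python
from fractions import Fraction


def solve_by_elimination(coeffs, rhs):
    """Reduce a square linear system to echelon form, then back-substitute."""
    n = len(coeffs)
    rows = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(coeffs, rhs)]
    for col in range(n):
        # Rule 1: reorder so the pivot position holds a nonzero coefficient
        pivot = next((r for r in range(col, n) if rows[r][col] != 0), None)
        if pivot is None:
            raise ValueError("system does not have a unique solution")
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            # Rules 2 and 3: add a multiple of the pivot equation to equation r
            factor = rows[r][col] / rows[col][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    # Back substitution, starting from the last equation
    x = [Fraction(0)] * n
    for i in reversed(range(n)):
        known = sum(rows[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (rows[i][n] - known) / rows[i][i]
    return rows, x


# Equations 2-1 through 2-3
echelon, (a1, b1, c1) = solve_by_elimination(
    [[1, 1, 0], [4, 2, 1], [6, -2, -4]], [-2, 6, 14]
)
print(float(a1), float(b1), float(c1))
```

The last row of `echelon` is a multiple of equation 2-19 rather than an exact copy of it, because the routine picks its own multipliers; the solution is unaffected.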


For our previous example, the original equations

$$a_1 + b_1 = -2 \tag{2-20}$$

$$4a_1 + 2b_1 + c_1 = 6 \tag{2-21}$$

$$6a_1 - 2b_1 - 4c_1 = 14 \tag{2-22}$$

can be written in augmented matrix form as:

$$\left[\begin{array}{rrr|r} 1 & 1 & 0 & -2 \\ 4 & 2 & 1 & 6 \\ 6 & -2 & -4 & 14 \end{array}\right] \tag{2-23}$$

The echelon form of the equations can also be put into matrix form as follows. Echelon form:

$$a_1 + b_1 = -2 \tag{2-24}$$

$$-2b_1 + c_1 = 14 \tag{2-25}$$

$$-16c_1 = -60 \tag{2-26}$$

Matrix form:

$$\left[\begin{array}{rrr|r} 1 & 1 & 0 & -2 \\ 0 & -2 & 1 & 14 \\ 0 & 0 & -16 & -60 \end{array}\right] \tag{2-27}$$

SUMMARY

In this chapter, we have used elementary operations for linear equations to solve a problem. The three rules listed for these operations have a parallel set of three rules used for elementary matrix operations on linear equations. In our next chapter we will explore the rules for solving a system of linear equations by using matrix techniques.

REFERENCES

1. Kowalski, B.R., Recommendations to IUPAC Chemometrics Society (Laboratory for Chemometrics, Department of Chemistry, BG-10, University of Washington, Seattle, WA 98195; 1985), pp. 1–2.
2. Britton, J.R. and Bello, I., Topics in Contemporary Mathematics (Harper & Row, New York, 1984), pp. 408–457.

3

Elementary Matrix Algebra: Part 2

ELEMENTARY MATRIX OPERATIONS

To solve the set of linear equations introduced in our previous chapter, referenced as [1], we will now use elementary matrix operations. These matrix operations have a set of rules which parallel the rules used for elementary algebraic operations used for solving systems of linear equations. The rules for elementary matrix operations are as follows [2]:

1) Rows can be listed in any order for convenience or organizational purposes.
2) All elements within a row may be multiplied using any real number other than zero.
3) Any row can be replaced by the element-by-element sum of itself and any other row.

To solve a system of equations, our first step is to put zeros into the second and the third rows of the first column, and into the third row of the second column. For our exercise we will bring forward equations 2-1 through 2-3 as equation set 3-1:

$$\begin{aligned} 1a_1 + 1b_1 &= -2 \\ 4a_1 + 2b_1 + 1c_1 &= 6 \\ 6a_1 - 2b_1 - 4c_1 &= 14 \end{aligned} \tag{3-1}$$

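We can put the above set or system of equations in matrix notation as:

$$\mathbf{A} = \begin{bmatrix} 1 & 1 & 0 \\ 4 & 2 & 1 \\ 6 & -2 & -4 \end{bmatrix} \qquad \mathbf{B} = \begin{bmatrix} a_1 \\ b_1 \\ c_1 \end{bmatrix} \qquad \mathbf{C} = \begin{bmatrix} -2 \\ 6 \\ 14 \end{bmatrix}$$

and so,

$$\mathbf{A}\mathbf{B} = \mathbf{C} \qquad \text{or} \qquad \mathbf{A} \bullet \mathbf{B} = \mathbf{C}$$

Matrix A is termed the “matrix of the equation system”. The matrix formed by $[\mathbf{A}\,|\,\mathbf{C}]$ is termed the “augmented matrix”. For this problem the augmented matrix is given as:

$$[\mathbf{A}\,|\,\mathbf{C}] = \left[\begin{array}{rrr|r} 1 & 1 & 0 & -2 \\ 4 & 2 & 1 & 6 \\ 6 & -2 & -4 & 14 \end{array}\right]$$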


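Now if we were to find a set of equations with zeros in the second and the third rows of the first column, and in the third row of the second column, we could use equations 2-17 through 2-19 [1], which look like (equation set 3-2):

$$\begin{aligned} a_1 + b_1 &= -2 \\ -2b_1 + c_1 &= 14 \\ -16c_1 &= -60 \end{aligned} \tag{3-2}$$

We can rewrite these equations in matrix notation as:

$$\mathbf{G} = \begin{bmatrix} 1 & 1 & 0 \\ 0 & -2 & 1 \\ 0 & 0 & -16 \end{bmatrix} \qquad \mathbf{H} = \begin{bmatrix} a_1 \\ b_1 \\ c_1 \end{bmatrix} \qquad \mathbf{P} = \begin{bmatrix} -2 \\ 14 \\ -60 \end{bmatrix}$$

and the augmented form of the above matrices is written as:

$$[\mathbf{G}\,|\,\mathbf{P}] = \left[\begin{array}{rrr|r} 1 & 1 & 0 & -2 \\ 0 & -2 & 1 & 14 \\ 0 & 0 & -16 & -60 \end{array}\right]$$

We can reduce or simplify the third row in $[\mathbf{G}\,|\,\mathbf{P}]$ (equation 2-19) by following Rule 2 of the basic matrix operations previously mentioned. As such we can multiply row III in $[\mathbf{G}\,|\,\mathbf{P}]$ by 1/2 to give

$$[\mathbf{G}\,|\,\mathbf{P}] = \left[\begin{array}{rrr|r} 1 & 1 & 0 & -2 \\ 0 & -2 & 1 & 14 \\ 0 & 0 & -8 & -30 \end{array}\right]$$

We can use elementary row operations, also known as elementary matrix operations, to obtain matrix $[\mathbf{G}\,|\,\mathbf{P}]$ from $[\mathbf{A}\,|\,\mathbf{C}]$. By the way, if we can achieve $[\mathbf{G}\,|\,\mathbf{P}]$ from $[\mathbf{A}\,|\,\mathbf{C}]$ using these operations, the matrices are termed “row equivalent”, denoted by $\mathbf{X}_1 \sim \mathbf{X}_2$. To illustrate the use of elementary matrix operations, let us use the following example. Our original augmented matrix $[\mathbf{A}\,|\,\mathbf{C}]$ above can be manipulated to yield zeros in rows II and III of column I by a series of row operations. The example below illustrates this:

$$\left[\begin{array}{rrr|r} 1 & 1 & 0 & -2 \\ 4 & 2 & 1 & 6 \\ 6 & -2 & -4 & 14 \end{array}\right] \sim \left[\begin{array}{rrr|r} 1 & 1 & 0 & -2 \\ 0 & -2 & 1 & 14 \\ 0 & -8 & -4 & 26 \end{array}\right]$$

The left-hand augmented matrix is converted to the right-hand augmented matrix by II/II − 4I (row II is replaced by row II minus 4 times row I), then III/III − 6I (row III is replaced by row III minus 6 times row I). To complete the row operations to yield $[\mathbf{G}\,|\,\mathbf{P}]$ from $[\mathbf{A}\,|\,\mathbf{C}]$ we write

$$\left[\begin{array}{rrr|r} 1 & 1 & 0 & -2 \\ 0 & -2 & 1 & 14 \\ 0 & -8 & -4 & 26 \end{array}\right] \sim \left[\begin{array}{rrr|r} 1 & 1 & 0 & -2 \\ 0 & -2 & 1 & 14 \\ 0 & 0 & -8 & -30 \end{array}\right]$$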


This is accomplished by III/III − 4II (row III is replaced by row III minus 4 times row II). As we have just shown, using two series of row operations we have

$$\left[\begin{array}{rrr|r} 1 & 1 & 0 & -2 \\ 0 & -2 & 1 & 14 \\ 0 & 0 & -8 & -30 \end{array}\right]$$

which is equivalent to equations 2-17 through 2-19 and to equation set 3-2 above; this is shown here as equation set 3-3:

$$\begin{aligned} a_1 + b_1 &= -2 \\ -2b_1 + c_1 &= 14 \\ -8c_1 &= -30 \end{aligned} \tag{3-3}$$

Now, solving for $c_1 = -30/-8 = 3.75$; substituting $c_1$ into equation 2-18, we find $-2b_1 + 3.75 = 14$, therefore $b_1 = -5.13$; and substituting $b_1$ into equation 2-17, we find $a_1 + (-5.13) = -2$, therefore $a_1 = 3.13$; and so,

$$a_1 = 3.13 \qquad b_1 = -5.13 \qquad c_1 = 3.75$$

Thus matrix operations provide a simplified method for solving equation systems as compared to elementary algebraic operations for linear equations.
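The three row operations used above (II − 4I, III − 6I, and III − 4II) can be replayed in a few lines of Python as a check on the arithmetic. This is a minimal sketch; the helper name `row_replace` is hypothetical, and exact fractions are used so that the final values are not rounded.

```python
from fractions import Fraction


def row_replace(m, target, source, factor):
    """Return a copy of m where row `target` becomes itself plus factor times row `source`."""
    out = [row[:] for row in m]
    out[target] = [t + factor * s for t, s in zip(m[target], m[source])]
    return out


# Augmented matrix [A|C] of equation set 3-1 (row I is index 0, row II is index 1, ...)
AC = [[Fraction(v) for v in row]
      for row in ([1, 1, 0, -2], [4, 2, 1, 6], [6, -2, -4, 14])]

step1 = row_replace(AC, 1, 0, -4)     # II  <- II  - 4 I
step2 = row_replace(step1, 2, 0, -6)  # III <- III - 6 I
GP = row_replace(step2, 2, 1, -4)     # III <- III - 4 II

# Back substitution on the echelon form
c1 = GP[2][3] / GP[2][2]
b1 = (GP[1][3] - GP[1][2] * c1) / GP[1][1]
a1 = (GP[0][3] - GP[0][1] * b1 - GP[0][2] * c1) / GP[0][0]
print(float(a1), float(b1), float(c1))
```

Because each operation returns a new matrix, the intermediate results `step1` and `step2` remain available for comparison with the matrices shown in the text.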

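CALCULATING THE INVERSE OF A MATRIX

In Chapter 2, we promised to show the steps involved in taking the inverse of a matrix. Given a 2 × 2 matrix $\mathbf{X}_{2\times 2}$, how is the inverse calculated? We can ask the question another way: “What matrix, when multiplied by a given matrix $\mathbf{X}_{r\times c}$, will give the identity matrix [I]?” In matrix form we may write a specific example as:

$$\begin{bmatrix} -2 & 1 \\ -3 & 2 \end{bmatrix} \sim \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

Therefore,

$$\begin{bmatrix} -2 & 1 \\ -3 & 2 \end{bmatrix} \times \begin{bmatrix} c_1 & d_1 \\ c_2 & d_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = [\mathrm{I}]$$

or stated in matrix notation as $\mathbf{A} \times \mathbf{B} = \mathbf{I}$, where B is the inverse matrix of A, and [I] is the identity matrix.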


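By multiplying A × B we can calculate the two basic equation systems to use in solving this problem as:

System 1:

$$\begin{aligned} -2c_1 + 1c_2 &= 1 \\ -3c_1 + 2c_2 &= 0 \end{aligned}$$

System 2:

$$\begin{aligned} -2d_1 + 1d_2 &= 0 \\ -3d_1 + 2d_2 &= 1 \end{aligned}$$

The augmented matrices are denoted as:

$$\left[\begin{array}{rr|rr} -2 & 1 & 1 & 0 \\ -3 & 2 & 0 & 1 \end{array}\right]$$

The preceding matrix is reduced to echelon form (a zero in the second row of column one) by

$$\left[\begin{array}{rr|rr} -2 & 1 & 1 & 0 \\ -3 & 2 & 0 & 1 \end{array}\right] \sim \left[\begin{array}{rr|rr} -2 & 1 & 1 & 0 \\ 0 & -1 & 3 & -2 \end{array}\right]$$

The row operation is II/3I − 2II (row II is replaced by three times row I minus two times row II). The next steps are as follows:

$$\left[\begin{array}{rr|rr} -2 & 1 & 1 & 0 \\ 0 & -1 & 3 & -2 \end{array}\right] \sim \left[\begin{array}{rr|rr} -2 & 0 & 4 & -2 \\ 0 & -1 & 3 & -2 \end{array}\right] \sim \left[\begin{array}{rr|rr} 1 & 0 & -2 & 1 \\ 0 & 1 & -3 & 2 \end{array}\right]$$

with row operations I/I + II, then I/(−1/2)I and II/(−1)II.

Thus $c_1 = -2$, $c_2 = -3$, $d_1 = 1$, and $d_2 = 2$. So $\mathbf{B} = \mathbf{A}^{-1}$ (inverse of A) and

$$\mathbf{A}^{-1} = \begin{bmatrix} -2 & 1 \\ -3 & 2 \end{bmatrix}$$

So now we check our work by multiplying $\mathbf{A} \bullet \mathbf{A}^{-1}$ as follows:

$$\mathbf{A} \times \mathbf{A}^{-1} = \begin{bmatrix} -2 & 1 \\ -3 & 2 \end{bmatrix} \times \begin{bmatrix} -2 & 1 \\ -3 & 2 \end{bmatrix} = \begin{bmatrix} (-2)(-2) + (1)(-3) & (-2)(1) + (1)(2) \\ (-3)(-2) + (2)(-3) & (-3)(1) + (2)(2) \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = [\mathrm{I}]$$

By coincidence, we have found a matrix which when multiplied by itself gives the identity matrix or, saying it another way, it is its own inverse. Of course, that does not generally happen; a matrix and its inverse are usually different.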
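The idea of augmenting A with the identity matrix and reducing the left-hand block to the identity can be sketched as a general routine. Note the assumptions: this is a textbook-style Gauss-Jordan variant that chooses its own row operations (it divides each pivot row by its pivot), so its intermediate matrices differ from the hand calculation above even though the final inverse is the same. The function name `inverse` is our own.

```python
from fractions import Fraction


def inverse(m):
    """Invert a square matrix by reducing [m | I] to [I | inverse of m]."""
    n = len(m)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        # Rule 1: bring a row with a nonzero entry into the pivot position
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular; no inverse exists")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Rule 2: scale the pivot row so that the pivot element equals one
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Rule 3: clear the rest of the column using multiples of the pivot row
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


A = [[-2, 1], [-3, 2]]
A_inv = inverse(A)
```

For this particular matrix the routine returns the matrix A itself, in agreement with the observation that A is its own inverse.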

SUMMARY

Hopefully Chapters 1 and 2 have refreshed your memory of early studies in matrix algebra. In this chapter we have tried to review the basic steps used to solve a system of linear equations using elementary matrix algebra. In addition, basic row operations were used to calculate the inverse of a matrix. In the next chapter we will address the matrix nomenclature used for a simple case of multiple linear regression.

REFERENCES

1. Workman, J., Jr. and Mark, H., Spectroscopy 8(7), 16–19 (1993).
2. Britton, J.R. and Bello, I., Topics in Contemporary Mathematics (Harper & Row, New York, 1984), pp. 408–457.

4

Matrix Algebra and Multiple Linear Regression: Part 1

In a previous chapter we noted that by augmenting the matrix of coefficients with a unit matrix (i.e., one that has all the members equal to zero except on the main diagonal, where the members of the matrix equal unity), we could arrive at the solution to the simultaneous equations that were presented. Since simultaneous equations are, in one sense, a special case of regression (i.e., the case where there are no degrees of freedom for error), it is still appropriate to discuss a few odds and ends that were left dangling. We started in the previous chapter with the set of simultaneous equations:

$$1a + 1b + 0c = -2 \tag{4-1a}$$

$$4a + 2b + 1c = 6 \tag{4-1b}$$

$$6a - 2b - 4c = 14 \tag{4-1c}$$

(where we now leave the subscripts off the variables for simplicity, with no loss of generality for our current purposes). Also note that here we write all the coefficients out explicitly, even when the ones and zeroes do not necessarily appear in the original equations; this is so that they will not be inadvertently left out of the matrix expressions, where the “place filling” function must be performed. We noted that we could express these equations in matrix notation as:

$$\mathbf{A} = \begin{bmatrix} 1 & 1 & 0 \\ 4 & 2 & 1 \\ 6 & -2 & -4 \end{bmatrix} \qquad \mathbf{B} = \begin{bmatrix} a \\ b \\ c \end{bmatrix} \qquad \mathbf{C} = \begin{bmatrix} -2 \\ 6 \\ 14 \end{bmatrix}$$

where the equations then take the matrix form:

$$\mathbf{A} * \mathbf{B} = \mathbf{C} \tag{4-2}$$

The question here is, how did we get from equations 4-1 to equation 4-2? The answer is that it is not at all obvious, even in such a simple and straightforward case, how to break up a group of algebraic equations into their equivalent matrix expression. It turns out, however, that going in the other direction is often much simpler and more straightforward. Thus, when setting up matrix expressions, it is often desirable to run a check on the work to verify that the matrix expression indeed correctly represents the algebraic expression of interest. In the current case, this can be done very simply by carrying out the matrix multiplication indicated on the left-hand side of equation 4-2. Thus, expanding the matrix expression AB into its full representation, we obtain

$$\begin{bmatrix} 1 & 1 & 0 \\ 4 & 2 & 1 \\ 6 & -2 & -4 \end{bmatrix} \times \begin{bmatrix} a \\ b \\ c \end{bmatrix} \tag{4-3}$$


From our previous chapter defining the elementary matrix operations, we recall the operation for multiplying two matrices: the i j element of the result matrix (where i and j represent the row and the column of an element in the matrix respectively) is the sum of cross-products of the ith row of the first matrix and the jth column of the second matrix (this is the reason that the order of multiplying matrices depends upon the order of appearance of the matrices – if the indicated ith row and jth column do not have the same number of elements, the matrices cannot be multiplied). Now let us apply this definition to the pair of matrices listed above. The first matrix (A) has three rows and three columns. The second matrix (B) has three rows and one column. Since each row of A has three elements, and the single column of B has three elements, matrix multiplication is possible. The resulting matrix will have three rows, each row resulting from one of the rows of matrix A, and one column, corresponding to the single column in the matrix B. Thus the first row of the result matrix will have the single element resulting from the sum-of-products of the first row of A times the column of B, which will be 1a + 1b + 0c

(4-4)

Similarly the second row of the result matrix will have the single element resulting from the sum-of-products of the second row of A times the column of B, which will be 4a + 2b + 1c

(4-5)

and the third row of the result matrix will have the single element resulting from the sum-of-products of the third row of A times the column of B, which will be

$$6a + (-2)b + (-4)c \tag{4-6}$$

or, simplifying:

$$6a - 2b - 4c \tag{4-7}$$

The entire matrix product, then, is

$$\mathbf{A}\mathbf{B} = \begin{bmatrix} 1a + 1b + 0c \\ 4a + 2b + 1c \\ 6a - 2b - 4c \end{bmatrix}$$

Equations 4-4, 4-5, and 4-6 represent the three elements of the matrix product of A and B. Note that each row of this resulting matrix contains only one element, even though each of these elements is the result of a fairly extensive sequence of arithmetic operations. Equations 4-4, 4-5, and 4-7, however, represent the symbolism you would normally expect to see when looking at the set of simultaneous equations that these matrix expressions replace. Note further that this matrix product AB is the same as the entire left-hand side of the original set of simultaneous equations that we originally set out to solve. Thus we have shown that these matrix expressions can be readily verified through straightforward application of the basic matrix operations, thus clearing up one of the loose ends we had left.
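The same check can be made numerically. As a minimal sketch, assuming the unrounded solution a = 25/8, b = −41/8, c = 15/4 (3.125, −5.125 and 3.75) for the column matrix B, the sum-of-products of each row of A with the column of B should reproduce the right-hand sides −2, 6 and 14 of equations 4-1a through 4-1c. The variable names are our own.

```python
from fractions import Fraction

# Matrix of coefficients from equations 4-1a through 4-1c
A = [[1, 1, 0],
     [4, 2, 1],
     [6, -2, -4]]

# Assumed exact values of the unknowns a, b, c (3.125, -5.125, 3.75)
B = [Fraction(25, 8), Fraction(-41, 8), Fraction(15, 4)]

# Each element of AB is the sum of cross-products of one row of A with the column B
AB = [sum(coef * unknown for coef, unknown in zip(row, B)) for row in A]
print([int(v) for v in AB])
```

If a coefficient had been left out of A, for example the zero in the first row, the rows would no longer line up with the unknowns and this check would fail.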


Another loose end is the relationship between the quasi-algebraic expressions that matrix operations are normally written in and the computations that are used to implement those relationships. The computations themselves have been covered at some length in the previous two chapters [1, 2]. To relate these to the quasi-algebraic operations that matrices are subject to, let us look at those operations a bit more closely.

QUASI-ALGEBRAIC OPERATIONS

Considering equation 4-2, we note that the matrix expression looks like a simple algebraic expression relating the product of two variables to a third variable, even though in this case the “variables” in question are entire matrices. In equation 4-2, the matrix B represents the unknown quantities in the original simultaneous equations. If equation 4-2 were a simple algebraic equation, clearly the solution would be to divide both sides of this equation by A, which would result in the equation B = C/A. Since A and C both represent known quantities, a simple calculation would give the solution for the unknown B. There is no defined operation of division for matrices. However, a comparable result can be obtained by multiplying both sides of an equation (such as equation 4-2) by the inverse of matrix A. The inverse (of matrix A, for example) is conventionally written as $\mathbf{A}^{-1}$. Thus, the symbolic solution to equation 4-2 is generated by multiplying both sides of equation 4-2 by $\mathbf{A}^{-1}$:

$$\mathbf{A}^{-1}\mathbf{A}\mathbf{B} = \mathbf{A}^{-1}\mathbf{C} \tag{4-8}$$

There are a couple of key points to note about this operation. The main point is that since the order of appearance of the matrices matters, it is important that the new matrix, the one we are multiplying both sides of the equation by, is placed at the beginning of the expressions on each side of the equation. The second key point is the accomplishment of a desired goal: on the left-hand side of equation 4-8 we have the expression $\mathbf{A}^{-1}\mathbf{A}$. We noted earlier that the key defining characteristic of the inverse of a matrix is the fact that when it is multiplied by the original matrix (that it is the inverse of), the result is a unit matrix. Thus equation 4-8 is equivalent to

$$[1]\mathbf{B} = \mathbf{A}^{-1}\mathbf{C} \tag{4-9}$$

where [1] represents the unit matrix. Since the property of the unit matrix is that when multiplied by any other matrix, the result is the same as the other matrix, then [1]B = B, and equation 4-9 becomes

$$\mathbf{B} = \mathbf{A}^{-1}\mathbf{C} \tag{4-10}$$

Thus we have symbolically solved equation 4-2 for the unknown matrix B, the elements of which are the unknown variables of the original set of simultaneous equations. Performing the matrix multiplication of $\mathbf{A}^{-1}\mathbf{C}$ will then provide the values of these unknown variables.


Let us examine these symbolic transformations with a view toward seeing how they translate into the required arithmetic operations that will provide the answers to the original simultaneous equations. There are two key operations involved. The first is the inversion of the matrix, to provide the inverse matrix. This is an extremely intensive computational task, so much so that it is in general done only on computers, except in the simplest cases for pedagogical purposes, such as we did in our previous chapter. In this regard we are reminded of an old, and somewhat famous, cartoon, where two obviously professor-type characters are staring at a large blackboard. On the left side of the blackboard are a large number of mathematical symbols, obviously representing some complicated and abstruse mathematical derivations. On the right side of the blackboard is a similar set of symbols. In the middle of the blackboard is a large blank space, in the middle of which is written, in big letters: “AND THEN SOME MAGIC HAPPENS”, and one of the characters is saying to the other: “I think you need to be a bit more explicit here in step 10.” To some extent, we feel the same way about matrix inversions. The complications and amount of computation involved in actually doing a matrix inversion are enough to make even the most intrepid mathematician/statistician/chemometrician run for the nearest computer with a preprogrammed algorithm for the task. Indeed, there sometimes seem to be just about as many algorithms for performing a matrix inversion as there are people interested in doing them. In most cases, then, this process is in practice treated as a “black box” where “some magic happens”. Except for the theoretical mathematician, however, there is usually little interest in “being more explicit”, as long as the program gives the right answer. As is our wont, however, our previous chapter worked out the gory details for the simplest possible case, the case of a 2 × 2 matrix. 
For larger matrices, the amount of computation increases so rapidly with matrix size that even the 3 × 3 matrix is left to the computer to handle. But how can we tell then if the answer is correct? Well, there is a way, and one that is not too overwhelming. From the definition of the inverse of a matrix, you should obtain a unit matrix if you multiply the inverse of a given matrix by the matrix itself. In our previous chapter [1] we showed this for the 2 × 2 case. For the simultaneous equations at hand, however, the process is only a little more extensive. From the original matrix of coefficients in the simultaneous equations that we are working with, the one called A above, we find that the inverse of this matrix is

$$\mathbf{A}^{-1} = \begin{bmatrix} -0.375 & 0.25 & 0.0625 \\ 1.375 & -0.25 & -0.0625 \\ -1.25 & 0.5 & -0.125 \end{bmatrix} \tag{4-11}$$

How did we find this? Well, we used some of our magic. The details of the computations needed were described in the previous chapter, for the 2 × 2 case; we will not even try to go through the computations needed for the 3 × 3 case we concern ourselves with here. However, having a set of numbers that purports to be the inverse of a matrix, we can verify whether or not it is the inverse of that matrix: all we need to do is multiply by the original matrix and see if the result is a unit matrix. We have done this for the 2 × 2 matrix in our previous chapter. An exercise for the reader is to verify that the matrix shown in equation 4-11 is, in fact, the inverse of the matrix A.
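One way to carry out that exercise is sketched below in Python. The entries of the purported inverse are written as exact fractions (for example −0.375 = −3/8 and 0.0625 = 1/16), which is our own choice to avoid any rounding questions; the helper name `mat_mul` is likewise an assumption.

```python
from fractions import Fraction as F


def mat_mul(x, y):
    """Row-by-column matrix product of two matrices stored as lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
            for i in range(len(x))]


A = [[1, 1, 0],
     [4, 2, 1],
     [6, -2, -4]]

# Equation 4-11 written as fractions
A_inv = [[F(-3, 8), F(1, 4), F(1, 16)],
         [F(11, 8), F(-1, 4), F(-1, 16)],
         [F(-5, 4), F(1, 2), F(-1, 8)]]

unit = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

left_product = mat_mul(A_inv, A)    # inverse times original matrix
right_product = mat_mul(A, A_inv)   # original matrix times inverse
print(left_product == unit, right_product == unit)
```

Both orders of multiplication are checked, since a true inverse must give the unit matrix either way.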


That was the hard part. It now remains to calculate out the expression shown in equation 4-10, to find the final values for the unknowns in the original simultaneous equations. Thus, we need to form the matrix product of $\mathbf{A}^{-1}$ and C:

$$\mathbf{A}^{-1}\mathbf{C} = \begin{bmatrix} -0.375 & 0.25 & 0.0625 \\ 1.375 & -0.25 & -0.0625 \\ -1.25 & 0.5 & -0.125 \end{bmatrix} \times \begin{bmatrix} -2 \\ 6 \\ 14 \end{bmatrix} \tag{4-12}$$

This matrix multiplication is similar to the one we did before: we need to multiply a 3 × 3 matrix by a 3 × 1 matrix; the result will then also have dimensions of three rows and one column. By equation 4-10 this result is the matrix B, and its three rows will thus be the result of these computations:

$$B_{11} = (-0.375)(-2) + (0.25)(6) + (0.0625)(14) = 0.75 + 1.5 + 0.875 = 3.125 \tag{4-13a}$$

$$B_{21} = (1.375)(-2) + (-0.25)(6) + (-0.0625)(14) = -2.75 - 1.5 - 0.875 = -5.125 \tag{4-13b}$$

$$B_{31} = (-1.25)(-2) + (0.5)(6) + (-0.125)(14) = 2.5 + 3 - 1.75 = 3.75 \tag{4-13c}$$

Thus, in matrix terms, the matrix $\mathbf{B} = \mathbf{A}^{-1}\mathbf{C}$ is

$$\mathbf{B} = \begin{bmatrix} 3.125 \\ -5.125 \\ 3.75 \end{bmatrix} \tag{4-14}$$

and this may be compared to the result we obtained algebraically in the last chapter (and found to be identical, within the limits of different roundings used). At first glance it would seem as though this approach has the additional characteristic of requiring fewer computations than our previous method of solving similar equations. However, the computations are exactly the same, but most of them are “hidden” inside the matrix inversion. It might also seem that we have been repetitive in our explanation of these simultaneous equations. This is intentional: we are attempting to explicate the relationship between the algebraic approach and the matrix approach to solving the equations. Our first solution (in the previous chapter) was strictly algebraic. Our second solution used matrix terminology and concepts, in addition to explicitly writing out all the arithmetic involved. Our third approach uses symbolic matrix manipulation, substituting numbers only in the last step.
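The final substitution step of the third approach, forming the product of the inverse matrix and the column of constants, is short enough to sketch directly. The numbers below are those of equations 4-11 and 4-12; the variable names are our own. Because every entry is a multiple of 1/16, ordinary floating-point arithmetic happens to be exact here, which would not be true in general.

```python
# Inverse of the coefficient matrix, from equation 4-11
A_inv = [[-0.375, 0.25, 0.0625],
         [1.375, -0.25, -0.0625],
         [-1.25, 0.5, -0.125]]

# Right-hand sides of the simultaneous equations
C = [-2, 6, 14]

# Each unknown is the sum-of-products of one row of the inverse with the column C
unknowns = [sum(element * constant for element, constant in zip(row, C))
            for row in A_inv]
a, b, c = unknowns
print(a, b, c)
```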


MULTIPLE LINEAR REGRESSION

In Chapters 2 and 3, we discussed the rules related to solving systems of linear equations using elementary algebraic manipulation, including simple matrix operations. The past chapters have described the inverse and transpose of a matrix in at least an introductory fashion. In this installment we would like to introduce the concepts of matrix algebra and their relationship to multiple linear regression (MLR). Let us start with the basic spectroscopic calibration relationship:

Concentration = Bias + (Regression Coefficient 1) × (Absorbance at Wavelength 1) + (Regression Coefficient 2) × (Absorbance at Wavelength 2)

Also written as:

$$\text{Concentration} = \beta_0 + \beta_1 A_1 + \beta_2 A_2 \tag{4-15}$$

In this example we state that the concentration of an analyte within a sample is a linear combination of two variables. These variables, in our case, are measured in the same units, that is, Absorbance units. In this case the concentration is known as the dependent variable or response variable because its magnitude depends on, or responds to, changes in the Absorbances at Wavelengths 1 and 2. The Absorbances are the x-variables, referred to as independent variables, regressor variables, or predictor variables. Thus an equation such as equation 4-15 attempts to explain the relationship between concentration and changes in Absorbance. This calibration equation or calibration model is said to be linear because the relationship is a linear combination of multiplier terms or regression coefficients as predictors of the concentration (response or dependent variable). Note that the $\beta_1$ and $\beta_2$ terms are called Regression Coefficients, Multiplier Terms, Multipliers, or sometimes Parameters. The analysis described is referred to as Linear Regression, Least-Squares, Linear Least-Squares, or most properly, MLR. In more formal notation, we can rewrite equation 4-15 as:

$$E(c_j) = \beta_0 + \beta_1 A_1 + \beta_2 A_2 \tag{4-16}$$

where $E(c_j)$ is the expected value for the concentration. Note: the difference between $E(c_j)$ and $c_j$ is the difference between the predicted or expected value $E(c_j)$ and the actual or observed value $c_j$. This can be rewritten as:

$$c_j - E(c_j) = c_j - (\beta_0 + \beta_1 A_1 + \beta_2 A_2) \tag{4-17}$$

and

$$c_j = \beta_0 + \beta_1 A_1 + \beta_2 A_2 + \varepsilon_j \tag{4-18}$$

where $\varepsilon_j$ is termed the Prediction Error, Residual Error, Residual, Error, Lack of Fit Error, or the Unexplained Error.


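We can also rewrite the equation in matrix form as:

$$\mathbf{C} = \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_N \end{bmatrix} \qquad \mathbf{A} = \begin{bmatrix} 1 & A_{11} & A_{12} \\ 1 & A_{21} & A_{22} \\ 1 & A_{31} & A_{32} \\ \vdots & \vdots & \vdots \\ 1 & A_{N1} & A_{N2} \end{bmatrix} \qquad \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} \qquad \boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \vdots \\ \varepsilon_N \end{bmatrix} \tag{4-19}$$

This equation of the model in matrix notation is written as:

$$\mathbf{C} = \mathbf{A}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \tag{4-20}$$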

THE LEAST SQUARES METHOD The problem now becomes: how do we handle the situation in which we have more equations than unknowns? When there are fewer equations than unknowns it is clear that there is not enough information available to determine the values of the unknown variables. When we have more equations than unknowns, however, we would seem to have the problem of having too much information; how do we handle all this extra information and put it to use? For example, consider the following set of simultaneous equations: 1a + 1b + 0c = −2

(4-21a)

4a + 2b + 1c = 6

(4-21b)

6a − 2b − 4c = 14

(4-21c)

1a + 3b + −1c = −15

(4-21d)

This is a set of four equations in three unknowns. The first three of these equations are the ones we dealt with above, and we have seen that the solution to the first three equations is

$$a = 3.125 \tag{4-22a}$$

$$b = -5.125 \tag{4-22b}$$

$$c = 3.75 \tag{4-22c}$$

However, when we replace a, b and c in equation 4-21d by those values, we find that 1 × 3.125 + 3 × (−5.125) + (−1) × 3.75 = −16 rather than the −15 that the equation specifies. If we were to use different subsets of three of these equations at a time, we would obtain different answers depending on which set of three equations we used. There seems to be an inconsistency here, yet in the set of four equations represented by equations 4-21 (a–d) all the equations have the same significance; there are no a priori criteria for eliminating any one of them. This is the situation we must handle. We cannot simply ignore one or more of these equations arbitrarily; dealing with them properly has become known variously as the Least Squares method, Multiple Least Squares, or Multiple Linear Regression. As spectroscopists, we are concerned with the application of these mathematical techniques to the solution of spectroscopic problems, particularly the use of spectroscopy to perform quantitative analysis, which is done by applying these concepts to a set of linear equations, as we will see. In this least squares method example the object is to calculate the terms $\beta_0$, $\beta_1$ and $\beta_2$ which produce a prediction model yielding the smallest or “least squared” differences or residuals between the actual analyte value $c_j$ and the predicted or expected concentration $E(c_j)$. To calculate the multiplier terms or regression coefficients $\beta_j$ for the model we can begin with the matrix notation:

$$\mathbf{A}'\mathbf{A}\hat{\boldsymbol{\beta}} = \mathbf{A}'\mathbf{C} \tag{4-23}$$

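When solving for $\hat{\beta}$ the expression becomes

$$\hat{\beta} = (A'A)^{-1}A'C \tag{4-24}$$

To illustrate the matrix algebra involved for this problem we write

$$A'A =
\begin{bmatrix}
\sum_j 1^2 & \sum_j A_{j1} \times 1 & \sum_j A_{j2} \times 1 \\
\sum_j 1 \times A_{j1} & \sum_j A_{j1} \times A_{j1} & \sum_j A_{j2} \times A_{j1} \\
\sum_j 1 \times A_{j2} & \sum_j A_{j1} \times A_{j2} & \sum_j A_{j2} \times A_{j2}
\end{bmatrix}
=
\begin{bmatrix}
N & \sum_j A_{j1} & \sum_j A_{j2} \\
\sum_j A_{j1} & \sum_j A_{j1}^2 & \sum_j A_{j1}A_{j2} \\
\sum_j A_{j2} & \sum_j A_{j1}A_{j2} & \sum_j A_{j2}^2
\end{bmatrix}
\tag{4-25}$$

Then rewriting in summation notation we have

$$\sum_{j=1}^{N} 1^2 = N, \qquad \sum_{j=1}^{N} A_{j1} \times A_{j2} = \sum_j A_{j1}A_{j2}, \qquad \text{and} \qquad \sum_{j=1}^{N} A_{j1} = \sum_j A_{j1} \tag{4-26}$$

Note that $A'C$ is also required for the computations (see equation 4-24) and is given as:

$$A'C =
\begin{bmatrix}
\sum_j 1 \times C_j \\
\sum_j A_{j1}C_j \\
\sum_j A_{j2}C_j
\end{bmatrix}
=
\begin{bmatrix}
\sum_j C_j \\
\sum_j A_{j1}C_j \\
\sum_j A_{j2}C_j
\end{bmatrix}
\tag{4-27}$$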


If we represent our spectroscopic data using the following symbols:

j = Spectrum number
C_j = Actual concentration for each spectrum
N = Rank of each spectrum (1)
A_j1 = Absorbance at Wavelength 1
A_j2 = Absorbance at Wavelength 2

then from this information we can calculate $\hat{\beta}$ (see equation 4-8) using

$$C = \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_j \end{bmatrix},
\qquad
A = \begin{bmatrix}
1 & A_{11} & A_{12} \\
1 & A_{21} & A_{22} \\
1 & A_{31} & A_{32} \\
\vdots & \vdots & \vdots \\
1 & A_{j1} & A_{j2}
\end{bmatrix},
\qquad
A'C = \begin{bmatrix} \sum_j C_j \\ \sum_j A_{j1}C_j \\ \sum_j A_{j2}C_j \end{bmatrix}
\tag{4-28}$$

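If we then calculate the inverse of $A'A$, written as $(A'A)^{-1}$, the computations are nearly complete and we finally obtain

$$\hat{\beta} = (A'A)^{-1}A'C = \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix} \tag{4-29}$$

which in conclusion gives the completed regression equation

$$E(\hat{C}) = \hat{\beta}_0 + \hat{\beta}_1 A_1 + \hat{\beta}_2 A_2 \tag{4-30}$$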
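Returning to the four equations 4-21a to 4-21d, the route through equations 4-23 and 4-24 can be sketched in code. This is a minimal illustration under stated assumptions, not a definitive implementation: the helper names `solve` and `least_squares` are hypothetical, the matrix `A` here holds the coefficients of a, b and c (there is no column of ones, because equations 4-21 contain no constant term), and plain floating-point arithmetic is assumed to be adequate for a system this small.

```python
def solve(matrix, rhs):
    """Solve a square system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [[float(v) for v in row] + [float(r)] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def least_squares(A, C):
    """Form A'A and A'C (equation 4-23) and solve for the coefficients (4-24)."""
    columns = list(zip(*A))  # the rows of A'
    AtA = [[sum(x * y for x, y in zip(ci, cj)) for cj in columns] for ci in columns]
    AtC = [sum(x * c for x, c in zip(ci, C)) for ci in columns]
    return solve(AtA, AtC)


# Equations 4-21a to 4-21d: one row per equation, columns are a, b, c.
A = [[1, 1, 0], [4, 2, 1], [6, -2, -4], [1, 3, -1]]
C = [-2, 6, 14, -15]

coefficients = least_squares(A, C)
residuals = [c - sum(x * b for x, b in zip(row, coefficients)) for row, c in zip(A, C)]
sum_of_squares = sum(r * r for r in residuals)
print(coefficients, sum_of_squares)
```

The exact solution of the first three equations leaves a residual of 1 in the fourth, so its sum of squared residuals is 1; the least squares coefficients spread the misfit over all four equations and give a smaller sum.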

In our next installment, we will review the “how to” of the matrix operations for this example using numerical data. Authors’ note: This initial chapter dealing with matrix algebra and regression has been adapted for spectroscopic nomenclature from Shayle R. Searle’s book, Matrix Algebra Useful for Statistics (John Wiley & Sons, New York, 1982), pp. 363–368. Other particularly useful reference sources with page numbers are listed below as [1–3].


REFERENCES

1. Draper, N.R. and Smith, H., Applied Regression Analysis (John Wiley & Sons, New York, 1981), pp. 70–87.
2. Kleinbaum, D.G. and Kupper, L.L., Applied Regression Analysis and Other Multivariable Methods (Duxbury Press, Boston, 1978), pp. 508–520.
3. Workman, J., Jr. and Mark, H., Spectroscopy 8(7), 16–19 (1993).

5

Matrix Algebra and Multiple Linear Regression: Part 2

In the previous chapter we presented the problem of fitting data when there is more information (in the form of equations relating the several variables involved) available than the minimum amount that will allow for the solution of the equations. We then presented the matrix equations for calculating the least squares solution to this case of overdetermined variables. How did we get from one to the other?

As we described the situation, when there are more equations than unknowns, one possibility is to ignore some of the equations. This is unsatisfactory, for a number of reasons. In the first place, there is no a priori criterion for deciding which equations to ignore, so that any choice is arbitrary. Secondly, by rejecting some of the equations, we are also rejecting and wasting the work that went into the collection of the data represented by those equations. Thirdly, and perhaps most importantly, when we ignore some of the equations, we are also ignoring the (rather important) fact that the lack of perfect fit to all the equations is itself an important piece of information. What the set of equations is telling us in this case is that there is, in fact, not a perfect fit of the data, taken as a whole, to any of the equations in the set. Rather, there is some average equation that in some sense gives a best fit to all of the data taken as a set, without favoring any particular subset of them. It is this “average” equation that we would like to be able to find.

In the history of the development of mathematics, one important branch was the study of the behavior of randomness. Initially, there were no highfalutin ideas of making “science” out of what appeared to be disorder; rather, the investigations of random phenomena that led to what we now know as the science of Statistics began as studies of the behavior of the random phenomena that existed in the somewhat more prosaic context of gambling.
It was not until much later that the recognition came that the same random phenomena that affected, say, dice, also affected the values obtained when physical measurements were made. By the time this realization arose, it was well recognized that random phenomena were describable only by probabilistic statements; by definition it is not possible to state a priori what the outcome of any given random event will be. Thus, when the attention of the mathematicians of the time turned to the description of overdetermined systems, such as we are dealing with here, it was natural for them to seek the desired solution in terms of probabilistic descriptions. They then defined the “best fitting” equation for an overdetermined set of data as being the “most probable” equation, or, in more formal terminology, the “maximum likelihood” equation. Under the proper conditions (said conditions being that the errors that prevent all the data relationships from being described by a single equation are normally [1, 2] distributed) it can be proven mathematically that the “most probable” equation is exactly the one that is the “least square” equation. While we have discussed this point briefly in the past [3] it is, perhaps, appropriate at this point to revisit it in a bit more detail.

The basis upon which this concept rests is the very fact that not all the data follows the same equation. Another way to express this is to note that an equation describes a line (or more generally, a plane or hyperplane if more than two dimensions are involved; in fact, anywhere in this discussion, when we talk about a calibration line, you should mentally add the phrase “… or plane, or hyperplane …”). Thus any point that fits the equation will fall exactly on the line. On the other hand, since the data points themselves do not fall on the line (recall that, by definition, the line is generated by applying some sort of [at this point undefined] averaging process), any given data point will not, in general, fall on the line described by the equation. The difference between these two points, the one on the line described by the equation and the one described by the data, is the error in the estimate of that data point by the equation. For each of the data points there is a corresponding point described by the equation, and therefore a corresponding error. The least square principle states that the sum of the squares of all these errors should have a minimum value; and as we stated above, this will also provide the “maximum likelihood” equation.

It is certainly true that for any arbitrarily chosen equation, we can calculate the point described by that equation that corresponds to any given data point. Having done that for each of the data points, we can easily calculate the error for each data point, square these errors, and add together all these squares. Clearly, the sum of squares of the errors we obtain by this procedure will depend upon the equation we use, and some equations will provide smaller sums of squares than other equations.
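The least square principle just described can be written compactly. The lines below are our own restatement in the matrix notation of the previous chapter ($C$ for the concentrations, $A$ for the absorbance matrix, $\beta$ for the coefficients, $e_j$ for the individual errors), assuming that $A'A$ is invertible; they sketch where the normal equations come from and are not offered as a proof of uniqueness.

```latex
S(\beta) = \sum_{j=1}^{N} e_j^{2} = (C - A\beta)'(C - A\beta),
\qquad
\frac{\partial S}{\partial \beta} = -2A'(C - A\beta) = 0
\;\Longrightarrow\;
A'A\hat{\beta} = A'C
\;\Longrightarrow\;
\hat{\beta} = (A'A)^{-1}A'C .
```

Setting the derivative of the sum of squares to zero is what turns “smallest sum of squared errors” into a set of linear equations that can be solved directly.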
It is not necessarily intuitively obvious that there is one and only one equation that will provide the smallest possible sum of squares of these errors under these conditions; however, it has been proven mathematically to be so. This proof is very abstruse and difficult. In fact, it is easier to find the equation that provides this “least square” solution than it is to prove that the solution is unique. A reasonably accessible demonstration, expressed in both algebraic and matrix terms, of how to find the least square solution is available.

Even though regression analysis (one of the more common names for the application of the least square principle) is a general mathematical technique, when we are dealing with spectroscopic data, so that the equation we wish to fit must be fitted to data obtained from systems that follow Beer’s law, it is convenient to limit our discussion to the properties of spectroscopic systems. Thus we will couch our discussion in terms of quantitative analysis performed using spectroscopic data; then the dependent variable of the least square regression analysis (usually called the “Y” variable by mathematicians) will represent the concentration of analyte in the set of samples used to calibrate the system, and the independent (or “X”) variable will represent absorbance values measured by a suitable instrument in whichever spectral region we are dealing with.

We will begin our discussion by demonstrating that, for a non-overdetermined system of equations, the algebraic approach and the least-square approach provide the same solution. We will then extend the discussion to the case of an overdetermined system of equations. Therefore this chapter will continue the multiple linear regression (MLR) discussion introduced in the previous chapter, by solving a numerical example for MLR. Recalling the basic ultraviolet, visible, near-infrared, and infrared use of MLR for spectroscopic calibration, we have

Concentration = Constant term (or Bias) + (Regression coefficient 1) × (Absorbance at wavelength 1) + (Regression coefficient 2) × (Absorbance at wavelength 2) + ⋯ + (Regression coefficient N) × (Absorbance at wavelength N)

Also written in equation form as:

$$\text{Concentration} = \beta_0 + \beta_1 A_{\lambda 1} + \beta_2 A_{\lambda 2} + \cdots + \beta_N A_{\lambda N} \tag{5-1}$$

By including an error term, we can write the equation as:

$$\text{Concentration} = \beta_0 + \beta_1 A_{\lambda 1} + \beta_2 A_{\lambda 2} + \cdots + \beta_N A_{\lambda N} + e \tag{5-2}$$

And also in expanded matrix form as:

$$c = \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_M \end{bmatrix},
\quad
A = \begin{bmatrix}
A_{11} & A_{12} & A_{13} & A_{14} & \cdots & A_{1N} \\
A_{21} & A_{22} & A_{23} & A_{24} & \cdots & A_{2N} \\
\vdots & \vdots & \vdots & \vdots & & \vdots \\
A_{M1} & A_{M2} & A_{M3} & A_{M4} & \cdots & A_{MN}
\end{bmatrix},
\quad
\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \vdots \\ \beta_N \end{bmatrix},
\quad
e = \begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ \vdots \\ e_M \end{bmatrix}
\tag{5-3}$$

and in simplified matrix notation, the equation is

$$c = A\beta + e \tag{5-4}$$

Because we have limited time and space, let us solve our problem using two wavelengths (or frequencies) and a basic calculator. To define the problem, we start with a set of calibration samples with the characteristics listed in Table 5-1.

Table 5-1 Characteristics of the calibration samples

Sample number   Concentration   Signal at wavelength 1   Signal at wavelength 2
1               2.0             0.75                     0.28
2               4.0             0.51                     0.485
3               7.0             0.32                     0.78

The system of equations for solving this problem can be written as

$$2.0 = \beta_0 + \beta_1(0.75) + \beta_2(0.28) \tag{5-5a}$$

$$4.0 = \beta_0 + \beta_1(0.51) + \beta_2(0.485) \tag{5-5b}$$

$$7.0 = \beta_0 + \beta_1(0.32) + \beta_2(0.78) \tag{5-5c}$$


and in simplified matrix form as

$$C = [A] \cdot [\beta] \tag{5-6}$$

and written in matrix form (with the constant term as the third column, so that the coefficients are ordered to match) as:

$$C = \begin{bmatrix} 2.0 \\ 4.0 \\ 7.0 \end{bmatrix},
\qquad
\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_0 \end{bmatrix},
\qquad
A = \begin{bmatrix} 0.75 & 0.28 & 1 \\ 0.51 & 0.485 & 1 \\ 0.32 & 0.78 & 1 \end{bmatrix}
\tag{5-7}$$

The augmented matrix formed by $[A|C]$ is designated as:

$$[A|C] = \left[\begin{array}{ccc|c}
0.75 & 0.28 & 1 & 2.0 \\
0.51 & 0.485 & 1 & 4.0 \\
0.32 & 0.78 & 1 & 7.0
\end{array}\right]
\tag{5-8}$$

The first task is to use elementary matrix row operations to manipulate matrix $[A|C]$ to yield zeros in rows II and III of column I. The row operations are to replace row II by row II minus 0.68 times row I, that is, II = II − 0.68 × I, followed by replacing row III by row III minus 0.4267 times row I, that is, III = III − 0.4267 × I. To complete our row operations we must place a zero in column II of row III (so that columns I and II of row III are both zero) by replacing row III by row III minus 2.242 times row II, that is, III = III − 2.242 × II. These row operations yield (remember to keep as much precision as possible in your calculations):

$$\left[\begin{array}{ccc|c}
0.75 & 0.28 & 1 & 2.0 \\
0 & 0.2946 & 0.32 & 2.64 \\
0 & 0 & -0.1442 & 0.2274
\end{array}\right]
\tag{5-9}$$

In summary, by using two series of row operations, namely II = II − 0.68 I and III = III − 0.4267 I, followed by III = III − 2.242 II, we have

$$\left[\begin{array}{ccc|c}
0.75 & 0.28 & 1 & 2.0 \\
0.51 & 0.485 & 1 & 4.0 \\
0.32 & 0.78 & 1 & 7.0
\end{array}\right]
\sim
\left[\begin{array}{ccc|c}
0.75 & 0.28 & 1 & 2.0 \\
0 & 0.2946 & 0.32 & 2.64 \\
0 & 0 & -0.1442 & 0.2274
\end{array}\right]
\tag{5-10}$$

These two matrices (original and final) are row equivalent because the right matrix was formed from the left matrix using simple row operations. The final matrix is equivalent to the set of equations shown below:

$$0.75\beta_1 + 0.28\beta_2 + 1.0\beta_0 = 2.0 \tag{5-11a}$$

$$0.2946\beta_2 + 0.32\beta_0 = 2.64 \tag{5-11b}$$

$$-0.1442\beta_0 = 0.2274 \tag{5-11c}$$

Now solving the system of equations yields $(-0.1442)\beta_0 = 0.2274$, so $\beta_0 = -1.577$; solving for $\beta_2$, we find $(0.2946)\beta_2 + 0.32(-1.577) = 2.64$, so $\beta_2 = 10.674$; solving for $\beta_1$ yields $(0.75)\beta_1 + 0.28(10.674) + 1(-1.577) = 2.0$, so $\beta_1 = 0.784$.
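The row operations and back-substitution above can be sketched in a few lines of Python. This is an illustration only: the function names `forward_eliminate` and `back_substitute` are hypothetical, no row swapping is attempted (the pivots in this example are all nonzero), and the columns follow the order used in the text (wavelength 1, wavelength 2, constant term).

```python
def forward_eliminate(augmented):
    """Reduce an augmented matrix to upper-triangular form using row operations."""
    rows = [[float(v) for v in row] for row in augmented]
    n = len(rows)
    for col in range(n - 1):
        for r in range(col + 1, n):
            # Row r = row r - factor x pivot row, e.g. II = II - 0.68 x I.
            factor = rows[r][col] / rows[col][col]
            rows[r] = [v - factor * p for v, p in zip(rows[r], rows[col])]
    return rows


def back_substitute(upper):
    """Solve the triangular system from the last row upward."""
    n = len(upper)
    solution = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(upper[i][k] * solution[k] for k in range(i + 1, n))
        solution[i] = (upper[i][n] - known) / upper[i][i]
    return solution


# Augmented matrix [A|C]: columns are A1, A2, constant term, concentration.
augmented = [[0.75, 0.28, 1, 2.0],
             [0.51, 0.485, 1, 4.0],
             [0.32, 0.78, 1, 7.0]]

upper = forward_eliminate(augmented)
beta1, beta2, beta0 = back_substitute(upper)
print(beta0, beta1, beta2)
```

Because the code carries full floating-point precision through every step, its results differ slightly in the last digit from the hand-rounded values in the text.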


And so,

$$\beta_0 = -1.577, \qquad \beta_1 = 0.784, \qquad \beta_2 = 10.674$$

Substituting into the original equations and calculating the differences between predicted and actual results, we find the results shown in Table 5-2.

Table 5-2 Results after substituting into the original equations and calculating the differences between predicted and actual results (using manual row operations)

Sample number   β0 + β1(A_λ1) + β2(A_λ2)                 Predicted − Actual   Residual
1               −1.577 + 0.784(0.75) + 10.674(0.28)      2.0 − 2.0            0
2               −1.577 + 0.784(0.51) + 10.674(0.485)     4.0 − 4.0            0
3               −1.577 + 0.784(0.32) + 10.674(0.78)      7.0 − 7.0            0

The foregoing discussion is all based on one important assumption: that the equation describing the relationship between the data does, in fact, include a constant term. If Beer’s law is strictly followed, however, when the concentration of all absorbing constituents is zero, then the absorbance (at all wavelengths, no less) is also zero; that is, the equation describing the relationship between the data generates a line that passes through the origin. If this condition holds, then the constant term of the equation is also exactly zero, and may be dropped from the equation. It has been shown possible to generate a least squares expression for this case also, that is, with the constant of the equation forced to be zero: it is merely necessary to formulate the expression for the prediction equation as:

$$\text{Conc} = \beta_1 A_1 + \beta_2 A_2 \tag{5-11d}$$

Starting from this expression, one can execute the derivation just as in the case of the full equation (i.e., the equation including the constant term), and arrive at a set of equations that result in the least square expression for an equation that passes through the origin. We will not dwell on this point since it is not common in practice. However, we will use this concept to fit the data presented, just to illustrate its use, and for the sake of comparison, ignoring the fact that without the constant term these data are overdetermined, while they are not overdetermined if the constant term is included; if we had more data (even only one more relationship), they would be overdetermined in both cases.

Then, if the equation system is solved with no constant term ($\beta_0$), we have the following results (you can either take our word for it or perform the row operations for yourself. Exercise for the reader: do those row operations.): $\beta_2(0.2946) = 2.64$, so $\beta_2 = 8.9613$; and $(0.75)\beta_1 + 0.28(8.9613) = 2.0$, so $\beta_1 = -0.679$. And so,

$$\beta_1 = -0.679, \qquad \beta_2 = 8.9613$$

and the results are shown in Table 5-3.

Table 5-3 Results when there is no constant (bias) term after substituting into the original equations and calculating the differences between predicted and actual results

Sample number   β1(A_λ1) + β2(A_λ2)               Predicted − Actual   Residual
1               −0.679(0.75) + 8.9613(0.28)       2.0 − 2.0            0.0
2               −0.679(0.51) + 8.9613(0.485)      4.0 − 4.0            0.0
3               −0.679(0.32) + 8.9613(0.78)       6.77 − 7.0           −0.23

Another exercise for the reader: Why is a bias term often used in regression for spectroscopic data?

THE POWER OF MATRIX MATHEMATICS

Now let us see what happens when we use pure, unadulterated matrix power to solve this equation system, such that

$$A'A\hat{\beta} = A'C \tag{5-12}$$

as equation 4-23 showed us. When solving for the regression coefficients ($\beta$), we have

$$\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} = \hat{\beta} = (A'A)^{-1}A'C \tag{5-13}$$

Noting the matrix algebra for this problem (Equation 25 from reference [1])

$$A'A =
\begin{bmatrix}
\sum_j A_{j0}^2 & \sum_j A_{j1}A_{j0} & \sum_j A_{j2}A_{j0} \\
\sum_j A_{j0}A_{j1} & \sum_j A_{j1}^2 & \sum_j A_{j2}A_{j1} \\
\sum_j A_{j0}A_{j2} & \sum_j A_{j1}A_{j2} & \sum_j A_{j2}^2
\end{bmatrix}
=
\begin{bmatrix}
N & \sum_j A_{j1} & \sum_j A_{j2} \\
\sum_j A_{j1} & \sum_j A_{j1}^2 & \sum_j A_{j2}A_{j1} \\
\sum_j A_{j2} & \sum_j A_{j1}A_{j2} & \sum_j A_{j2}^2
\end{bmatrix}
\tag{5-14}$$

and substituting the numbers from our current example, we illustrate the following steps:

$$A = \begin{bmatrix} 1 & 0.75 & 0.28 \\ 1 & 0.51 & 0.485 \\ 1 & 0.32 & 0.78 \end{bmatrix} \tag{5-15}$$

and so the transpose of $A$ (which is $A'$) is

$$A' = \begin{bmatrix} 1 & 1 & 1 \\ 0.75 & 0.51 & 0.32 \\ 0.28 & 0.485 & 0.78 \end{bmatrix} \tag{5-16}$$


and to continue, $A$ transpose ($A'$) times $A$ is

$$A'A =
\begin{bmatrix}
1 \times 1 + 1 \times 1 + 1 \times 1 & 1 \times 0.75 + 1 \times 0.51 + 1 \times 0.32 & 1 \times 0.28 + 1 \times 0.485 + 1 \times 0.78 \\
0.75 \times 1 + 0.51 \times 1 + 0.32 \times 1 & 0.75 \times 0.75 + 0.51 \times 0.51 + 0.32 \times 0.32 & 0.75 \times 0.28 + 0.51 \times 0.485 + 0.32 \times 0.78 \\
0.28 \times 1 + 0.485 \times 1 + 0.78 \times 1 & 0.28 \times 0.75 + 0.485 \times 0.51 + 0.78 \times 0.32 & 0.28 \times 0.28 + 0.485 \times 0.485 + 0.78 \times 0.78
\end{bmatrix}
=
\begin{bmatrix}
3 & 1.58 & 1.545 \\
1.58 & 0.925 & 0.707 \\
1.545 & 0.707 & 0.922
\end{bmatrix}
\tag{5-17}$$

Next we need to calculate the inverse of $[A'A]$, designated $[A'A]^{-1}$. Because $A'A$ is a 3 × 3 problem, we had better use a computer program suitably equipped to calculate the inverse [2]. The inversion amounts to row-reducing $A'A$ to the identity matrix:

$$\begin{bmatrix}
3 & 1.58 & 1.545 \\
1.58 & 0.925 & 0.707 \\
1.545 & 0.707 & 0.922
\end{bmatrix}
\sim
\begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{bmatrix}
\tag{5-18}$$

Exercise for the reader: See if you are able to determine all the row operations required to find the inverse of $A'A$. (We recommend you set aside the better part of an afternoon to work this one through!) The augmented form is written as

$$\left[\begin{array}{ccc|ccc}
3 & 1.58 & 1.545 & 1 & 0 & 0 \\
1.58 & 0.925 & 0.707 & 0 & 1 & 0 \\
1.545 & 0.707 & 0.922 & 0 & 0 & 1
\end{array}\right]
\tag{5-19}$$

Thanks to the power of computers we find that the inverse of $A'A$ is

$$[A'A]^{-1} =
\begin{bmatrix}
348.0747 & -359.3786 & -307.7061 \\
-359.3786 & 373.6609 & 315.6969 \\
-307.7061 & 315.6969 & 274.639
\end{bmatrix}
\tag{5-20}$$

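Then the next step is to calculate

$$A'c =
\begin{bmatrix} \sum_j A_{j0}c_j \\ \sum_j A_{j1}c_j \\ \sum_j A_{j2}c_j \end{bmatrix}
=
\begin{bmatrix} \sum_j c_j \\ \sum_j A_{j1}c_j \\ \sum_j A_{j2}c_j \end{bmatrix}
=
\begin{bmatrix} 1 & 1 & 1 \\ 0.75 & 0.51 & 0.32 \\ 0.28 & 0.485 & 0.78 \end{bmatrix}
\cdot
\begin{bmatrix} 2.0 \\ 4.0 \\ 7.0 \end{bmatrix}
=
\begin{bmatrix}
1(2) + 1(4) + 1(7) \\
0.75(2) + 0.51(4) + 0.32(7) \\
0.28(2) + 0.485(4) + 0.78(7)
\end{bmatrix}
=
\begin{bmatrix} 13 \\ 5.78 \\ 7.96 \end{bmatrix}
\tag{5-21}$$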


To solve for the regression coefficients ($\beta_i$), we are required to calculate $(A'A)^{-1}A'C$ as follows (see equation 5-13):

$$\beta = (A'A)^{-1}A'C =
\begin{bmatrix}
348.0747 & -359.3786 & -307.7061 \\
-359.3786 & 373.6609 & 315.6969 \\
-307.7061 & 315.6969 & 274.639
\end{bmatrix}
\cdot
\begin{bmatrix} 13.0 \\ 5.78 \\ 7.96 \end{bmatrix}
=
\begin{bmatrix}
348.0747(13) + (-359.3786)(5.78) + (-307.7061)(7.96) \\
(-359.3786)(13) + 373.6609(5.78) + 315.6969(7.96) \\
(-307.7061)(13) + 315.6969(5.78) + 274.639(7.96)
\end{bmatrix}
=
\begin{bmatrix} -1.577 \\ 0.786 \\ 10.675 \end{bmatrix}
=
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}
\tag{5-22}$$

And, checking our work, we arrive at Table 5-4.

Now, if we took our original set of data, as expressed in equations 5-5a–5-5c, and added one more relationship to them, we come up with the following situation:

$$2.0 = b_0 + b_1(0.75) + b_2(0.28) \tag{5-23a}$$

$$4.0 = b_0 + b_1(0.51) + b_2(0.485) \tag{5-23b}$$

$$7.0 = b_0 + b_1(0.32) + b_2(0.78) \tag{5-23c}$$

$$8.0 = b_0 + b_1(0.40) + b_2(0.79) \tag{5-23d}$$

Now we have the situation we discussed earlier: we have four relationships among a set of data, and only three possible variables (even including the b0 term) that we can use to fit these data. We can solve any subset of three of these relationships, simply by leaving one of the four equations out of the solution. If we do that we come up with the following table of results (we forbear to show all the computations here; however, we do recommend to our readers that they do one or two of these, for the practice):

                                             b0           b1          b2
Eliminating the first equation (5-23a):      −9.47843     10.39215    16.86274
Eliminating the second equation (5-23b):     −10.86455    10.15801    18.73589
Eliminating the third equation (5-23c):      −5.20039     4.1461      14.6100
Eliminating the fourth equation (5-23d):     −1.5777      0.78492     10.675

Table 5-4 Results after substituting into the original equations and calculating the differences between predicted and actual results (using MATLAB calculations)

Sample number   β0 + β1(A_λ1) + β2(A_λ2)                 Predicted − Actual   Residual
1               −1.577 + 0.786(0.75) + 10.675(0.28)      2.002 − 2.0          0.002
2               −1.577 + 0.786(0.51) + 10.675(0.485)     4.001 − 4.0          0.001
3               −1.577 + 0.786(0.32) + 10.675(0.78)      7.001 − 7.0          0.001
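The matrix calculation of equations 5-12 through 5-22, and the check summarized in Table 5-4, can be sketched in Python as follows. This is a minimal illustration rather than the calculation used to produce the table: the helper names `transpose`, `matmul` and `invert` are hypothetical, and the inverse is found by row-reducing the augmented form of equation 5-19 in ordinary floating-point arithmetic.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(x, y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def invert(m):
    """Invert a square matrix by row-reducing the augmented form [m | I]."""
    n = len(m)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Equation 5-15: a column of ones, then the signals at wavelengths 1 and 2.
A = [[1, 0.75, 0.28], [1, 0.51, 0.485], [1, 0.32, 0.78]]
C = [[2.0], [4.0], [7.0]]

At = transpose(A)
AtA = matmul(At, A)      # equation 5-17
AtC = matmul(At, C)      # equation 5-21
beta = [row[0] for row in matmul(invert(AtA), AtC)]  # equation 5-22

predicted = [sum(a * b for a, b in zip(row, beta)) for row in A]
residuals = [p - c[0] for p, c in zip(predicted, C)]
print(beta, residuals)
```

Without intermediate rounding of the entries of A'A, the coefficients come out closer to the row-operation values (about 0.785 for the second coefficient rather than 0.786), and the residuals are zero to within floating-point error.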


The last entry in this table, the results obtained from eliminating the fourth equation (5-23d), represents of course the results obtained from the original set of three equations, since eliminating that equation from the set leaves us with exactly that same set. However, even though there does not seem to be much difference between the various equations 5-23a–5-23d, it is clear that the fitting equation depends very strongly upon which subset of these equations we choose to keep in our calculations. Thus we see that we cannot arbitrarily select any subset of the data to use in our computations; it is critical to keep all the data, in order to achieve the correct result, and that requires using the regression approach, as we discussed above. If we do that, then we find that the correct fitting equation is (again, this system of equations is simple enough to do for practice; the matrix inversion can be performed using the row operations as we described previously):

Regression results:

b0           b1          b2
−6.85120     6.15659     15.50951
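Both the leave-one-out solutions and the regression on all four relationships can be reproduced with a short sketch. As before, this is an illustration under stated assumptions: `solve` and `least_squares` are hypothetical helper names, the normal equations are solved by simple Gauss-Jordan elimination, and the four samples are those of equations 5-5a–5-5c plus the added relationship with concentration 8.0.

```python
def solve(matrix, rhs):
    """Solve a square system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [[float(v) for v in row] + [float(r)] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def least_squares(A, C):
    """Solve the normal equations A'A b = A'C for the coefficients b."""
    columns = list(zip(*A))
    AtA = [[sum(x * y for x, y in zip(ci, cj)) for cj in columns] for ci in columns]
    AtC = [sum(x * c for x, c in zip(ci, C)) for ci in columns]
    return solve(AtA, AtC)


# (concentration, signal at wavelength 1, signal at wavelength 2)
samples = [(2.0, 0.75, 0.28), (4.0, 0.51, 0.485), (7.0, 0.32, 0.78), (8.0, 0.40, 0.79)]
A = [[1.0, a1, a2] for _, a1, a2 in samples]
C = [c for c, _, _ in samples]

# Exact solutions obtained by leaving out one equation at a time (keys 1 to 4).
subset_fits = {}
for left_out in range(4):
    keep = [i for i in range(4) if i != left_out]
    subset_fits[left_out + 1] = solve([A[i] for i in keep], [C[i] for i in keep])

# Least squares fit to all four relationships: [b0, b1, b2].
regression = least_squares(A, C)

for equation, fit in subset_fits.items():
    print("without equation", equation, fit)
print("regression", regression)
```

Each three-equation subset is fitted exactly, while the regression fits none of the four exactly but uses all of them.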

Note, by the way, that if you thought that the regression solution would simply be the average of all the other solutions, you were wrong. By now some of you must be thinking that there must be an easier way to solve systems of equations than wrestling with manual row operations. Well, of course there are better ways, which is why we will refresh your memory on the concept of determinants in the next chapter. After we have introduced determinants we will conclude our introductory coverage of matrix algebra and MLR with some final remarks.

REFERENCES

1. Mark, H. and Workman, J., Statistics in Spectroscopy (Academic Press, Boston, 1991), pp. 45–56; see also Mark, H. and Workman, J., Spectroscopy 2(9), 37–43 (1987).
2. Mark, H., Principles and Practice of Spectroscopic Calibration (John Wiley & Sons, New York, 1991), pp. 21–24.
3. Mark, H. and Workman, J., Statistics in Spectroscopy (Academic Press, Boston, 1991), pp. 271–281; see also Mark, H. and Workman, J., Spectroscopy 7(3), 20–23 (1992).


6 Matrix Algebra and Multiple Linear Regression: Part 3 – The Concept of Determinants

In the previous chapter [1] we promised a discussion of an easier way to solve equation systems: the method of determinants [2]. To begin, given a 2 × 2 matrix $[A]$ as

$$A = \begin{bmatrix} a_1 & b_1 \\ a_2 & b_2 \end{bmatrix} \tag{6-1}$$

the determinant of $A$ is designated by

$$|A| = \begin{vmatrix} a_1 & b_1 \\ a_2 & b_2 \end{vmatrix} \tag{6-2}$$

Note that the brackets [ ] used to denote matrices are converted to vertical lines to denote a determinant. To continue, the determinant of $A$ is calculated this way:

$$A_{\det} = a_1 b_2 - a_2 b_1 \tag{6-3}$$

The determinant is found by cross-multiplying the diagonal elements in a matrix and subtracting one diagonal product from the other, such that

$$A_{\det} = \begin{vmatrix} a_1 & b_1 \\ a_2 & b_2 \end{vmatrix} = a_1 b_2 - a_2 b_1 \tag{6-4}$$

A numerical example is given as follows. Given $A$, find its determinant:

$$\text{If } A = \begin{bmatrix} 0.75 & 0.28 \\ 0.51 & 0.485 \end{bmatrix}
\text{ then } A_{\det} = \begin{vmatrix} 0.75 & 0.28 \\ 0.51 & 0.485 \end{vmatrix}
= 0.75 \times 0.485 - 0.28 \times 0.51 = 0.364 - 0.143 = 0.221 \tag{6-5}$$
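The arithmetic of equation 6-5 is easily checked in code. The function name `det2` below is our own illustrative choice; the sketch simply applies equation 6-3 to the matrix of the numerical example.

```python
def det2(matrix):
    """Determinant of a 2 x 2 matrix [[a1, b1], [a2, b2]] as a1*b2 - a2*b1."""
    (a1, b1), (a2, b2) = matrix
    return a1 * b2 - a2 * b1


A = [[0.75, 0.28], [0.51, 0.485]]
determinant = det2(A)
print(round(determinant, 3))  # prints 0.221
```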

To use determinants to solve a system of linear equations, we look at a simple application given two equations and two unknowns. For the equation system

$$C_1 = \beta_1 A_{k11} + \beta_2 A_{k12} \tag{6-6a}$$

$$C_2 = \beta_1 A_{k21} + \beta_2 A_{k22} \tag{6-6b}$$

we denote $\beta_1$ and $\beta_2$ as unknown regression coefficients. By algebraic manipulation, we can eliminate the $\beta_2$ term from the equation system by multiplying the first equation by $A_{k22}$ and the second equation by $A_{k12}$. By subtracting the two equations, we arrive at equations 6-7a through 6-7d:

$$A_{k22}C_1 = A_{k22}\beta_1 A_{k11} + A_{k22}\beta_2 A_{k12} \tag{6-7a}$$

$$(-)\quad A_{k12}C_2 = A_{k12}\beta_1 A_{k21} + A_{k12}\beta_2 A_{k22} \tag{6-7b}$$

$$A_{k22}C_1 - A_{k12}C_2 = A_{k22}\beta_1 A_{k11} - A_{k12}\beta_1 A_{k21} \tag{6-7c}$$

and

$$A_{k22}C_1 - A_{k12}C_2 = (A_{k22}A_{k11} - A_{k12}A_{k21})\beta_1 \tag{6-7d}$$

If the $(A_{k22}A_{k11} - A_{k12}A_{k21})$ term is nonzero, then we can divide this term into the above equation (6-7d) to arrive at

$$\beta_1 = \frac{A_{k22}C_1 - A_{k12}C_2}{A_{k22}A_{k11} - A_{k12}A_{k21}} \tag{6-8}$$

Note the denominator can be written as the determinant

$$\begin{vmatrix} A_{k11} & A_{k12} \\ A_{k21} & A_{k22} \end{vmatrix} \tag{6-9}$$

referred to as the determinant of coefficients. We can also write the numerator as the determinant:

$$\begin{vmatrix} C_1 & A_{k12} \\ C_2 & A_{k22} \end{vmatrix} \tag{6-10}$$

and so,

$$\beta_1 = \frac{\begin{vmatrix} C_1 & A_{k12} \\ C_2 & A_{k22} \end{vmatrix}}{\begin{vmatrix} A_{k11} & A_{k12} \\ A_{k21} & A_{k22} \end{vmatrix}} \tag{6-11}$$

We can also solve for $\beta_2$ by algebraic manipulation of the equation system. Elimination of the $\beta_1$ term is accomplished by multiplying the first equation by $A_{k21}$ and the second equation by $A_{k11}$ and subtracting the results, dividing by the common term, and lastly, by converting both the numerator and the denominator to determinants, finally arriving at equation 6-12.

$$\beta_2 = \frac{\begin{vmatrix} A_{k11} & C_1 \\ A_{k21} & C_2 \end{vmatrix}}{\begin{vmatrix} A_{k11} & A_{k12} \\ A_{k21} & A_{k22} \end{vmatrix}} \tag{6-12}$$


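To summarize what is referred to as Cramer’s rule, we can use the following general expressions given a system of two equations (6-13a and 6-13b) in two unknowns such that

$$C_1 = \beta_1 A_{k11} + \beta_2 A_{k12} \tag{6-13a}$$

$$C_2 = \beta_1 A_{k21} + \beta_2 A_{k22} \tag{6-13b}$$

We can generalize a solution to this system of equations by using the following determinant notation:

$$D = \begin{vmatrix} A_{k11} & A_{k12} \\ A_{k21} & A_{k22} \end{vmatrix},
\qquad
D_{\beta_1} = \begin{vmatrix} C_1 & A_{k12} \\ C_2 & A_{k22} \end{vmatrix},
\qquad
D_{\beta_2} = \begin{vmatrix} A_{k11} & C_1 \\ A_{k21} & C_2 \end{vmatrix}$$

And so, if $D \neq 0$, then we can solve for $\beta_1$ and $\beta_2$ using the relationships

$$\beta_1 = \frac{D_{\beta_1}}{D} = \frac{\begin{vmatrix} C_1 & A_{k12} \\ C_2 & A_{k22} \end{vmatrix}}{\begin{vmatrix} A_{k11} & A_{k12} \\ A_{k21} & A_{k22} \end{vmatrix}} \tag{6-14}$$

and

$$\beta_2 = \frac{D_{\beta_2}}{D} = \frac{\begin{vmatrix} A_{k11} & C_1 \\ A_{k21} & C_2 \end{vmatrix}}{\begin{vmatrix} A_{k11} & A_{k12} \\ A_{k21} & A_{k22} \end{vmatrix}} \tag{6-15}$$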
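Cramer’s rule for two equations in two unknowns can be sketched directly from equations 6-14 and 6-15. The function names `det2` and `cramer_2x2` are hypothetical, and the numbers used to exercise the sketch are an assumption on our part: the first two calibration samples of the previous chapter (concentrations 2.0 and 4.0) fitted with no constant term.

```python
def det2(a, b, c, d):
    """Determinant of the 2 x 2 array with rows (a, b) and (c, d)."""
    return a * d - b * c


def cramer_2x2(a11, a12, a21, a22, c1, c2):
    """Solve c1 = b1*a11 + b2*a12 and c2 = b1*a21 + b2*a22 by Cramer's rule."""
    d = det2(a11, a12, a21, a22)          # determinant of coefficients, D
    if d == 0:
        raise ValueError("determinant of coefficients is zero; no unique solution")
    d_beta1 = det2(c1, a12, c2, a22)      # first column replaced by the constants
    d_beta2 = det2(a11, c1, a21, c2)      # second column replaced by the constants
    return d_beta1 / d, d_beta2 / d


beta1, beta2 = cramer_2x2(0.75, 0.28, 0.51, 0.485, 2.0, 4.0)
print(beta1, beta2)
```

With these inputs the determinant of coefficients is the 0.221 of equation 6-5, and the two coefficients agree, to the precision quoted, with the zero-constant values −0.679 and 8.9613 given in the previous chapter.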

There are, of course, additional rules for solving larger equation systems. We will address this subject again in later chapters when we discuss multivariate calibration in greater depth.

REFERENCES

1. Workman, J., Jr. and Mark, H., Spectroscopy 9(1), 16–19 (1994).
2. Britton, J.R. and Bello, I., Topics in Contemporary Mathematics (Harper & Row, New York, 1984), pp. 445–451.


7

Matrix Algebra and Multiple Linear Regression:

Part 4 – Concluding Remarks

Our discussions on MLR in previous chapters are all based on one important assumption: that the equation describing the relationship between the data does include a constant term. If Beer’s law is strictly followed, however, when the concentration of all absorbing constituents is zero, then the absorbance (at all wavelengths, no less) is also zero; that is, the equation describing the relationship between the data generates a line that passes through the origin. If this condition holds, then the constant term of the equation is also exactly zero, and may be dropped from the equation. It has been shown possible to generate a least square expression for this case also, that is, with the constant of the equation forced to be zero: it is merely necessary to formulate the expression for the prediction equation as in equation 7-1:

$$\text{Conc} = b_1 A_1 + b_2 A_2 \tag{7-1}$$

Starting from this expression, one can execute the derivation just as in the case of the full equation (i.e., the equation including the constant term), and arrive at a set of equations that result in the least square expression for an equation that passes through the origin. We will not dwell on this point since it is not common in practice. We will use this concept to fit the data presented, just to illustrate its use, and for the sake of comparison, ignoring the fact that without the constant term these data are overdetermined, while they are not overdetermined if the constant term is included; if we had more data (even only one more relationship) they would be overdetermined in both cases.

If we take our original set of data, as expressed in equations 5-5a–5-5c [1], and add one more relationship to them, we come up with the following situation:

$$2.0 = b_0 + b_1(0.75) + b_2(0.28) \tag{7-2a}$$

$$4.0 = b_0 + b_1(0.51) + b_2(0.485) \tag{7-2b}$$

$$7.0 = b_0 + b_1(0.32) + b_2(0.78) \tag{7-2c}$$

$$8.0 = b_0 + b_1(0.40) + b_2(0.79) \tag{7-2d}$$

We now have the situation we discussed earlier: we have four relationships among a set of data, and only three possible variables (even including the b0 term) that we can use to fit these data. We can solve any subset of three of these relationships, simply by leaving one of the four equations out of the solution. If we do that we come up with the following table of results (we forbear to show all the computations here; however, we do recommend to our readers that they do one or two of these, for the practice):

                                            b0           b1          b2
Eliminating the first equation (7-2a):      −9.47843     10.39215    16.86274
Eliminating the second equation (7-2b):     −10.86455    10.15801    18.73589
Eliminating the third equation (7-2c):      −5.20039     4.1461      14.6100
Eliminating the fourth equation (7-2d):     −1.5777      0.78492     10.675

The last entry in this table, the results obtained from eliminating the fourth equation (7-2d), of course represents the results obtained from the original set of three equations, since eliminating that equation from the set leaves us with exactly that same original set. However, even though there does not seem to be much difference between the various equations 7-2a–7-2d, clearly the fitting equation depends very strongly upon which subset of these equations we choose to keep in our calculations. Thus we see that we cannot arbitrarily select any subset of the data to use in our computations; it is critical to keep all the data, to achieve the correct result, and that requires using the regression approach, as we discussed above. If we do that, then we find that the correct fitting equation is (again, this system of equations is simple enough to do for practice; the matrix inversion can be performed using the row operations as we described previously):

Regression results:

b0           b1          b2
−6.85120     6.15659     15.50951

Note, by the way, if you thought that the regression solution would simply be the average of all the other solutions, you were incorrect. With this chapter we will suspend our coverage of elementary matrix operations until a later chapter.
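The remark about averages can be checked numerically. The sketch below is an illustration only, with hypothetical helper and variable names: it solves the four three-equation subsets exactly, averages their coefficients, and compares that average with the least squares solution for all four relationships.

```python
def solve(matrix, rhs):
    """Solve a square system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [[float(v) for v in row] + [float(r)] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


# Rows are (1, A1, A2) for the four relationships; C holds the concentrations.
A = [[1.0, 0.75, 0.28], [1.0, 0.51, 0.485], [1.0, 0.32, 0.78], [1.0, 0.40, 0.79]]
C = [2.0, 4.0, 7.0, 8.0]

# Exact fits to each subset of three equations, then their simple average.
subset_fits = []
for left_out in range(4):
    keep = [i for i in range(4) if i != left_out]
    subset_fits.append(solve([A[i] for i in keep], [C[i] for i in keep]))
average = [sum(fit[k] for fit in subset_fits) / len(subset_fits) for k in range(3)]

# Least squares fit to all four relationships through the normal equations.
columns = list(zip(*A))
AtA = [[sum(x * y for x, y in zip(ci, cj)) for cj in columns] for ci in columns]
AtC = [sum(x * c for x, c in zip(ci, C)) for ci in columns]
regression = solve(AtA, AtC)

print("average of subset solutions:", average)
print("regression solution:        ", regression)
```

The two sets of coefficients turn out to be of similar size but are not equal, which is the point being made: least squares is not an average of the subset solutions.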

A WORD OF CAUTION

We have recently noticed a growing tendency for the chemical/spectroscopic community to draw the inference that the term “chemometrics” is virtually equivalent to “quantitative analysis algorithms”. This misconception seems to be due to the overwhelming concentration of interest in that aspect of the application of chemometric techniques. This perceived equivalency is, of course, incorrect. The purview of chemometrics is much wider than that single application area, and encompasses a wide variety of techniques, including algorithms not only for quantitative and qualitative chemical analysis, but also for analyzing, categorizing and generally dealing with data in a variety of ways (just look at the topic list included in the Analytical Chemistry reviews issue when Chemometrics is included).

We ourselves have to plead guilty to some extent to promoting this misconception. While discussing and explaining the underlying concepts, we have also inherently spent much time and attention on that single topic, in much the same way that many other authors do.


However, we do recognize and wish to caution our readers to recognize the fact that Chemometrics does in fact include this variety of methodologies alluded to above. We do, in fact, hope to eventually discuss these other concepts. Two items prevent us from just jumping in chin first, however. The first item is that there are, in fact, useful and important things that need to be said about the application of the quantitative analysis algorithms. The second item is the fact that while we are knowledgeable concerning some of the other areas of chemometric interest, we are not and could not possibly be experts in all such areas. We have discussed this between ourselves, and have decided that the only reasonable way to deal with this limitation is to entertain submissions from our readership. Anyone who has particular expertise in a topic that falls under the wider definition of “chemometrics” is welcome to submit one (or more) chapters dealing with that topic. We only request that you try to keep your discussions both simple and complete, using, as we say, only words of one syllable or less.

REFERENCE 1. Workman, J., Jr. and Mark, H., Spectroscopy 9(1), 16–19 (1994).


8

Experimental Designs: Part 1

The next several chapters will deal with the philosophy of experimental designs. Experimental design is at the very heart of the scientific method; without proper design, it is well-nigh impossible to glean high-quality information from the experimental data collected. No amount of sophisticated processing or chemometrics can create information not present within the data.

Every scientist has designed experiments. So what is there left for us to say about that topic that chemometrics/statistics can shed some light on? Well, quite a bit actually, since not all experiments are designed equally, but some are definitely more equal than others (to steal a paraphrase). Another way to say it is that every experiment is a designed experiment, but some designs are better than others.

In point of fact, statistics and chemometrics each have their own approach to how experiments should be designed, each with a view toward making experimental procedures “better” in some sense. There is a gradation between the two approaches; nevertheless, there is also somewhat of a distinction between what might be thought of as classical “statistical experimental design” and the more currently fashionable experimental designs considered from a chemometric point of view. These differences in approach reflect differences in the nature of the information to be obtained from each.

Experimental designs, and in particular “statistical” experimental designs, are used in order to achieve one or more of the following goals:

1) Increase efficiency of resource use, that is, obtain the desired information using the fewest possible necessary experiments (this is usually what is thought of when “statistical experimental designs” are considered). This aspect of experimentation is particularly important when the experiment is large to begin with, or if the experiment uses resources that are rare or expensive, or if the experiment is destructive, so that materials (especially expensive ones) are used up.

2) Determine which variables or phenomena (“factors” in statistical/chemometric parlance) in an experiment are the “important” ones. This has two aspects: first is an effect large enough that we can be sure it is real, and not due simply to noise (or error) alone (i.e., “statistically significant”). We have treated this question to some extent in our previous chapters, and the book from it (both titled “Statistics in Spectroscopy”). The second aspect is, if the effect of a factor is indeed real, is it of sufficiently large magnitude to be of practical importance? While the answer to this question is important to understanding the outcome of the experiment, it is not a statistical question, and we will give it fairly short shrift.


3) Accommodate noise and/or other random error.

4) Allow estimates to be made of the magnitude of the noise and/or other random error, if for no other reason than to have something to compare our results against, so as to tell whether they are statistically significant.

5) Allow estimates to be made of the sensitivity to variations in the several factors. This can help decide whether any of the variations seen are of practical importance. A good design also allows these estimates of sensitivity to be made against an error background that is reduced compared to the actual error. This is accomplished by causing the effects of the factors to be effectively “averaged”, thus reducing the effect of error by the square root of the number of items being averaged.

6) Optimize some characteristic of the experimental system.

To achieve these goals, certain requirements are imposed on the design and/or the data to be collected. The maximum amount of information can be obtained when:

1) The standard requirements for the behavior of the errors are met, that is, the errors associated with the various measurements are random, independent, normally (i.e., Gaussian) distributed, and are a random sample from a (hypothetical, perhaps) population of similar errors that have a mean of zero and a variance equal to some finite value of sigma-squared.

2) The design is balanced. This requirement is critical for certain types of designs and unimportant in others. Balance, in the sense used here, means that the values of a given experimental variable (factor) occur in combination with all of the values of every other factor. For example, common variables in chemical experimentation are temperature and pressure. For a balanced design, experiments should be carried out where the material is held at low temperature, and at both high and low pressure. Additionally, experiments should be carried out where the material is held at high temperature, and at both high and low pressure. If a third variable, such as concentration of a reactant, is to be studied, then high and low pressure and high and low temperature should coexist with both the high and the low concentrations.

The foregoing would seem to imply that a balanced experiment would require all possible combinations of conditions. While all-possible-combinations is certainly one way to achieve this balance, the advantage of “statistical” designs comes from the fact that clever ways have been devised to achieve balance while needing far fewer experiments than the all-possible-combinations approach would require.

As an illustration of this, let us consider the three aforementioned variables: temperature, pressure, and concentration of reactant. An all-possible-combinations design would require eight experiments, with the set of conditions in each experiment listed in Table 8-1 (where H and L represent the high and the low temperatures, pressures, etc.). However, to achieve balance, it is not necessary to carry out eight experiments; balance can be achieved with only four experiments with the conditions suitably set (Table 8-2). Check it out: high reactant concentration occurs in combination with each (high and low) temperature, and with each pressure; similarly for low reactant concentration.


Table 8-1 An all-possible-combinations design of three factors, needing eight experiments and sets of conditions

Experiment number   Temperature   Pressure   Concentration
1                   L             L          L
2                   L             L          H
3                   L             H          L
4                   L             H          H
5                   H             L          L
6                   H             L          H
7                   H             H          L
8                   H             H          H
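As an aside, the run list of an all-possible-combinations design such as Table 8-1 can be generated mechanically. The following is only a minimal sketch: the function name `full_factorial` is our own illustrative choice, and the level labels simply reuse the table's L/H shorthand.

```python
from itertools import product

# Factor names follow Table 8-1; "L" and "H" are the table's own shorthand
# for the low and high setting of each factor.
factors = ["Temperature", "Pressure", "Concentration"]
levels = ["L", "H"]


def full_factorial(n_factors, levels):
    """Return every combination of the given levels for n_factors factors."""
    return [tuple(run) for run in product(levels, repeat=n_factors)]


design = full_factorial(len(factors), levels)

# Print one line per experiment: number, then the level of each factor.
print("Experiment", *factors)
for number, run in enumerate(design, start=1):
    print(number, *run)
```

With three two-level factors this yields 2 × 2 × 2 = 8 runs, listed in the same order as the rows of Table 8-1; the run count grows as the product of the numbers of levels, which is why smaller balanced designs are attractive.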

Table 8-2 Balanced design for three factors, needing only four experiments

Experiment number   Temperature   Pressure   Concentration
1                   L             L          L
2                   L             H          H
3                   H             L          H
4                   H             H          L
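The balance of the four-run design in Table 8-2 can also be checked by brute force. The sketch below is an illustration under our own assumptions: `is_balanced` is a hypothetical helper, and it interprets "balance" as every pair of factors showing each combination of levels equally often.

```python
from itertools import combinations, product

# Runs of Table 8-2, written as (temperature, pressure, concentration).
half_fraction = [
    ("L", "L", "L"),
    ("L", "H", "H"),
    ("H", "L", "H"),
    ("H", "H", "L"),
]


def is_balanced(design, levels=("L", "H")):
    """True if, for every pair of factors, each combination of levels
    occurs at least once and all combinations occur equally often."""
    n_factors = len(design[0])
    for a, b in combinations(range(n_factors), 2):
        counts = {combo: 0 for combo in product(levels, repeat=2)}
        for run in design:
            counts[(run[a], run[b])] += 1
        if min(counts.values()) == 0 or len(set(counts.values())) != 1:
            return False
    return True


print(is_balanced(half_fraction))      # the four-run design
print(is_balanced(half_fraction[:3]))  # the same design with one run missing
```

The second call illustrates a point made later in this chapter: losing a single run from an efficient design destroys the balance that the whole analysis relies on.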

You will find the same situation for the other variables. This is not to say that there are no benefits to the larger experimental design, but we are making the point that balance can be achieved with the smaller one, and for those designs where balance is an important consideration, much work (and resources, and MONEY) can be saved.

Balance is not always achievable in practice due to physical constraints on the measurements that can be made. Certain designs do not require balance, and in fact to enforce balance would negate some of the benefits of the design. In particular, there are some designs where future experiments to be performed are determined by the results of the past experiments. To enforce balance here would require extra, unnecessary experimentation that did not contribute to the main goal of the whole venture.

The various designs that have been generated can be classified into one of several categories. One way to classify experimental designs is as follows:

1) Classical designs
2) Screening designs
3) Analytical designs
4) Optimization designs.

In one sense, it is possible to think of the categories involved as “building blocks” for designs, which can then be combined in various ways which depend upon the information that you want to obtain which, in turn, determines the nature of the data to collect. These general categories, by the way, are not mutually exclusive. It is even possible to consider some types of designs as extensions of others, or, vice versa, as subsets, or special cases of other types of designs. Some of these main categories are

A) Factorial designs
B) Fractional factorial designs
C) Nested designs
D) Blocked designs
E) Response surface designs.

The key to all “statistical experimental” designs is planning. A properly planned experiment can achieve all the goals set forth above, and in fewer runs than you might expect (that’s where achieving the goal of efficiency comes in). However, there are certain requirements that must be met:

The experiment must be executed according to the plan! All the planning in the world is for naught if carrying out the experiment results in blunders (e.g., even something as crude as dropping a key sample on the floor – and look at how often that has been done!). The statistical literature contains examples (unfortunately) where large experiments, that cost millions of dollars to perform, were completely ruined by carelessness on the part of the personnel actually carrying them out.

As noted above, the variations in the data representing the error must meet the usual conditions for statistical validity: they must be random and statistically independent, and it is highly desirable that they be homoscedastic and Normally distributed. The data should be a representative sampling of the populations that the experiment is supposed to explore.

Blunders must be eliminated, and all specified data must be collected. The efficiency of these experimental designs has another side effect: any missing or defective data has a disproportionate effect relative to the amount of information that can be extracted from the final data set. When simpler experimental designs are used, where each piece of data is collected for the sole purpose of determining the effect of one variable, loss of that piece of data results in the loss of only that one result. When the more efficient “statistical” experimental designs are used, each piece of data contributes to more than one of the final results, thus each one is used the equivalent of many times and any missing piece of data causes the loss of all the results that are dependent upon it.

These types of experimental designs also have some limitations.
The first is the exaggeration of the effect of missing or defective data on the results, as mentioned above. The second is the fact that until the entire plan is carried out, little or no information can be obtained. There are generally few, if any, “intermediate results”; only after all the data is available can any results at all be calculated, and then all of them are calculated at once. This phenomenon is related to the first caveat: until each piece of data is collected, it is “missing” from the experiment, and therefore the results that depend upon it cannot be calculated.

The simplest possible experimental design would almost not be recognized as an “experimental design” at all, but does serve as a prototype situation (as we like to use for pedagogical purposes). The situation arises when there is one variable (factor) to investigate, and the question is, does this factor have an effect on the property studied? We have introduced this situation earlier, in our discussion of hypothesis testing, as in our previous Statistics in Spectroscopy book [1–3]. We will discuss how we treated this situation previously, then change our point of view to see how we would do it from the point of view of an “experimental design”.

REFERENCES

1. H. Mark, and J. Workman, “Statistics in Spectroscopy; Elementary Matrix Algebra and Multiple Linear Regression: Conclusion”, Spectroscopy 9(5), 22–23 (June, 1994).
2. H. Mark, and J. Workman, “Statistics in Spectroscopy”, Spectroscopy 4(7), 53–54 (1989).
3. H. Mark, and J. Workman, Statistics in Spectroscopy (Academic Press, Boston, 1991), chapter 18.


9

Experimental Designs: Part 2

As we mentioned in the last chapter, experimental design in scientific investigations often takes a form in which some of the experimental objects have been exposed to one level of the variable, while others have not been so exposed. Oftentimes this situation is called the “experimental subject” versus the “control subject” type of experiment. In the face of experimental error, or other source of variability of the readings, both the “experimental” and the “control” readings would be taken multiple times. That provides the information about the “natural” variability of the system against which the difference between the two can be compared. Then, a t-test is used to see if the difference between the “experimental” and the “control” subjects is greater than can be accounted for by the inherent variability of the system. If it is, we conclude that the difference is “statistically significant”, and that there is a real effect due to the “treatment” applied to the experimental subject. Of course there are variations on this theme: the difference between the “experimental” and the “control” subjects can be due to different amounts of something applied to the two types of object, for example.

That is how we have treated this type of experiment previously. We will now consider a somewhat different way to formulate the same experiment; the purpose being to be able to set up the experimental design, and the analysis of the data, in such a way that it can be generalized to more complicated types of experiments. In order to do this, we recognize that the value of any individual reading, whether from the experimental subject or the control subject, can be expressed as the sum of three quantities. These three quantities arise from a careful consideration of the nature of the data.
Given that a particular measurement belongs either to the experimental group or to the control group, then the value of the data collected can be expressed as the sum of these three quantities:

1) The grand mean of all the data (experimental + control)
2) The difference between the mean of the data group (experimental or control) and the grand mean of the data
3) The difference between the individual reading and the mean reading of its pertinent group.

This can then be expressed mathematically as:

$$X_{ij} = \bar{X} + (\bar{X}_i - \bar{X}) + (X_{ij} - \bar{X}_i) \tag{9-1}$$


where:

$X_{ij}$ represents each individual datum.
$\bar{X}_i$ represents the mean of the particular data group (experimental or control) that the individual datum belongs to.
$\bar{X}$ represents the grand mean of all the data (from both groups).

By rearranging equation 9-1, we can also express it as follows, wherein the fact that it is a mathematical identity becomes apparent:

$$X_{ij} = (\bar{X} - \bar{X}) + (\bar{X}_i - \bar{X}_i) + X_{ij} \tag{9-2}$$

We have previously shown that through the operation called “partitioning the sums of squares”, the following equality holds [1]:

$$\sum_i X_i^2 = \sum_i \bar{X}^2 + \sum_i (X_i - \bar{X})^2 \tag{9-3}$$

Note that what we call the grand mean here is simply called the mean in the prior discussion. That is because in the prior discussion there was no further splitting of the data into subgroups. In the current discussion we have indeed split the data into subgroups; and we note that what was previously the total difference from the mean now consists of two contributions: the difference of each subgroup’s mean from the grand mean, and the difference of each datum’s value from its subgroup’s mean. We might expect, and it turns out to be so (again we leave the proof as an “exercise for the reader”), that the sum of squares of the differences of each datum’s value from the grand mean can also be partitioned; thus:

$$\sum_i \sum_j X_{ij}^2 = \sum_i \sum_j \bar{X}^2 + \sum_i \sum_j (\bar{X}_i - \bar{X})^2 + \sum_i \sum_j (X_{ij} - \bar{X}_i)^2 \tag{9-4}$$

We had previously discussed the situation (from a slightly different point of view) where more than two subgroups of data existed. In that case we noted that we could generate two estimates of sigma, the within-group standard deviation. One estimate is calculated from the pooled within-group standard deviation. The other is calculated from the standard deviation between the means of the various subgroups. This quantity, you recall, is equal to the within-group standard deviation divided by the square root of n, the number of data used in the calculation of each subgroup’s mean. However, the second calculation is correct only if the differences between the means are due to the random variations of the data itself, and there are no external influences.
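Before going further, the partition in equation 9-4 and the two estimates of sigma just described can be checked numerically. The sketch below uses made-up readings for two groups (they are not data from this chapter), and the variable names are our own; it assumes equal group sizes.

```python
from statistics import mean

# Hypothetical readings for two groups (e.g., "control" and "experimental").
groups = [[1.0, 2.0, 3.0], [5.0, 6.0, 7.0]]

all_data = [x for g in groups for x in g]
grand_mean = mean(all_data)
group_means = [mean(g) for g in groups]

# Left-hand side of equation 9-4: sum of squares of the raw data.
ss_total = sum(x ** 2 for x in all_data)

# The three terms on the right-hand side of equation 9-4.
ss_grand = len(all_data) * grand_mean ** 2
ss_between = sum(len(g) * (m - grand_mean) ** 2
                 for g, m in zip(groups, group_means))
ss_within = sum((x - m) ** 2
                for g, m in zip(groups, group_means) for x in g)

print(ss_total, ss_grand + ss_between + ss_within)

# Two estimates of sigma (equal group sizes assumed).
df_within = len(all_data) - len(groups)
df_between = len(groups) - 1
sigma_within = (ss_within / df_within) ** 0.5     # pooled within-group SD
sigma_between = (ss_between / df_between) ** 0.5  # SD of the means times sqrt(n)

print(sigma_within, sigma_between)
```

For these invented numbers the two sides of the partition agree, while the between-group estimate of sigma is several times the within-group estimate, which is the signature of an influence acting on one group but not the other.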
If such influences exist, then the second calculation (from the between-group means) will estimate a larger value for sigma than the first calculation (the pooled within-group standard deviations). This was then used as the basis of a statistical hypothesis test: if the value of sigma calculated from the between-groups means is statistically significantly larger than the value of sigma calculated from within the groups, then we have evidence to conclude that there are indeed external influences acting upon the data, and we used an F-test to determine whether there was more scatter between the means than could be accounted for by the random variations within the subgroups.

In the case at hand, with only two subgroups, we can proceed the same way. The difference is that now, with only two subgroups, there is only one degree of freedom available for the difference between the subgroups. No matter; an F-test with one degree of freedom is possible. Thus, to analyze the data from the model of equation 9-4, we calculate the mean square between the subgroups and the mean square within the subgroups, and perform an F-test (rather than a t-test as before) between these two mean squares. We would recommend doing it formally, with an ANOVA table, but this is the basic calculation. The conclusions drawn will be identical to those drawn by use of the t-test. Check it out: the tabled value of F for one and n degrees of freedom is equal to the square of the value of t for n degrees of freedom.

We might also note here, almost parenthetically, that if the hypothesis test gives a statistically significant result, it would be valid to calculate the sensitivity of the result to the difference between the two groups (i.e., divide the difference in the means of the two groups by the difference in the values of the variable that correspond to the “experimental” and “control” groups).

As an example of using an experimental design together with its associated analysis of variance to obtain a meaningful result, we have here an example based on some real data that we have collected. The problem was interesting: to troubleshoot a method of (wet) chemical analysis. A large quantity of sample was available, and had been well-ground and mixed. Suitable data was collected to permit performing a straightforward one-way analysis of variance.

To start with, 5 g of sample was dissolved in 100 ml of water, and 20 repeat analyses were performed. The resulting values are shown in Table 9-1. The entry in the third row, second column was noted to have been measured under abnormal conditions. Since an assignable cause for this discrepant value was available, the reading was discarded. The statistics for the remaining data were Mean = 5.01, SD = 0.327. This value for the standard deviation was accepted as the best available approximation to the population value for σ.

The next step was to take several different aliquots from a large sample (a different sample than used previously) and collect multiple readings from each of them. Six aliquots were placed in six flasks, and six repeat measurements were made on each of these six flasks. Each aliquot consisted of 10 g of test sample/100 ml water. The results are shown in Table 9-2. The value for the pooled within-flask standard deviation, while somewhat higher than for the twenty repeat readings, is not so high as to be worrisome. Strictly speaking, we should have done an F-test between the variances from the two sets of results to see if there is any extra variance there, but we will ignore that question for now, because the important point here is the highly statistically significant value of the “between” flasks standard deviation, indicating some extra source of variation was superimposed on the analytical value.

Table 9-1 Results from 20 repeat readings of 5 g of sample dissolved in 100 ml water

5.12   5.60   5.18   4.71
5.28   5.14   4.74   4.72
4.97   3.85   5.39   4.94
5.20   4.69   4.49   4.91
4.50   5.12   5.61   4.99
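The summary statistics quoted for Table 9-1 can be reproduced from the tabulated readings. This is a minimal sketch using only the Python standard library; the row-major ordering of the list is our reading of the table layout (five rows of four columns).

```python
from statistics import mean, stdev

# Table 9-1, read row by row (5 rows x 4 columns).
readings = [
    5.12, 5.60, 5.18, 4.71,
    5.28, 5.14, 4.74, 4.72,
    4.97, 3.85, 5.39, 4.94,
    5.20, 4.69, 4.49, 4.91,
    4.50, 5.12, 5.61, 4.99,
]

# Third row, second column: the reading taken under abnormal conditions.
bad_index = 2 * 4 + 1
discarded = readings[bad_index]
kept = readings[:bad_index] + readings[bad_index + 1:]

print("discarded:", discarded)
print("mean:", round(mean(kept), 3))
print("SD:  ", round(stdev(kept), 3))   # sample SD, n - 1 in the denominator
```

The nineteen retained readings give a mean and standard deviation that agree with the quoted values to within rounding.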


Table 9-2 Results of repeat readings of six aliquots in six flasks (from 10-g samples)

Flask #:      1       2       3       4       5       6
            7.25   10.07    5.96    7.10    5.74    4.74
            7.68    9.02    6.66    6.10    6.90    6.75
            7.76    9.51    5.87    6.27    6.29    6.71
            8.10   10.64    6.95    5.99    6.37    6.51
            7.50   10.27    6.54    6.32    5.99    5.95
            7.58    9.64    6.29    5.54    6.58    6.50
Means:      7.64    9.85    6.37    6.22    6.31    6.19
SDs:        0.28    0.58    0.42    0.51    0.41    0.77

Pooled SD = 0.52, “Between” SD = 1.46
Expected “Between” SD = 0.212
F = 47
F (crit) = F (0.95, 5, 30) = 2.53
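The quantities listed under Table 9-2 follow directly from the tabulated readings. The sketch below recomputes them with the standard library; the variable names are ours, and the critical value is simply copied from the table rather than computed.

```python
from statistics import mean, variance

# Table 9-2: one inner list per flask, six repeat readings each.
flasks = [
    [7.25, 7.68, 7.76, 8.10, 7.50, 7.58],
    [10.07, 9.02, 9.51, 10.64, 10.27, 9.64],
    [5.96, 6.66, 5.87, 6.95, 6.54, 6.29],
    [7.10, 6.10, 6.27, 5.99, 6.32, 5.54],
    [5.74, 6.90, 6.29, 6.37, 5.99, 6.58],
    [4.74, 6.75, 6.71, 6.51, 5.95, 6.50],
]

n = len(flasks[0])                       # readings per flask
flask_means = [mean(f) for f in flasks]

# Pooled within-flask SD: root of the average within-flask variance
# (valid as written because every flask has the same number of readings).
pooled_sd = mean([variance(f) for f in flasks]) ** 0.5

# Observed SD of the flask means, and the value expected if the means
# differed only because of the within-flask random variation.
between_sd = variance(flask_means) ** 0.5
expected_between_sd = pooled_sd / n ** 0.5

f_ratio = (between_sd / expected_between_sd) ** 2
f_crit = 2.53                            # F(0.95, 5, 30), taken from Table 9-2

print(round(pooled_sd, 2), round(between_sd, 2), round(expected_between_sd, 3))
print(round(f_ratio, 1), f_ratio > f_crit)
```

The F ratio is the squared ratio of the observed to the expected “between” standard deviation, which is the same as the ratio of the between-flask mean square to the within-flask mean square in a formal ANOVA table.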

Having found a statistically significant “between” flasks standard deviation, the next step was to formulate hypotheses as to the possible physical causes of this situation. The list we arrived at was the following:

• Inhomogeneous sample
• Drift between sets of readings
• Sampling error
• Something else.

The first physical cause considered was the possibility of an inhomogeneous sample. To eliminate this as a possibility, the sample was ground before aliquots were taken. The sample size was still 10 g of sample per 100 ml of water. In this case, however, time constraints permitted only three replicate readings per flask. The results are shown in Table 9-3. We note that there is still a much larger difference between the different flasks’ readings than can be accounted for by the within-flask repeatability.

Table 9-3 Results of repeat readings of six aliquots in six flasks (from 10-g samples ground)

            6.57    5.06    8.07    4.93    4.78    6.23
            6.27    6.27    7.82    5.64    5.50    7.37
            6.35    5.88    8.52    5.19    5.99    5.27
Means:      6.39    5.74    8.19    5.25    5.43    7.29
SDs:        0.16    0.61    0.35    0.36    0.61    1.01

Pooled SD = 0.58, “Between” SD = 1.14
Expected “Between” SD = 0.33
F = 11.3
F (crit) = F (0.95, 5, 12) = 3.10

Therefore we press onward to consider another possible cause of the variation; in this case we consider the possibility of inhomogeneity of the sample, at a scale not affected by grinding. For example, the sample might contain small specks of material that are too small to be ground further, but which are large enough to measurably affect the analysis. In this case, the expected distribution of the sampling variation of such particles would be the Poisson distribution [2]. In such a case, if we take a larger sample, we would expect the standard deviation to decrease in inverse proportion to the square root of the sample size. Thus, if we take samples ten times larger than previously, the standard deviation of the “between” readings should become approximately one-third of the previous value. Therefore, for the next test, 100 g samples each were dissolved in 1 liter of water. The results are shown in Table 9-4. Note that the “between” standard deviation is almost identical to the previous value; we conclude that inhomogeneity of the sample is not the problem.

Table 9-4 Results from using 10 × larger (100-gram) samples

            8.29    8.61   10.04    8.86
            8.12    8.72   11.67    9.02
            8.72    8.42   11.38    9.29
            8.54    8.76   10.19    8.63
Means:      8.42    8.63   10.82    8.94
SDs:        0.26    0.15    0.82    0.26

Pooled SD = 0.46, “Between” SD = 1.10
Expected “Between” SD = 0.23
F = 23
F (crit) = F (0.95, 3, 12) = 3.49

The possibility of drift between sets of readings was ruled out by virtue of the fact that many of the steps of the analytical procedure were done simultaneously on the several readings of the different aliquots. The possibility of drift between readings was ruled out by repeating the readings in different orders; the same values were obtained regardless of the order of reading.

This left “something else” as the possible cause of the variability. When we considered the nature of the test, which was sensitive to parts per million of organic materials, we realized that one possibility was contamination of the glassware by the soap used to clean it. We next cleaned all glassware with chromic acid cleaning solution, and reran the tests, with the result as shown in Table 9-5. Removal of the extraneous source of variability did indeed reduce the “between-flasks” variance to a level that is now explainable (in the statistical sense) by the underlying random variations attributable to the within-flask variability.

Table 9-5 Results after cleaning glassware with chromic acid

            4.65    5.98    5.19    4.97    4.62    3.93
            5.03    4.61    3.96    4.43    4.94    4.60
            4.38    4.49    4.92    4.79    3.37    5.95
Means:      4.68    5.16    4.69    4.73    4.31    4.84
SDs:        0.33    0.73    0.64    0.27    0.83    1.03

Pooled SD = 0.69, “Between” SD = 0.27
Expected “Between” SD = 0.39
F = 0.47
F (crit) = F (0.95, 5, 12) = 3.10
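The square-root argument used above to rule out particulate inhomogeneity can be made concrete with a little arithmetic. The sketch below takes the “between” standard deviations quoted under Tables 9-3 and 9-4 and assumes, purely for illustration, that all of the between-flask scatter in the 10-g samples came from Poisson-distributed particles.

```python
import math

between_sd_10g = 1.14     # "Between" SD quoted for the 10-g samples (Table 9-3)
between_sd_100g = 1.10    # "Between" SD quoted for the 100-g samples (Table 9-4)
size_ratio = 100 / 10     # the samples were made ten times larger

# Under the Poisson assumption the SD should shrink by sqrt(size_ratio),
# i.e. to roughly one-third of its previous value.
predicted_sd_100g = between_sd_10g / math.sqrt(size_ratio)

print("predicted:", round(predicted_sd_100g, 2))
print("observed: ", between_sd_100g)
```

The observed value is about three times the prediction and essentially unchanged from the 10-g result, which is why the particle hypothesis was dropped.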


Table 9-6 Types of experimental designs

Number of     Number of factors
levels        Single                                  Multiple
Two           Experimental versus control subjects    One-at-a-time designs
                                                      Factorial designs
                                                      Fractional factorial designs
                                                      Nested designs
                                                      Special designs
Multiple      Sensitivity testing                     Response surface designs
              Simple regression                       Multiple regression

End of example

From the prototype experiment, we can generate many variations of the basic scheme. The two main ways that the model shown in equation 9-4 can be varied are to increase the number of factors and to increase the number of levels of each factor. A given factor must have at least two levels (even if one of the levels is an implied zero), and may have any number greater than two. Table 9-6 lists the types of designs that fall into each of these categories.

The types of designs used by scientists in simple settings, not usually considered “statistical” designs, are the “experimental versus control” designs (discussed above), the one-at-a-time designs (where each factor is individually changed from its “control” value to its “experimental” value, then restored when the next factor is changed), and the simple regression (often used in calibration work when only one physical variable is affected – in chemistry, electrochemical and chromatographic applications come to mind). One-at-a-time designs are the usual “non-statistical” type of experiments that are often carried out by scientists in all disciplines.

The table is not exhaustive, although it does include a majority of experimental designs that are used. Not included explicitly, however, are experimental designs that are generated from combinations of listed items. For example, a multi-factor experiment may have several levels of some of the factors but only two levels of other factors. Also, due to the nature of the physical factors involved, the values of some of the factors may not be under the experimenter’s control. Thus, some factors may be nested, while others may not be.

REFERENCES

1. Mark, H. and Workman, J., Statistics in Spectroscopy (Academic Press, Boston, 1991), pp. 80–81.
2. Mark, H. and Workman, J., Spectroscopy 5(3), 55–56 (1991).

10

Experimental Designs: Part 3

We continue with this chapter specifically dealing with experimental design issues. When we leave the realm of the simplest designs, we find that the experiments, and the analyses of the data therefrom, acquire characteristics not existing in the simpler designs, and beyond obvious extensions of them. For example, consider a two-factor design with each factor at two levels. This is also a form of all-possible-combinations experiment.

One item we note here is that there is more than one way to describe the form of an experiment, and we include a short digression here to explicate this multiplicity of ways of describing an experiment. In this particular case, we have two factors, each at two levels. We can describe it as a listing of values corresponding to each experiment (Table 10-1). Alternatively, we can describe it as a table whose body gives the experiment number that corresponds to each combination of factor levels (Table 10-2).

Whichever way we choose to describe the design, it (and the others of this type) has some attractive features. We will illustrate these features with a numerical example. For our example, we will imagine an experiment where the scientist is interested in determining the influence of solvent and of catalyst on the yield of a chemical reaction. The questions to be answered are: does the solvent make a difference, and does the type of catalyst make a difference? The experiment is to consist of trying each of the four available catalysts with each of three solvents, and determining the yield. The experiment can be described by Table 10-3. In a more complicated case, where one of the factors was a physical variable such as temperature, which can be assigned meaningful physical values, and the sensitivity of the yield to temperature was of concern, we would then need to maintain (or control) the information regarding the actual temperatures.

For our first look at this experiment we will examine the behavior of the experiment under two sets of conditions. The first scenario gives a set of conditions with the results obtained under the following assumptions:

1) There is no influence of solvent
2) None of the catalysts has an effect
3) There are no random influences on the experiment.

The second scenario has similar conditions, but with one change:

1) There is no influence of solvent
2) One of the four catalysts has an effect
3) There are no random influences on the experiment.


Table 10-1 All-possible-combinations experiment organized as a list of values

Experiment number   Factor #1   Factor #2
1                   L           L
2                   L           H
3                   H           L
4                   H           H

Table 10-2 All-possible-combinations experiment organized as a table where the body of the table contains the experiment number corresponding to each set of experimental conditions

                 Factor #1
                 L     H
Factor #2   L    1     3
            H    2     4
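The two descriptions in Tables 10-1 and 10-2 carry the same information, and one can be derived from the other. The following sketch, with data structures of our own choosing, starts from the list form and looks up the experiment number for each combination of levels.

```python
# Table 10-1 as a list: experiment number -> (Factor #1 level, Factor #2 level).
runs = {
    1: ("L", "L"),
    2: ("L", "H"),
    3: ("H", "L"),
    4: ("H", "H"),
}

# Invert the mapping: (Factor #1 level, Factor #2 level) -> experiment number.
grid = {levels: number for number, levels in runs.items()}

# Print in the layout of Table 10-2: columns are Factor #1, rows are Factor #2.
print(" ", "L", "H")
for f2 in ("L", "H"):
    row = [grid[(f1, f2)] for f1 in ("L", "H")]
    print(f2, *row)
```

Because every combination of levels appears exactly once, the inversion loses nothing, which is the sense in which the two tables are equivalent.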

Table 10-3 Conditions for the experiment consisting of determining the yield of a chemical reaction with different solvents and catalysts (the body of the table contains the experiment number)

Catalyst number   Solvent #1   Solvent #2   Solvent #3
1                 1            2            3
2                 4            5            6
3                 7            8            9
4                 10           11           12

In both experiments, Conditions 1 and 2 together mean that all results from the experi ment will be the same in the first scenario, and all results except the ones corresponding to the “effective” catalyst will be the same; while that one will differ. Condition 3 means that we do not need to use any statistical or chemometric considerations to help explain the results. However, for pedagogical purposes we will examine this experiment as though random error were present, in order to be able to compare the analyses we obtain in the presence and in the absence of random effects. The data from these two scenarios might look like that shown in Table 10-4. For each scenario, the statistical analysis of this type of experimental design would be a two-way analysis of variance. This is predicated on the construction of the experiment, which includes some implicit assumptions. These assumptions are 1) The influence of the factors changing between the rows is independent of the influence of the factors changing between the columns.

Experimental Designs: Part 3

65

Table 10-4 Hypothetical data under two different scenarios, for the experiment examining the effect of temperature and catalyst on yield; with no random variations affecting the data Catalyst number

1 2 3 4

First scenario

Second scenario

Solvent number

Solvent number

1

2

3

1

2

3

25 25 25 25

25 25 25 25

25 25 25 25

25 25 35 25

25 25 35 25

25 25 35 25

2) The influence of the factors changing between the columns is independent of the influence of the factors changing between the rows.
3) Any error (in these first two scenarios assumed zero) is random, has a mean value of zero, and is Normally distributed.

If these assumptions hold, then each quantity in the data table can be expressed as the sum of the following four factors:

1) The grand mean of all the data
2) The influence of the value of the factor corresponding to each row
3) The influence of the value of the factor corresponding to each column
4) The variation superimposed by any random phenomena affecting the data.
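The four-part breakdown just listed can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' own code: the function name `two_way_decomposition` and the row/column layout (rows are catalysts, columns are solvents) are assumptions made here, and the data are the second scenario of Table 10-4.

```python
def two_way_decomposition(table):
    """Split a rows-by-columns table into grand mean, row effects,
    column effects, and residuals (hypothetical helper for illustration)."""
    n_rows, n_cols = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n_rows * n_cols)
    # Row and column effects are differences of the means from the grand mean.
    row_eff = [sum(row) / n_cols - grand for row in table]
    col_eff = [sum(table[i][j] for i in range(n_rows)) / n_rows - grand
               for j in range(n_cols)]
    # Whatever is left after removing the three systematic parts is the residual.
    resid = [[table[i][j] - grand - row_eff[i] - col_eff[j]
              for j in range(n_cols)] for i in range(n_rows)]
    return grand, row_eff, col_eff, resid

# Second scenario of Table 10-4: rows are catalysts 1-4, columns are solvents 1-3.
second_scenario = [[25, 25, 25], [25, 25, 25], [35, 35, 35], [25, 25, 25]]
grand, row_eff, col_eff, resid = two_way_decomposition(second_scenario)
print(grand)    # 27.5
print(row_eff)  # [-2.5, -2.5, 7.5, -2.5]
print(col_eff)  # [0.0, 0.0, 0.0]
print(resid)    # every entry is 0.0 for errorless data
```

Adding the grand mean, the row effect, the column effect, and the residual for any cell reproduces the original datum, which is the sense in which the data are the "sum of four factors".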

This being the case, quantities computed for a two-way analysis of variance are the following:

1) The grand mean of all the data
2) The mean of each row, and the difference of each row mean from the grand mean (this estimates the influence of the values of the factor corresponding to the rows)
3) The mean of each column, and the difference of each column mean from the grand mean (this estimates the influence of the values of the factor corresponding to the columns)
4) Any difference between the actual data and the corresponding values calculated from the grand mean and the influences of the row and column factors (this estimates the error variability).

In Table 10-5, we present the standard representation of this breakdown of the data. There are two important points to note about the results in this table: first, the data, shown in the body of the table in Part A, is in fact equal to the sum of the following quantities:

1) the grand mean (shown in Part A)
2) + row differences from the grand mean (shown in Part B)


Table 10-5 Part A – ANOVA for the errorless data from Table 10-4

First scenario
Catalyst number    Solvent 1    Solvent 2    Solvent 3    Row means
       1              25           25           25           25
       2              25           25           25           25
       3              25           25           25           25
       4              25           25           25           25
Col. means:           25           25           25           25*

Second scenario
Catalyst number    Solvent 1    Solvent 2    Solvent 3    Row means
       1              25           25           25           25
       2              25           25           25           25
       3              35           35           35           35
       4              25           25           25           25
Col. means:           27.5         27.5         27.5         27.5*

* Grand mean

Table 10-5 Part B – RESIDUALS for ANOVA from Table 10-4 after correcting for row and column means

First scenario
Catalyst number                Solvent 1    Solvent 2    Solvent 3    Row diffs
       1                           0            0            0            0
       2                           0            0            0            0
       3                           0            0            0            0
       4                           0            0            0            0
Mean diff. from grand mean:        0            0            0

Second scenario
Catalyst number                Solvent 1    Solvent 2    Solvent 3    Row diffs
       1                           0            0            0          −2.5
       2                           0            0            0          −2.5
       3                           0            0            0           7.5
       4                           0            0            0          −2.5
Mean diff. from grand mean:        0            0            0

3) + column differences from the grand mean (shown in Part B)
4) + residuals (shown in the body of Part B).

The second point is that the residuals, representing the error portion of the data, are all zero; the data is accounted for entirely by the systematic variations due to the variations between the rows and the variations between the columns (of course, the column differences happen to be zero in this data).

Now the really interesting stuff happens when we do in fact have error in the data. Let us look at what happens to these two scenarios when there is a small amount of random error variability superimposed on the data. Now the experimental conditions for the two scenarios are as follows:

Scenario #3:
1) There is no influence of solvent
2) None of the catalysts has an effect
3) There is random error superimposed on the experiment.


Scenario #4:
1) There is no influence of solvent
2) One of the catalysts has an effect
3) The same random error exists as in Scenario #3.

For these two situations, let us suppose each error has the value shown in Table 10-6 for the corresponding datum. The values in Table 10-6 were selected randomly, and have a mean of zero and a standard deviation of unity. When these error values are superimposed on the data, we arrive at Table 10-7. When we subject this data to the same ANOVA calculations as the errorless data, we arrive at the results in Table 10-8.

It is instructive to compare the values in these tables with the corresponding values in the ANOVA tables for the errorless data. In particular, note that in the table corresponding to Scenario 3, even though there are no underlying systematic variations in the data, both the row and the column means are perturbed by the random variations superimposed on the data. How then can we differentiate these differences from the ones due to real systematic variations such as are present in Scenario 4? The answer, of course, is to do a statistical hypothesis test, but as it stands, we do not seem to have enough information available for such a test. We can compute variances between rows and also between columns, in order to have the mean squares for the corresponding differences, but what are we going to compare these mean squares to? In particular, what are we going to use

Table 10-6 For Scenarios 3 and 4 each error has the following value for the corresponding datum

−0.3583     0.8416     0.5416
−0.9583    −1.2583     1.4416
 0.0416    −1.3583     1.4416
−1.0583     0.4416     0.2416

Table 10-7 Hypothetical data under two different scenarios, for the experiment examining the effect of solvent and catalyst on yield; random variations (from Table 10-6) have zero mean and unity standard deviation

Third scenario
Catalyst number    Solvent 1    Solvent 2    Solvent 3
       1            25.8416      24.6416      25.5416
       2            23.7416      24.0416      26.4416
       3            23.6416      25.0416      26.4416
       4            25.4416      23.9416      25.2416

Fourth scenario
Catalyst number    Solvent 1    Solvent 2    Solvent 3
       1            25.8416      24.6416      25.5416
       2            23.7416      24.0416      26.4416
       3            33.6416      35.0416      36.4416
       4            25.4416      23.9416      25.2416


Table 10-8 Part A – DATA: ANOVA for the hypothetical data containing error with mean equal to 0 and standard deviation (S) equal to unity

Third scenario
Catalyst number    Solvent 1    Solvent 2    Solvent 3    Row means
       1            25.8416      24.6416      25.5416      25.3416
       2            23.7416      24.0416      26.4416      24.7416
       3            23.6416      25.0416      26.4416      25.0416
       4            25.4416      23.9416      25.2416      24.875
Col. means:         24.6666      24.4166      25.9166      25*

Fourth scenario
Catalyst number    Solvent 1    Solvent 2    Solvent 3    Row means
       1            25.8416      24.6416      25.5416      25.3416
       2            23.7416      24.0416      26.4416      24.7416
       3            33.6416      35.0416      36.4416      35.0416
       4            25.4416      23.9416      25.2416      24.875
Col. means:         27.1666      26.9166      28.4166      27.5*

* Grand mean


Table 10-8 Part B – RESIDUALS for the hypothetical data containing error with mean equal to 0 and standard deviation (S) equal to unity

Third scenario
Catalyst number                Solvent 1    Solvent 2    Solvent 3    Row diff. from grand mean
       1                         0.8333      −0.1166      −0.7166             0.3416
       2                        −0.6666      −0.1166       0.7833            −0.2583
       3                        −1.0666       0.5833       0.4833             0.0416
       4                         0.9         −0.35        −0.55              −0.125
Col. diff. from grand mean:     −0.3333      −0.5833       0.9166

Fourth scenario
Catalyst number                Solvent 1    Solvent 2    Solvent 3    Row diff. from grand mean
       1                         0.8333      −0.1166      −0.7166            −2.1583
       2                        −0.6666      −0.1166       0.7833            −2.7583
       3                        −1.0666       0.5833       0.4833             7.5416
       4                         0.9         −0.35        −0.55              −2.625
Col. diff. from grand mean:     −0.3333      −0.5833       0.9166


to represent the error, to see if the row mean squares or the column mean squares are larger than can be accounted for by the error of the data? The answer to this question is in the residuals. While the residuals might not seem to bear any relationship to either the original data or the errors (which in this case we know because we created them and they are listed above), in fact the residuals contain the variance present in the errors of the original data. However, the value of the error sum of squares is reduced from that of the original data, because some fraction of the error variation was removed from the total when the row and column means were subtracted from the data itself. This reduction in the sum of squares can be compensated for by making a corresponding adjustment in the degrees of freedom used to calculate the mean square from the sum of squares.

In this data the sum of squares of the residuals is 5.24 (check it out). The number of degrees of freedom in these residuals is calculated by starting with the total (which is twelve, one from each piece of data in the experiment) and subtracting one degree of freedom for each independent quantity calculated from and subtracted from the data. What are these? Well, there is one grand mean, four row means, and three column means; but the row differences from the grand mean must sum to zero, as must the column differences, so only r − 1 = 3 row differences and c − 1 = 2 column differences are independent. The number of degrees of freedom lost is therefore 1 + (r − 1) + (c − 1) = 1 + 3 + 2 = 6. Thus there is a loss of six degrees of freedom from the twelve, leaving (r − 1)(c − 1) = (4 − 1)(3 − 1) = 6 for the residuals. The mean square for the residuals is thus 5.24/6, or 0.873, and as a check, the square root of that value, 0.934, is an estimate of the error (which we know is unity).
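The "check it out" invitation above can be followed with a short Python sketch. The function name `residual_mean_square` is hypothetical, and the data layout (rows are catalysts, columns are solvents) is an assumption made for this illustration; the numbers are the third-scenario data of Table 10-7.

```python
def residual_mean_square(table):
    """Return (residual sum of squares, degrees of freedom, mean square)
    for a two-way layout with one observation per cell."""
    n_rows, n_cols = len(table), len(table[0])
    grand = sum(map(sum, table)) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in table]
    col_means = [sum(row[j] for row in table) / n_rows for j in range(n_cols)]
    # Residual = datum - row mean - column mean + grand mean.
    ss = sum((table[i][j] - row_means[i] - col_means[j] + grand) ** 2
             for i in range(n_rows) for j in range(n_cols))
    dof = (n_rows - 1) * (n_cols - 1)
    return ss, dof, ss / dof

# Third scenario of Table 10-7: rows are catalysts 1-4, columns are solvents 1-3.
third_scenario = [
    [25.8416, 24.6416, 25.5416],
    [23.7416, 24.0416, 26.4416],
    [23.6416, 25.0416, 26.4416],
    [25.4416, 23.9416, 25.2416],
]
ss, dof, ms = residual_mean_square(third_scenario)
print(round(ss, 2), dof, round(ms, 3))  # 5.24 6 0.873
print(ms ** 0.5)                        # estimate of the error standard deviation
```

Because the catalyst effect in the fourth scenario is a pure row effect, running the same function on the fourth-scenario data should give the same residual sum of squares.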

11 Analytic Geometry: Part 1 – The Basics in Two and Three Dimensions

Analytic geometry is a branch of mathematics in which geometry is described through the use of algebra. Rene Descartes (1596–1650) is credited for conceptualizing this mathematical discipline. Recalling the basics, we can express the points of a plane as a pair of numbers with x-axis and y-axis coordinates, designated by (x, y). Note that the x-axis coordinate is termed the “abscissa”, and the y-axis the “ordinate”.

THE DISTANCE FORMULA

In two dimensions (x and y), the distance between two points (x1, y1) and (x2, y2) in two-dimensional space (as shown in Figure 11-1) is given by the Pythagorean theorem as

$$D^2 = (x_2 - x_1)^2 + (y_2 - y_1)^2 = |x_2 - x_1|^2 + |y_2 - y_1|^2 \tag{11-1}$$

and

$$D = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \tag{11-2}$$

Note: This relationship holds even when x1 or y1 or both are negative (also shown in Figure 11-1).

In three dimensions (x, y, z), we describe three lines at right angles to one another, designated as the x, y, z axes. Three planes are represented as xy, yz, and zx, and the distance between two points (x1, y1, z1) and (x2, y2, z2) is given by

$$D^2 = (x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2 = |x_2 - x_1|^2 + |y_2 - y_1|^2 + |z_2 - z_1|^2 \tag{11-3}$$

and

$$D = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2} \tag{11-4}$$
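As a quick illustration, the distance formulas for two and three dimensions can be written as one small Python function. The name `distance` and the sample points are chosen here purely for illustration and do not come from the text.

```python
import math

def distance(p, q):
    """Euclidean distance between two points with the same number of coordinates."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(p, q)))

print(distance((0, 0), (3, 4)))        # 5.0
print(distance((-1, -2), (2, 2)))      # 5.0, negative coordinates are fine
print(distance((0, 0, 0), (1, 2, 2)))  # 3.0
```

The second call shows that the relationship holds when coordinates are negative, because only the squared differences enter the sum.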

Figure 11-1 The distance between two points, (x1, y1) and (x2, y2), in a two-dimensional coordinate space is determined using the Pythagorean theorem.

DIRECTION NOTATION

For two-dimensional problems, given a line with respect to two axes x and y, there is a set of angles α and β that are designated as the x-direction angle and y-direction angle, respectively. Thus, as illustrated by Figures 11-2a and 11-2b, a clearly defined line segment can be described given the angles α and β on the coordinate axes x and y. The only restriction that applies here is that both angles α and β must be ≥ 0° and ≤ 180°.

THE COSINE FUNCTION

The cosine function applied to Figures 11-2a and 11-2b is given as

$$\cos\alpha = \frac{x_2 - x_1}{d} \tag{11-5a}$$

and

$$\cos\beta = \frac{y_2 - y_1}{d} \tag{11-5b}$$

where

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \tag{11-6}$$

Figure 11-2 Two illustrations, (a) and (b), of the x-direction angle (α) and y-direction angle (β) for a two-dimensional coordinate system.

Note that cos α and cos β are referred to as the direction cosines of the line segment described. To summarize in expanded notation:

$$\cos\alpha = \frac{x_2 - x_1}{\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}} \tag{11-7a}$$

and

$$\cos\beta = \frac{y_2 - y_1}{\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}} \tag{11-7b}$$

Example: Find the direction cosines and corresponding angles for a line segment AB, where A is (3, 5) and B is (2, 7); check your work using cos²α + cos²β = 1.0, and draw a graphic of the line segment (Figure 11-3). The solution proceeds as follows:

$$x_2 - x_1 = 2 - 3 = -1 \tag{11-8a}$$

and

$$y_2 - y_1 = 7 - 5 = 2 \tag{11-8b}$$

Therefore, the distance (d) is given by

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \tag{11-9a}$$

$$d = \sqrt{(-1)^2 + 2^2} = \sqrt{5} \tag{11-9b}$$

From the formulas above, we can determine that

$$\cos\alpha = -1/\sqrt{5}$$

and the corresponding angle is given as

$$\alpha = \cos^{-1}\left(-1/\sqrt{5}\right) = 116.57^\circ$$

We also know that

$$\cos\beta = 2/\sqrt{5}$$

therefore the angle is given by

$$\beta = \cos^{-1}\left(2/\sqrt{5}\right) = 26.57^\circ$$

Checking our work using the formula cos²α + cos²β = 1.0, we find that

$$\cos^2(116.57^\circ) + \cos^2(26.57^\circ) = 0.20 + 0.80 = 1.0$$

Figure 11-3 The x-direction angle (α = 116.57°) and y-direction angle (β = 26.57°) for a line segment, where A is (3, 5) and B is (2, 7) (see example in text).

DIRECTION IN 3-D SPACE

To continue our discussion of direction angles, we will use the same nomenclature: x, designated by α; y, designated by β; and z, newly designated by γ. We can determine the cosine of any direction angle, given the corresponding x, y, z coordinates for designated points in space, as:

$$\cos\alpha = (x_2 - x_1)/d \tag{11-10a}$$

and

$$\cos\beta = (y_2 - y_1)/d \tag{11-10b}$$

and

$$\cos\gamma = (z_2 - z_1)/d \tag{11-10c}$$

where

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2} \tag{11-11}$$

It follows algebraically that

$$\cos^2\alpha + \cos^2\beta + \cos^2\gamma = 1.0 \tag{11-12}$$

Example: Find the direction cosines and corresponding angles for a line segment AB where A is (2, −1, 4) and B is (4, 1, 2). To solve, use

$$x_2 - x_1 = 4 - 2 = 2$$

and

$$y_2 - y_1 = 1 - (-1) = 2$$

and

$$z_2 - z_1 = 2 - 4 = -2$$

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2} = \sqrt{2^2 + 2^2 + (-2)^2} = \sqrt{12} = 3.46$$

and

$$\cos\alpha = 2/3.46 = 0.577$$
$$\cos\beta = 2/3.46 = 0.577$$
$$\cos\gamma = -2/3.46 = -0.577$$

To find the direction angles corresponding to the above we use

$$\alpha = \cos^{-1}(0.577) = 54.76^\circ$$
$$\beta = \cos^{-1}(0.577) = 54.76^\circ$$
$$\gamma = \cos^{-1}(-0.577) = 125.23^\circ$$

Checking the calculations, we use cos²α + cos²β + cos²γ = 1.0, or

$$0.333 + 0.333 + 0.333 = 1.00$$
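The two worked examples (the segment from (3, 5) to (2, 7) and the segment from (2, −1, 4) to (4, 1, 2)) can be reproduced with a short Python sketch. The helper names `direction_cosines` and `direction_angles` are invented for this illustration.

```python
import math

def direction_cosines(a, b):
    """Direction cosines of the segment from point a to point b (any dimension)."""
    deltas = [q - p for p, q in zip(a, b)]
    d = math.sqrt(sum(t * t for t in deltas))
    return [t / d for t in deltas]

def direction_angles(a, b):
    """Direction angles in degrees, each between 0 and 180."""
    return [math.degrees(math.acos(c)) for c in direction_cosines(a, b)]

# Two-dimensional example: A = (3, 5), B = (2, 7).
print([round(x, 2) for x in direction_angles((3, 5), (2, 7))])  # [116.57, 26.57]

# Three-dimensional example: A = (2, -1, 4), B = (4, 1, 2).
cos_3d = direction_cosines((2, -1, 4), (4, 1, 2))
print([round(c, 3) for c in cos_3d])         # [0.577, 0.577, -0.577]
print(round(sum(c * c for c in cos_3d), 6))  # 1.0
```

Note that the unrounded cosine 2/√12 gives an angle of about 54.74°, whereas the value 54.76° in the text follows from using the rounded cosine 0.577.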

DEFINING SLOPE IN TWO DIMENSIONS

The slope m of a line segment between two points is given as:

$$m = \frac{y_2 - y_1}{x_2 - x_1} = \tan\theta \tag{11-13}$$

where θ is the x-direction angle and 0° ≤ θ < 360°. In other words, this well-known expression is equivalent to the tangent of the x-direction angle for the line segment defined by the two points on the line. Thus the slope of the line given in Figure 11-4 is tan(120°) = −1.73. Just store this information away for the next several chapters as we build a pre-chemometrics view of analytic geometry.

Figure 11-4 Illustration of the slope of a line given an x-direction angle of θ = 120°.
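A small Python sketch shows the two equivalent ways of obtaining a slope, from two points or from the x-direction angle. The function names are hypothetical, and the points (3, 5) and (2, 7) are reused from the earlier direction-cosine example.

```python
import math

def slope(p, q):
    """Slope of the segment from p = (x1, y1) to q = (x2, y2); undefined if x1 == x2."""
    (x1, y1), (x2, y2) = p, q
    return (y2 - y1) / (x2 - x1)

def slope_from_angle(theta_degrees):
    """Slope as the tangent of the x-direction angle."""
    return math.tan(math.radians(theta_degrees))

print(round(slope_from_angle(120), 2))      # -1.73
print(slope((3, 5), (2, 7)))                # -2.0
print(round(slope_from_angle(116.565), 2))  # -2.0
```

The last two lines agree because 116.565° is the x-direction angle found earlier for the segment from (3, 5) to (2, 7).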

RECOMMENDED READING

We recommend a standard text on vector analytic geometry. One good example is

1. White, P.A., Vector Analytic Geometry (Dickenson, Belmont, CA, 1966).

12 Analytic Geometry: Part 2 – Geometric Representation of Vectors and Algebraic Operations

We continue with our pre-chemometrics review of analytic geometry, noting the term “vector” in all cases can be represented by a matrix of r × c dimensions, where r = # of rows and c = # of columns. The operations defined below will be employed in future discussions.

VECTOR MULTIPLICATION (SCALAR × VECTOR)

If M represents a vector with components (or elements) (Mx, My), then sM (where s is a real number, also termed a “scalar”) is defined as the vector represented by (sMx, sMy); and the length of sM is |s| times the length of M. One can relate the direction angles of M to those of sM as follows. For the case where s > 0 (s is a positive, real number),

$$\cos\alpha_{sM} = \cos\alpha_{M} \tag{12-1a}$$

and

$$\cos\beta_{sM} = \cos\beta_{M} \tag{12-1b}$$

So the vectors sM and M have exactly the same direction. For the case where s < 0 (s is a negative, real number),

$$\cos\alpha_{sM} = -\cos\alpha_{M} \tag{12-1c}$$

and

$$\cos\beta_{sM} = -\cos\beta_{M} \tag{12-1d}$$

In this case, the vectors sM and M have exactly opposite directions. (Note: When s = 0, sM is the zero vector, for which no direction is defined.)

Example problem. If M = (1, 5), then 2M (where s = 2) = (2 × 1, 2 × 5) = (2, 10), represented in Figure 12-1 as the line segment from point (0, 0) to (2, 10). (Note: The expression −2M = (−2, −10) is represented by the line segment from point (0, 0) to (−2, −10).)

Figure 12-1 An example of scalar × vector multiplication: if M = (1, 5), then 2M = (2, 10) and −2M = (−2, −10). The segments M, 2M, and −2M run from (0, 0) to (1, 5), (2, 10), and (−2, −10), respectively.

VECTOR DIVISION (VECTOR ÷ SCALAR)

Vector division is represented as vector multiplication by using a fractional multiplier term. For example, if s = 1/2, then sM = (0.5, 2.5); if s = −1/2, then sM = (−0.5, −2.5), and so forth.

VECTOR ADDITION (VECTOR + VECTOR)

Given M = (Mx, My), where M = (1, 3); and N = (Nx, Ny), where N = (3, 1), then

$$M + N = (M_x + N_x,\; M_y + N_y) \tag{12-2}$$

The geometric representation is shown in Figure 12-2 for (1 + 3, 3 + 1) = (4, 4).

Figure 12-2 An example of vector + vector addition: if M = (1, 3) and N = (3, 1), then M + N = (4, 4).


VECTOR SUBTRACTION (VECTOR − VECTOR)

Given M = (Mx, My), where M = (1, 3), and N = (Nx, Ny), where N = (3, 1), then

$$M - N = (M_x - N_x,\; M_y - N_y)$$

The geometric representation of M − N = (1 − 3, 3 − 1) = (−2, 2) is shown in Figure 12-3.

In our next chapter we will look at the problem of representing higher dimensional space with fewer dimensions; it will be a precursor to discussions of the dimensional aspects of multivariate algorithms.

Figure 12-3 An example of vector − vector subtraction: if M = (1, 3) and N = (3, 1), then M − N = (−2, 2). The figure shows subtraction as the addition of M and −N.
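The vector operations of this chapter can be gathered into a minimal Python sketch using plain tuples. The function names `scale`, `add`, and `subtract` are chosen here for illustration only; the numerical examples are the ones used in the chapter.

```python
def scale(s, m):
    """Scalar x vector: multiply every component of m by s."""
    return tuple(s * c for c in m)

def add(m, n):
    """Vector + vector: add corresponding components."""
    return tuple(a + b for a, b in zip(m, n))

def subtract(m, n):
    """Vector - vector, written as the addition of m and -n."""
    return add(m, scale(-1, n))

print(scale(2, (1, 5)))          # (2, 10)
print(scale(-2, (1, 5)))         # (-2, -10)
print(scale(0.5, (1, 5)))        # (0.5, 2.5), i.e. division by 2
print(add((1, 3), (3, 1)))       # (4, 4)
print(subtract((1, 3), (3, 1)))  # (-2, 2)
```

Writing subtraction as addition of the negated vector mirrors the geometric picture, in which −N is laid off from the tip of M.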


13 Analytic Geometry: Part 3 – Reducing Dimensionality

For this chapter, we will reduce three-dimensional data to one-dimensional data using the techniques of projection and rotation. The (x, y, z) data will be projected onto the (x, z) plane and then rotated onto the x axis. This chapter is purely pedagogical and is intended only to demonstrate the use of projection and rotation as geometric terms.

REDUCING DIMENSIONALITY The exercise for this column is to reduce a point on a vector in 3-D space to a point on a vector in 2-D space, then to further reduce the point on a vector in 2-D space to a point on a vector in 1-D space – all the while maintaining as much information as possible. So (x, y, z) is reduced to (x, z), which is further reduced to (x). This process can be represented in symbolic language as (x, y, z) → (x, z) → x.

3-D TO 2-D BY PROJECTION

Let us calculate some of the angles relative to the vector in 3-D space as shown in Figure 13-1. To calculate these angles, we refer to Chapter 11, and if we proceed with our calculations we find

$$\alpha = \cos^{-1}(0.7071) = 45^\circ \tag{13-1}$$

and

$$\cos\beta = \frac{y_2 - y_1}{d} = \frac{2 - 0}{\sqrt{8}} = 0.7071, \qquad \beta = \cos^{-1}(0.7071) = 45^\circ \tag{13-2}$$

where

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} = \sqrt{(2 - 0)^2 + (2 - 0)^2} = \sqrt{8}$$

Figure 13-1 A point (X, Y, Z) = (2, 2, 6) located along a vector in 3-D space. Both the angle α (the angle to the x-axis) and the angle β (the angle to the y-axis), as illustrated in the figure, are shown as a projection of the 3-D vector (2, 2, 6) onto the (x, y) plane, and the proper calculations for both α and β from what is then a 2-D vector are correct as given in equations 13-1 and 13-2.

Because the third dimension is represented by the z axis, we calculate the x-direction angle on the (x, z) plane as α:

$$\alpha = \cos^{-1}\left(\frac{x_2 - x_1}{\sqrt{(x_2 - x_1)^2 + (z_2 - z_1)^2}}\right) = \cos^{-1}\left(\frac{2 - 0}{\sqrt{(2 - 0)^2 + (6 - 0)^2}}\right) = \cos^{-1}(0.3162) = 71.57^\circ \tag{13-3}$$

Now look at Table 13-1, which describes the trigonometric functions of a right triangle (Figure 13-2). If we apply Table 13-1 to this problem, we can calculate the length of a vector using trigonometric functions. Figure 13-3 illustrates the geometric problem of finding the length of the vector from A to B, that is, from the point (0, 0) to the point (2, 6) on the (x, z) plane. The angle α calculated in equation 13-3 is the x-direction angle on the (x, z) plane; it is the angle represented in Figures 13-3 and 13-4, whereas the other angle shown in Figure 13-1 is not discussed. To calculate the length of the projection of vector AB onto the (x, z) plane, we can use sin θ = opp/hyp

Table 13-1 Trigonometric functions of a right triangle

sin θ = opposite/hypotenuse        csc θ = hypotenuse/opposite
cos θ = adjacent/hypotenuse        sec θ = hypotenuse/adjacent
tan θ = opposite/adjacent          cot θ = adjacent/opposite

Figure 13-2 A right triangle showing adjacent (adj.), hypotenuse (hyp.) and opposite sides relative to angle θ.

Figure 13-3 The geometric problem associated with calculating the length of a vector AB, given a point (x, z) = (2, 6) in 2-D space. Note that the angle θ is equal to 90° − 71.57° = 18.43°.

Figure 13-4 Illustration of two-dimensional reduction to one dimension by an x-directional rotation of α = 71.57°; the vector has length L = 6.33.

which becomes hyp = opp/sin θ = 2/sin(18.43°) = 6.33. Therefore, we can project the AB vector in 3-D space onto 2-D space by using a projection onto the (x, z) plane, resulting in a point on a vector in the 2-D (x, z) plane, the vector being 6.33 units in length and having an x-direction angle equal to 71.57° (as in Figure 13-4).

2-D INTO 1-D BY ROTATION

By rotating the vector in 2-D space through 71.57° in the x-direction, we can align it to the x axis as a 1-D line 6.33 units in length (as shown in Figure 13-5).

Figure 13-5 By projecting a vector in (x, y, z) space onto a plane in (x, z) space, and by an x-directional rotation of 71.57° in the (x, z) plane, we have the reduction of a point on a vector in 3-D space to a point on a vector in 1-D space (L = 6.33).
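The projection and rotation steps of this chapter can be traced numerically with a short Python sketch. The function names are hypothetical, and the clockwise-rotation convention is an assumption made here so that the vector lands on the positive x axis.

```python
import math

def project_onto_xz(point):
    """Project an (x, y, z) point onto the (x, z) plane by dropping y."""
    x, y, z = point
    return (x, z)

def rotate_clockwise(point_2d, degrees):
    """Rotate an (x, z) point clockwise about the origin by the given angle."""
    a = math.radians(degrees)
    x, z = point_2d
    return (x * math.cos(a) + z * math.sin(a), -x * math.sin(a) + z * math.cos(a))

xz = project_onto_xz((2, 2, 6))
length = math.hypot(*xz)                        # length of the projected vector
angle = math.degrees(math.atan2(xz[1], xz[0]))  # x-direction angle in the (x, z) plane
on_axis = rotate_clockwise(xz, angle)           # rotate the vector onto the x axis

print(xz)                                # (2, 6)
print(round(length, 2), round(angle, 2)) # 6.32 71.57
print(round(on_axis[0], 4), abs(round(on_axis[1], 6)))  # 6.3246 0.0
```

The exact length is √40 ≈ 6.32; the value 6.33 quoted in the text follows from using the rounded angle 18.43° in the sine calculation.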

In our next chapter, we will be applying the lessons reviewed over these past three chapters toward a better understanding of the geometric concepts relative to multivariate regression.

14 Analytic Geometry: Part 4 – The Geometry of Vectors and Matrices

In this chapter, we plan to use the information presented over the past three chapters to illustrate the geometry of vectors and matrices; these concepts will continue to be discussed routinely throughout this series in relation to regression vectors.

ROW VECTORS IN COLUMN SPACE

Let us begin by representing a row matrix M = [1 2 3] in column space as shown in Figure 14-1. Note that the row vector M = [1 2 3] projects onto the plane defined by columns 1 and 2 as a point (1, 2) or a vector (straight line) with a C1 direction angle (α) equal to

$$\alpha = \cos^{-1}\left(\frac{C1_2 - C1_1}{d}\right) = \cos^{-1}\left(\frac{1 - 0}{\sqrt{5}}\right) = \cos^{-1}(0.4472) = 63.43^\circ \tag{14-1}$$

and a C2 direction angle (β) equal to

$$\beta = \cos^{-1}\left(\frac{C2_2 - C2_1}{d}\right) = \cos^{-1}\left(\frac{2 - 0}{\sqrt{5}}\right) = \cos^{-1}(0.8944) = 26.57^\circ \tag{14-2a}$$

where

$$d = \sqrt{(C1_2 - C1_1)^2 + (C2_2 - C2_1)^2} = \sqrt{1^2 + 2^2} = \sqrt{5} \tag{14-2b}$$

COLUMN VECTORS IN ROW SPACE

A matrix consisting of more than one row, such as

$$M = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$$

can be represented by 2-D row space as shown in Figure 14-2. Note that each column in the matrix can be represented by a column vector as shown in the figure.

Figure 14-1 A representation of a row vector M = [1 2 3] in column space (axes: Column 1, Column 2, Column 3), and the projection of this vector onto the plane represented by Columns 1 and 2, with direction angles α and β.

Figure 14-2 The representation of column vectors in row space (axes: Row 1, Row 2) of matrix $M = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$.

PRINCIPAL COMPONENTS FOR REGRESSION VECTORS

Figure 14-3a shows the projection of two column vectors, C1 = (1, 3) and C2 = (3, 1), onto their vector sum (or principal component (PC1)). We note that the product (1, 3) × (3, 1) = (1 × 3, 3 × 1) = (3, 3). The vector sum of the two column vectors passes through the point (3, 3), but the projection of each column onto PC1 gives a vector with a length equal to line segments B + C as shown in Figure 14-3b.

Figure 14-3 (a) The representation of two columns of a matrix in row space. The vector sum of the two column vectors is the first principal component (PC1). (b) A close-up view of Figure 14-3a, illustrating the line segments (A to E), direction angles (∠α, ∠β, ∠C, ∠D), and projection of Columns 1 and 2 onto the first principal component.

To determine the geometry for Figures 14-3a and 14-3b, we begin by calculating the length of line segment E (Column 1) by using the Pythagorean theorem as

$$E^2 = \mathrm{Hyp}^2 = (3 - 0)^2 + (1 - 0)^2 = 3^2 + 1^2 = 10; \quad \text{therefore } E = \sqrt{10} = 3.162 \tag{14-3}$$

Then the angle C can be determined using

$$\tan(C) = \frac{\text{opp}}{\text{adj}} = \frac{1}{3} \tag{14-4a}$$

and

$$\tan^{-1}\left(\frac{1}{3}\right) = 18.435^\circ \tag{14-4b}$$

So ∠C = 18.435°, ∠D = 18.435°, and ∠α + ∠β = 90° − 2 × 18.435° = 53.13°. Thus, ∠α and ∠β are each equal to 26.565°. It follows that the projection of the vectors represented by Columns 1 and 2 onto the vector PC1 yields a right triangle defined by the three line segments C + B, D, and E. The length of PC1 (the hypotenuse) is equal to line segments C + B and is given by

$$\cos\alpha = \frac{\text{adj}}{\text{hyp}} \Rightarrow \cos\alpha = \frac{E}{C + B} \Rightarrow 0.8944 = \frac{3.162}{\text{hyp}} \Rightarrow \text{hyp} = 3.5353 \tag{14-5}$$

So the length of the hypotenuse (segments C + B) is 3.5353. We can check our work by calculating the opposite side (D) length as

$$\tan\alpha = \frac{\text{opp}}{\text{adj}} \Rightarrow \tan\alpha = \frac{D}{E} \Rightarrow 0.500 = \frac{\text{opp}}{3.162} \Rightarrow \text{opp} = 1.5810 \tag{14-6}$$

And by using the Pythagorean theorem we can calculate the length of the hypotenuse:

$$3.162^2 + 1.5810^2 = 3.5352^2 \tag{14-7}$$
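The right-triangle arithmetic of equations 14-3 through 14-7 can be re-run in Python as a check. This sketch simply follows the construction described in the text (E as the adjacent side, D as the opposite side, C + B as the hypotenuse); the variable names are chosen here for illustration.

```python
import math

E = math.hypot(3, 1)                      # eq. 14-3: length of segment E
angle_C = math.degrees(math.atan(1 / 3))  # eq. 14-4: angle C (and D) in degrees
alpha = (90 - 2 * angle_C) / 2            # alpha and beta share what remains of 90 degrees
hyp = E / math.cos(math.radians(alpha))   # eq. 14-5: hypotenuse, segments C + B
D = E * math.tan(math.radians(alpha))     # eq. 14-6: opposite side D

print(round(E, 3), round(angle_C, 3), round(alpha, 3))  # 3.162 18.435 26.565
print(round(hyp, 4), round(D, 4))                       # 3.5355 1.5811
print(math.isclose(math.hypot(E, D), hyp))              # True, the check of eq. 14-7
```

The small differences from the values in the text (3.5353 and 1.5810) arise because the text carries rounded intermediate values such as 3.162 and 0.8944.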

By representing a row vector in column space, or a column vector in row space, we can illustrate the geometry of regression. These concepts combined with matrix algebra will be useful for further discussions of regression. In Chapters 15–20, we will digress from these topics and revisit experimental design concepts. Readers may wish to study additional materials related to the subject of analytical geometry and regression. We recommend two sources of such information below.

RECOMMENDED READING

1. Beebe, K.R. and Kowalski, B.R., Analytical Chemistry 59(17), 1007A–1017A (1987).
2. Fogiel, M., ed., The Geometry Problem Solver (Research and Education Association, New York, 1987).

15 Experimental Designs: Part 4 – Varying Parameters to Expand the Design

We have discussed experimental designs in previous papers [1–4], and in Chapters 8–10. In those previous chapters, the designs we discussed were, with the exception of one particularly interesting design (representing a special case of a more general type of design that we will discuss later), rather simple and plain, in the sense that the designs included only small numbers of levels of the various factors of interest, and were basically considerations of “all possible combinations” of those factors; these are the types of experiments that scientists have been designing “forever” without any thought or consideration that they were “statistical experimental designs”. Obviously, though, since they represent special cases of wider classes of designs, they must also come under that umbrella.

So what is special about the experimental designs that we call “statistical” or “chemometric” designs? Actually, very little, until we take a look at what happens when we need to scale these designs up to larger sample numbers or more complex designs. Before we do that, let us consider the various types of experiments, and the nature of the factors that are used in those experiments.

Someone doing an experiment is generally trying to learn about the effect of some phenomenon on some quantity that can be measured. While there are cases that do not fit the description we are about to present, one very common type of experiment involves changing (or allowing the change of) some parameter, and then measuring the effect of that change. If there is only one such parameter, the situation is pretty straightforward, but things start getting interesting when two or more possible parameters are involved. Intuitively, the first instinct is to measure the results that are obtained for all possible combinations of the available values of the parameters. In Chapter 8, we looked at some experiments that involved two parameters (factors), each at two levels. In Chapter 10, we briefly looked at a three-factor, two-level design, with attention to how it could be represented geometrically. The use of the term “three factor, two level” to describe the design means that each factor was present at two levels, that is, the corresponding parameters were each permitted to assume two values.

There are several ways we can expand a design such as this: we can increase the number of factors, the number of levels of each factor, or we can do both, of course. There are other differences that can be superimposed over the basic idea of the simple, all-possible-combinations design. One is whether we can control the levels of the factors (if we can, then we can do things that are not possible if we cannot control them). Another is whether the “levels” correspond to physical characteristics that can be evaluated, so that the values described have real physical meaning (temperature, for example, has real physical meaning, while catalyst type does not, even though different catalysts in an experiment may all have different degrees of effectiveness, and reproducibly so).


Another consideration is whether all the factors can be changed independently through their range of possible values, or whether there are limits on the possible values. The most obvious limiting situation is the case of mixtures, where all the components of a mixture must sum to 100%. Other limitations might be imposed by the physical (or chemical) behavior of the materials involved: solubility as a function of temperature, for example, or as a function of other materials present (maximum solubility of salt in water–alcohol mixtures, for example, will vary with the ratio of the two solvents). Other limits might be set by practical considerations such as safety; except for specialized work by scientists experienced in the field, few experimenters would want to work, for example, with materials at concentrations above their explosive limits.

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 9(8), 26–27 (1994).
2. Mark, H. and Workman, J., Spectroscopy 9(9), 30–32 (1994).
3. Mark, H. and Workman, J., Spectroscopy 6(1), 13–16 (1991).
4. Mark, H. and Workman, J., Spectroscopy 10(1), 17–20 (1995).

16 Experimental Designs: Part 5 – One-at-a-time Designs

In Chapter 15, which was based on reference [1], we began our discussions of factorial designs. If we expand the basic n-factor two-level experiment by increasing the number of factors, maintaining the restriction of allowing each to assume only two values, then the number of experiments required is $2^n$, where n is the number of factors. Even for experiments that are easy to perform, this number quickly gets out of hand; if eight different factors are of interest, the number of experiments needed to determine the effect of all possible combinations is 256, and this number increases exponentially. The other obvious way we might want to expand the experiment is to increase the number of levels (values) that some or all of the factors take. In this case, the number of experiments required increases even faster than $2^n$. So, for example, if each factor is at three levels, then the number of experiments needed is $3^n$ (for eight factors, corresponding to our previous calculation, this comes to 6,561 experiments!). In the general case, the number of experiments needed is $\prod_i n_i$, where $n_i$ is the number of levels of the ith factor. It should be clear at this point that the problem with this scenario is the sheer number of experiments needed, which in the real world translates into time, resources, and expense. Something must be done. Several “somethings” have been done. The intuitive experimenter, expert in his particular field of science but untrained in “statistical” designs, simplifies the whole process by throwing out all the combinations, and uses what are known simply as “one-at-a-time” designs [2]. Five variations of this basic design are described, but basically these are only useful when the random noise or error is small (compared to the expected magnitude of the effects), and involve the experimenter changing one variable (factor) at a time to see which one(s) cause the greatest effect.
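The growth in the number of runs for a full factorial design can be tabulated with a one-line Python function. The name `full_factorial_runs` is chosen here for illustration, and the mixed-level example [2, 3, 4] is hypothetical.

```python
import math

def full_factorial_runs(levels):
    """Number of runs in a full factorial design: the product of the level counts."""
    return math.prod(levels)

print(full_factorial_runs([2] * 8))    # 256, eight factors at two levels
print(full_factorial_runs([3] * 8))    # 6561, eight factors at three levels
print(full_factorial_runs([2, 3, 4]))  # 24, three factors with mixed level counts
```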
Sometimes those are then examined in greater detail, by varying them over larger ranges, and/or at values lying within the original range. This solves the problem of the proliferation of experiments, since the number of experiments needed is now only $1 + \sum_i n_i$ instead of $\prod_i n_i$, a much smaller number. It also provides a first-order indication of the effect of each of the factors. The difficulty now is the possibility of throwing out the baby with the bathwater, so to speak, by losing all information about the actual noise level, and information about any possible synergistic or inhibitory interactions between the factors. Thus, when statisticians got into the act, they saw a need to retain the information that was not included in the one-at-a-time plans, while still keeping the total number of experiments manageable: the birth of “statistical experimental designs”. Several types of “statistical experimental designs” have been developed over the years, with, of course,


innumerable variations. However, they can be placed into a fairly small group of main design types:

1) Factorial
2) Fractional factorial
3) Sequential
4) a) Latin square
   b) Graeco-Latin square
   c) Latin and Graeco-Latin cubes
5) Model-building
6) Response surface.

By far the most statistical energy has been spent on the design and analysis of factorial designs. Books dealing with such designs (e.g., [3, 4]) spend a good part of their space discussing the variations required to accommodate such considerations as replication, blocking, how to deal with situations where the experiment itself is destructive (so that the same specimen is never available for retesting), whether the experimental conditions can be reproduced at will, and whether the experimental factors (or the desired response) can be assigned meaningful numerical values. Each of these considerations dictates the types of designs that can be considered and how they must be implemented.

For our current discussions, however, we have been taking the path of discussing ways to reduce the required number of experiments, while still retaining the advantages of obtaining several types of information about the system under consideration. The simplest such type of design is the sequential design, simplest if for no other reason than that the type of design it replaces is one of the simplest designs itself. We will discuss this type of design in Chapter 17.
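The experiment counts quoted earlier in this chapter can be checked with a few lines of code. The following is a minimal sketch in Python (not a language used in the text); the function names are hypothetical, and the one-at-a-time count simply uses the expression given in the text, one plus the sum of the numbers of levels.

```python
from math import prod

def full_factorial_runs(levels):
    """Runs needed to test every combination: the product of the level counts."""
    return prod(levels)

def one_at_a_time_runs(levels):
    """Run count quoted in the text for one-at-a-time plans: 1 + sum of level counts."""
    return 1 + sum(levels)

# Eight factors at two levels each, then at three levels each
print(full_factorial_runs([2] * 8))   # 256
print(full_factorial_runs([3] * 8))   # 6561
print(one_at_a_time_runs([3] * 8))    # 25
```

The contrast between 6,561 and 25 runs for eight three-level factors is the saving that makes one-at-a-time plans attractive, at the cost of the noise and interaction information discussed above.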

REFERENCES
1. Mark, H. and Workman, J., Spectroscopy 10(9), 21–22 (1995).
2. Daniel, C., Journal of the American Statistical Association 68(342), 353–360 (1973).
3. Davies, O.L., The Design and Analysis of Industrial Experiments (Longman, New York, 1978).
4. Box, G.E.P., Hunter, W.G. and Hunter, J.S., Statistics for Experimenters (John Wiley & Sons, New York, 1978).

17 Experimental Designs: Part 6 – Sequential Designs

We begin our discussion of resource-conserving (for want of a better generic term) experimental design with a look at sequential designs. This is the first of the types of experimental designs that have, as one of their goals, a reduction in the required number of experiments, while still retaining the advantages of obtaining several types of information about the system under consideration. The simplest such type of design is the sequential design, simplest if for no other reason than that the type of design it replaces is one of the simplest designs itself. The design it replaces is the simple test for comparison of means, using the Z-test or the t-test as the test statistic; we have discussed these in our previous column series and book, “Statistics in Spectroscopy” (now in its second edition [1]).

The standard t-test (or Z-test) specifies a predefined number of measurements to be made, either for a single condition or for a pair of conditions (i.e., sample versus “control”). The difference between the two states is compared to the experimental error evidenced in the data, and a decision is made based on whether the difference between the states is “large enough” compared to the noise (or error). For a sequential test, the number of experiments is not predefined. Rather, experiments are performed sequentially (surprise!), and the series is terminated as soon as enough data is available that a decision can be made as to whether the difference is “large enough”. True, it is theoretically possible for such a sequence of experiments to be indefinitely long; in practice, however, it is far more common for the situation to become decidable after fewer experiments than are required for the case of a fixed number of experiments.

So how does this “magic” experimental design work? The best available discussion we know of is in reference [2]. The standard concept behind this experimental design is illustrated in Figure 17-1. 
As this figure shows, the “universe” is divided into three regions: region A is the region of acceptance of the null hypothesis; region C is the region of acceptance of the alternative hypothesis. The middle region, B, is the region of continuation: as long as values fall into this region, we must continue with the experiments, since there is not enough information to make a decision.

Figure 17-1 Standard concept behind sequential experimental design (see text for definition of function f(α, β)). (Axes: f(α, β) versus number of experiments; regions A, B, and C.)

Figure 17-2 shows how this works for two typical cases. First a single experiment is performed, and the results noted. If these results put it into the region of continuing the project (virtually inevitable after only one experiment), then a second experiment is performed, and so forth. Figure 17-2 shows typical results for two possible sequences of experiments: the one indicated by the crosses enters the region of acceptance of the alternative hypothesis after seven experiments; the one indicated by the circles enters the region of acceptance of the null hypothesis after nine experiments. Obviously, the actual number of experiments required will depend on both the nature of the experiments and the definition of the two regions of acceptance.

Figure 17-2 Typical results for two possible experimental sequences. (Axes: f(α, β) versus number of experiments, marked at 5, 10, and 15; regions A, B, and C; sequences labeled 1 and 2.)

The x-axis represents, clearly, the number of experiments that have been carried out. The y-axis represents a function of the results of the experiments. Important to note at this point is the fact that the quantity plotted along the y-axis is a function, not of the result of a single experiment, but, in one way or another, of the cumulative results of all the experiments done up to that point.

The key point, then, is how the lines separating the different regions are defined. The total answer will depend, of course, on which statistic is being plotted and on the details of the nature of the hypothesis test being done (e.g., two-tailed versus one-tailed, etc.). For an illustration we consider the sequential test of the hypothesis of the mean of a sample being the same as that of a given population, with the standard deviation known. In the case of fixed sample size, this would be done using a statistical hypothesis test with the Z statistic as the test statistic, and the probability level set simply to α. For a sequential test, both the theory and the computations are a bit more complicated.

In the case at hand, the defining limits are constructed as shown in Figure 17-3. The expected value of any given measurement is, of course, $\mu_0$, the population mean. Then the expected value of the sum of n readings, which we label T, equals $n\mu_0$ for each value of n, and plotting these sums as a function of n gives the central straight line shown in Figure 17-3; this line represents the expected value of the sum, and has a slope equal to $\mu_0$. As can be seen, data that agrees with the null hypothesis will follow this line and eventually move into region A, the region of acceptance of the null hypothesis.

Figure 17-3 The relationship between the expected value of the statistic and the lines separating the regions of acceptance and rejection from the region indicating continuation of the experiment. (Axes: f(α, β) versus number of experiments; regions A, B, and C; intercept $h_0$ marked.)

The lines separating the regions are defined by their slope and intercepts. If we let δ represent the minimum difference from $\mu_0$ we wish to detect, then the slope of the lines (which is common to the two lines: they are parallel) equals $\mu_0 + \delta/2$. The y-intercepts, which we designate h, are

$$h_0 = -\ln\left(\frac{1-\alpha}{\beta}\right)\frac{\sigma^2}{\delta}$$

$$h_1 = \ln\left(\frac{1-\beta}{\alpha}\right)\frac{\sigma^2}{\delta}$$

We note several interesting points about these expressions. First, the positions of the lines of demarcation depend, as we would expect, on both the minimum expected departure from $\mu_0$ we wish to detect and α. They also depend upon a quantity that is a logarithm, and the logarithm of the quantity β no less, that we have always previously dismissed. While a discussion of β properly belongs in the realm of elementary statistics, at this point it is worthwhile to go back to some of those discussions, to examine how this impacts our current interests. We will proceed along with this digression in our next chapter.
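The decision rule described in this chapter can be sketched in code. The following Python fragment is a minimal illustration, not the procedure of reference [2]: the function names are hypothetical, δ is assumed positive, and it is assumed that a running sum crossing the upper line (intercept $h_1$) accepts the alternative hypothesis while a sum crossing the lower line (intercept $h_0$) accepts the null hypothesis.

```python
from math import log

def sequential_limits(mu0, delta, sigma, alpha, beta):
    """Slope and intercepts of the two parallel decision lines (sigma known)."""
    slope = mu0 + delta / 2.0
    h0 = -log((1.0 - alpha) / beta) * sigma ** 2 / delta
    h1 = log((1.0 - beta) / alpha) * sigma ** 2 / delta
    return slope, h0, h1

def run_sequential_test(readings, mu0, delta, sigma, alpha=0.05, beta=0.05):
    """Return (decision, number of readings used); assumes delta > 0."""
    slope, h0, h1 = sequential_limits(mu0, delta, sigma, alpha, beta)
    total = 0.0  # T, the running sum of the readings
    for n, x in enumerate(readings, start=1):
        total += x
        if total >= h1 + slope * n:
            return "accept alternative", n
        if total <= h0 + slope * n:
            return "accept null", n
    return "continue", len(readings)

# Idealized, noise-free readings sitting exactly on each hypothesized mean
print(run_sequential_test([0.0] * 20, mu0=0.0, delta=1.0, sigma=1.0))
print(run_sequential_test([1.0] * 20, mu0=0.0, delta=1.0, sigma=1.0))
```

Readings that sit halfway between the two hypothesized means never leave the region of continuation, which illustrates the theoretical possibility of an indefinitely long sequence mentioned in the text.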

REFERENCES
1. Mark, H. and Workman, J., Statistics in Spectroscopy, 1st ed. (Academic Press, New York, 1991).
2. Davies, O.L., The Design and Analysis of Industrial Experiments (Longman, New York, 1978).


18 Experimental Designs: Part 7 – β, the Power of a Test

In Chapter 17 and reference [1], we started discussing the way a series of experiments could be designed so that the decision to perform another experiment could be based on the outcomes of the experiments already done. We saw there that we needed to be able to tell if we could stop because the result had become statistically significant; and we also saw that we needed a way to tell if we could stop because we had reached the statistically significant conclusion that there is no real difference between the sample and the (hypothetical) reference population. This is necessary, indeed crucial; otherwise we could continue experimenting endlessly, waiting for a statistically significant result when there was no real difference to detect, so that none would be expected.

The first stopping criterion is straightforward: it is simply the standard hypothesis test, based on the probabilities we have previously discussed of a sample coming from the hypothesized population $P_0$ [2]. The second stopping criterion, however, seems to fly in the face of our previous discussions on the topic, where we said that you could not prove two populations the same. The difficulty in proving that a sample came from a given population is easier to see if we reword the statement as a double negative, and ask whether we can prove that it did not come from a different population. Now the nature of the difficulty becomes clearer: we have no information about the nature of the “different” population that we want to test against. Now that we can see the problem, we can find a point of attack against it. We can hypothesize a population $P_a$ with any given characteristics we want, and then consider the consequences of dealing with that alternate population. 
In particular, we consider the probabilities of either accepting or rejecting our original null hypothesis (based on $P_0$) if, in fact, our sample came from the alternate population $P_a$. The probability of coming to the incorrect conclusion that the sample came from $P_0$ when it really came from $P_a$ is called the β probability (compare with the α probability, which is the probability of drawing the incorrect conclusion that a sample did not come from $P_0$ when it really did). This leads to what is known in statistical parlance as the “power” of the statistical test. Thus, in performing a statistical hypothesis test, we would normally consider only the ordinary tests against the alpha error as a means of determining statistical significance. However, as we have seen, that leaves completely open the number of samples needed. The power of a test gives us a criterion which will allow determining the number of samples. To redefine the term: the power of a statistical test is the probability of obtaining a statistically significant result given that in fact the null hypothesis is false. Ordinarily, to show a non-significant result is easy: just use few enough samples. To show that you have obtained a non-significant result when there is a high probability of obtaining a significant result for a false hypothesis is convincing indeed, and also gives us the basis for determining the number of samples needed. On the other hand, we do not want to go overboard and use so many samples that we get statistically significant results for


tiny, unimportant differences. As we will see below, the power of the test does allow us to specify the minimum number of samples required, but this number can quickly get out of hand, and show up tiny differences, if we are not careful about how we specify the requirements. The problem with defining criteria for such a test is that it depends on the β probability, which is difficult to determine (although we could arbitrarily specify a value, such as 95%). It also depends on the smallest difference you need to detect, the number of samples, the variability of the data (which at least can be determined from the data, the same way it is done for determining α), and the probability of detecting the given difference at a specified alpha significance level. Thus what we do is to work backwards, so to speak. Since we want to find the number of samples corresponding to different probabilities for α, β, and D (the difference between the data and $\mu_0$), we first find the difference corresponding to given values of the other quantities. This can be seen more easily in Figure 18-1.

To summarize Figure 18-1 in words, the top curve represents the characteristics of a population $P_0$ with mean $\mu_0$. Also indicated in Figure 18-1 is the upper critical limit, marking the 95% point for a standard hypothesis test $H_0$ that the mean of a given sample is consistent with $\mu_0$. A measured value above the critical value indicates that it would be “too unlikely” to have come from population $P_0$, so we would conclude that such a reading came from a different population. Two such possible different, or alternate, populations are also shown in Figure 18-1, and labeled $P_1$ and $P_2$. Now, if in fact a random sample was taken from one of these alternate populations, there is a given probability, whose value depends on which population it came from, that it would fall above (or below) the upper critical limit indicated for $H_0$. 
The shaded areas in Figure 18-1 indicate the probabilities for a random sample falling below the critical value for $H_0$, when one of those alternate populations is in fact the correct population from which the sample was taken. As can be seen, these probabilities are 50% for population $P_1$ and roughly 5% for population $P_2$. These probabilities are the probabilities of (incorrectly) concluding that the data is consistent with $H_0$, for the two cases. This same topic is continued in our next chapter.

Figure 18-1 Characteristics of population $P_0$ with mean $\mu_0$ and alternate populations $P_1$ and $P_2$ (Note that the X-axes have been offset for clarity). (Marked on the plot: the mean $\mu_0$ and the upper critical limit for $P_0$.)
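The two shaded areas can be reproduced numerically. The sketch below, in Python, assumes Normal populations with a common standard deviation, a one-sided upper critical limit at the 95% point of $P_0$, and a single reading; the function name and the particular means chosen for $P_1$ and $P_2$ are illustrative assumptions chosen to match the 50% and 5% values quoted in the text, not values read from the figure.

```python
from statistics import NormalDist

def beta_probability(mu0, mu_alt, sigma, alpha=0.05):
    """Upper critical limit for P0 and the probability that a reading from
    the alternate population falls below that limit."""
    critical = mu0 + NormalDist().inv_cdf(1.0 - alpha) * sigma
    beta = NormalDist(mu_alt, sigma).cdf(critical)
    return critical, beta

mu0, sigma = 0.0, 1.0
critical, _ = beta_probability(mu0, mu0, sigma)

# P1: mean exactly at the critical limit; P2: mean a further 1.645 sigma above it
_, beta_p1 = beta_probability(mu0, critical, sigma)
_, beta_p2 = beta_probability(mu0, critical + (critical - mu0), sigma)
print(round(beta_p1, 3), round(beta_p2, 3))   # 0.5 0.05
```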

REFERENCES
1. Mark, H. and Workman, J., Spectroscopy 11(2), 43 (1996).
2. Mark, H. and Workman, J., Statistics in Spectroscopy, 1st ed. (Academic Press, New York, 1991).


19 Experimental Designs: Part 8 – β, the Power of a Test (Continued)

Continuing from our previous discussion in Chapter 18 (from reference [1]): analogous to making what we have called (and is the standard statistical terminology) the α error when the data is above the critical value but is really from $P_0$, this new error is called the β error, and the corresponding probability is called the β probability. As a caveat, we must note that the correct value of β can be obtained only subject to the usual considerations of all statistical calculations: errors are random and independent, and so on. In addition, since we do not really know the characteristics of the alternate population, we must make additional assumptions. One of these assumptions is that the standard deviation of the alternate population $P_a$ is the same as that of the hypothesized population $P_0$, regardless of the value of its mean.

The existence of the β probability provides us with the tool for determining what is called the power of the test, which is just 1 − β, the probability of coming to the correct conclusion when in fact the data did not come from the hypothesized population $P_0$. This is the answer to our earlier question: once we have defined the alternate population $P_a$, we can determine the probability of a sample having come from $P_a$, just as we can determine the probability of that sample having come from $P_0$.

So how does this help us determine n? As we know from our previous discussion of the Central Limit Theorem [2], the standard deviation of the mean of a sample from a population decreases from the population standard deviation as n increases. Thus, we can fix $\mu_0$ and $\mu_a$ and adjust the α and β probabilities by adjusting n and the critical value. Normally, it is convenient to adjust the critical value to be equidistant from $\mu_0$ and $\mu_a$, and then adjust n so that that critical value represents the desired probability levels for α and β. 
As an example, we can set alpha- and beta- levels to the same value, which makes for a simple computation of the number of samples needed, at least for the simple case we have been considering: the comparison of means. If we use the 95% value for both (a very stringent test), which corresponds to a Z-value of 1.96 (as we know), then if we let D represent the difference in means between the two values (sample data and population mean), and S is the precision of the data, we find that

$$D \geq 3.92\, S/\sqrt{n} \tag{19-1}$$

so that

$$n = (3.92\, S/D)^2 \approx 15\,(S/D)^2 \tag{19-2}$$

In words, we would need 15 samples for 95% confidence on both alpha and beta, to distinguish a difference of the means equal to the precision of the measurement, and the number increases as the square of any decrease in difference we want to detect.
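Equation 19-2 is easy to evaluate for other choices of D and S. The following Python sketch uses a hypothetical function name and takes the Z-values for alpha and beta as parameters, defaulting to the 1.96 used in the text; it returns the unrounded value of n.

```python
def samples_needed(difference, precision, z_alpha=1.96, z_beta=1.96):
    """Unrounded n from n = ((z_alpha + z_beta) * S / D)**2 (equation 19-2)."""
    return ((z_alpha + z_beta) * precision / difference) ** 2

print(samples_needed(1.0, 1.0))   # about 15.4 when D equals S
print(samples_needed(0.5, 1.0))   # four times as many when D is halved
```

Halving the difference to be detected quadruples the number of samples, which is the square-law growth described in the text.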


To compute the power for a hypothesis test based on standard deviation, we would have to read off the corresponding probability points from a chi-square table; for 95% confidence on both alpha and beta, the square root of the ratio of $\chi^2(0.95, \nu)$ to $\chi^2(0.05, \nu)$ (ν = the degrees of freedom, close enough to n for now) is the ratio of standard deviations that can be distinguished at that level of power. Similarly to the case of the means, ν would also be related to the square of that ratio, but $\chi^2$ would still have to be read from tables (or computed numerically). As an example, for 35 samples, the precision of the instrument could not be tested to be better than

$$\sqrt{48.6/21.6} = \sqrt{2.25} = 1.5 \tag{19-3}$$

or 1.5 times the precision of the reference method with that amount of power, and as before, n will increase as the square of any improvement we want to demonstrate. The ratio of $\chi^2(0.95, \nu)$ to $\chi^2(0.05, \nu)$ does decrease as ν increases, but not nearly as fast as the square increases: it is a losing fight.

Thus, the use of the concept of the Power of a Test allows specification of the number of samples (although it may turn out to be very high), and by virtue of that forms the basis for performing experiments as a sequential series.
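In place of a chi-square table, the ratio in equation 19-3 can be approximated numerically. The Python sketch below substitutes the Wilson-Hilferty approximation for the tabulated chi-square points, which is a technique swapped in here for illustration rather than anything used in the text; the function names are hypothetical, and ν = 34 is assumed for 35 samples.

```python
from math import sqrt
from statistics import NormalDist

def chi2_quantile_wh(p, dof):
    """Wilson-Hilferty approximation to the chi-square point with
    cumulative probability p and dof degrees of freedom."""
    z = NormalDist().inv_cdf(p)
    a = 2.0 / (9.0 * dof)
    return dof * (1.0 - a + z * sqrt(a)) ** 3

def distinguishable_sd_ratio(dof, level=0.95):
    """Square root of the ratio of the upper to the lower chi-square point."""
    upper = chi2_quantile_wh(level, dof)
    lower = chi2_quantile_wh(1.0 - level, dof)
    return sqrt(upper / lower)

print(round(distinguishable_sd_ratio(34), 2))   # 1.5
```

Evaluating `distinguishable_sd_ratio` at larger degrees of freedom shows the slow decrease of the ratio that the text calls a losing fight.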

REFERENCES
1. Mark, H. and Workman, J., Spectroscopy 11(6), 30–31 (1996).
2. Mark, H. and Workman, J., Spectroscopy 3(1), 44–48 (1988).

20 Experimental Designs: Part 9 – Sequential Designs Concluded

Our previous two chapters, based on references [1, 2], describe how the use of the power concept for a hypothesis test allows us to determine a value for n at which we can state with both α- and β-% certainty that the given data either is or is not consistent with the stated null hypothesis $H_0$. To recap those results briefly, as a lead-in for returning to our main topic [3], we showed that the concept of the power of a statistical hypothesis test allowed us to determine both the α and the β probabilities, and that these two known values allowed us to then determine, for every n, what was otherwise a “floating” quantity, D.

At this point it should be starting to become clear what is going on. If a given set of α, β, and D allow us to determine n, then similarly, a corresponding set of α, β, and n allow us to determine D. Thus for a given α and β, n and D are functions of each other, and it then becomes a simple matter (at least in principle; in practice the math involved is extremely hairy) to determine the functionality.

In fact the actual situation is considerably more complicated to determine mathematically. In our previous discussions, we have made a number of simplifying assumptions which cannot be used if we wish to calculate correct values for our expressions, and for which the actual situation must be incorporated into the math. The first of these assumptions is the use of the Normal distribution. When we perform an experiment using a sequential design, we are implicitly using the experimentally determined value of s, the sample standard deviation, against which to compare the difference between the data and the hypothesis. As we have discussed previously, the use of the experimental value of s for the standard deviation, rather than the population value of σ, means that we must use the t-distribution as the basis of our comparisons, rather than the Normal distribution. 
This, of course, causes a change in the critical value we must consider, especially at small values of n (which is where we want to be working, after all). The other key assumption that we implied was that the standard deviation used for the comparison is constant. Of course we know that as n changes, the comparison value changes as the square root of n. This is on top of, and in addition to, the changes caused by the use of the t rather than the Normal (Z) distribution.

So how is this related to the nature of the graph used for the sequential experimental design? We forgo the detailed math here, in deference to trying to impart an intuitive grasp of the topic, and we have already presented the equations involved [3]. The limits of the allowable values around the hypothesized value close in on it as n increases. This behavior is shown in Figure 20-1. If, in fact, we were to plot the mean of the population as a function of n, it would be a horizontal line, just as shown. The mean of the actual data would vary around this horizontal line (assuming the null hypothesis was correct), at smaller and smaller distances, as n increased.

Figure 20-1 The limits of the allowable values around the hypothesized value close in on it as n increases. (Plotted against n: a horizontal line at the mean $\mu_0$, with the upper and lower critical limits converging on it.)
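The way the limits close in can be sketched numerically. The Python fragment below is a simplified illustration under stated assumptions: it uses a known σ and the Normal critical value 1.96, whereas the text points out that a t-distribution value (larger at small n) is the correct choice when s is estimated from the data; the function names are hypothetical.

```python
from math import sqrt

def critical_limits(mu0, sigma, n, z=1.96):
    """Lower and upper limits for the mean of n readings under the null hypothesis."""
    half_width = z * sigma / sqrt(n)
    return mu0 - half_width, mu0 + half_width

def first_significant(readings, mu0, sigma, z=1.96):
    """Index (1-based) of the first reading at which the running mean falls
    outside the limits, or None if it never does."""
    total = 0.0
    for n, x in enumerate(readings, start=1):
        total += x
        lower, upper = critical_limits(mu0, sigma, n, z)
        if not lower <= total / n <= upper:
            return n
    return None

# Noise-free readings offset from mu0 by one standard deviation
print(first_significant([1.0] * 10, mu0=0.0, sigma=1.0))   # 4
```

With real, noisy data the crossing point would vary from one sequence to the next; the noise-free case only shows where the converging limits meet a fixed offset.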

If the null hypothesis was wrong, then the data would vary around a line offset from the line representing $\mu_0$, and get closer and closer to that offset line instead. Eventually, at some value of n, this line would cross the converging lines representing the critical limits around $\mu_0$, indicating the result. This is the basic picture, shown in Figure 20-2. For a sequential experimental plan, the sequence is terminated at the first significant experiment, as shown.

Figure 20-2 If the null hypothesis was wrong, then the data would vary around a line offset from the line representing $\mu_0$ and get closer and closer to that line. (Plotted against n: the mean $\mu_0$, the mean of the data, the upper and lower critical limits, and the first significant reading.)

The details differ, however. By convention, instead of plotting the mean, $\mu_0$, as a function of n, the sum of the data, which has a theoretical value of $n \times \mu_0$, is used. Clearly this line will slope upward with a slope of $\mu_0$, instead of being horizontal, as will the data plot. The rest of the conceptual picture is the same, however. As we saw previously in reference [3], the slope of the line represented by $n \times \mu_0$ is paralleled by the confidence limits for the sum of the data, as represented by the equations in that reference; thus, at the point where the line representing the successive mean values from the experimental design crosses the confidence limit in Figure 20-2, so does the line representing the successive sums eventually cross the line specified by the equations in reference [3], and illustrated in Figure 20-3 here.

Figure 20-3 The approach of the upper line, representing the α probability, corresponds to the approach of the curved lines to the $n \times \mu_0$ line (representing the null hypothesis). (Plotted against n: the lines $n \times \mu_0$ and n × the mean of the data, the upper and lower critical limits, and the first significant point.)

According to the derived equations, as we saw previously, the actual confidence limits representing the α and β probabilities are straight lines parallel to each other but not parallel to the line representing $n \times \mu_0$. The approach of the upper line, representing the α probability, corresponds to the approach of the curved lines, shown in Figure 20-3, to the $n \times \mu_0$ line (representing the null hypothesis) there. The line representing β, however, being parallel to the α line, departs from the null hypothesis. This can be interpreted as stating, as we have previously implied, that it is always harder to “prove” the null hypothesis than to disprove it.

REFERENCES
1. Mark, H. and Workman, J., Spectroscopy 11(6), 30–31 (1996).
2. Mark, H. and Workman, J., Spectroscopy 11(8), 34 (1996).
3. Mark, H. and Workman, J., Spectroscopy 11(4), 32–33 (1996).


21 Calculating the Solution for Regression Techniques: Part 1 – Multivariate Regression Made Simple

For the next several chapters we will illustrate the straightforward calculations used for multivariate regression (MLR), principal components regression (PCR), partial least squares regression (PLS), and singular value decomposition (SVD). In all cases we will use the same notation and perform all mathematical operations using MATLAB (Matrix Laboratory) software [1, 2]. We have already discussed and shown many of the manual methods for calculating the matrix algebra in references [3–6].

Let us begin by identifying a simple data matrix denoted by A. A is used to represent a set of absorbances for three samples as a rows × columns matrix. For our example each row represents a different sample spectrum, and each column a different data channel (frequency). We arbitrarily designate A for our example as

$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \\ A_{31} & A_{32} \end{bmatrix} = A_{I \times K} = \begin{bmatrix} 1 & 7 \\ 4 & 10 \\ 6 & 14 \end{bmatrix} \tag{21-1}$$

Thus, the integers 1 and 7 represent the instrument signal for two data channels (frequencies 1 and 2) for sample Spectrum #1, 4 and 10 represent the same data channel signals (e.g., frequencies 1 and 2) for sample Spectrum #2, and so on. If we arbitrarily set our concentration c vector representing a single component to be a single column of numbers as

$$c_{r \times c} = \begin{bmatrix} c_{11} \\ c_{21} \\ c_{31} \end{bmatrix} = c_{I \times 1} = \begin{bmatrix} 4 \\ 8 \\ 11 \end{bmatrix} \tag{21-2}$$

we now have the data necessary to calculate the matrix of regression coefficients b, which is given by

$$b = \begin{bmatrix} b_{11} \\ b_{21} \end{bmatrix} = (A'A)^{-1} A' c = A^{+} c = \hat{p} \tag{21-3}$$

This b (also known as $\hat{p}$, the prediction vector) is often referred to as the regression vector or set of regression coefficients. Note that $(A'A)^{-1}A'$ is referred to as the pseudoinverse of A, designated as $A^{+}$. Note that there is one regression coefficient for each frequency (or data channel). The matrix of predicted values is easily obtained as Matrix A (the data matrix) × Vector b (the regression coefficients) = Vector c (the predicted values). This is shown in matrix notation as

$$A \times b = c \tag{21-4}$$


Now using the MATLAB command line software, we can easily demonstrate this solution (for the multivariate problem we have identified) using a series of simple matrix operations as shown in Table 21-1 below:

Table 21-1 Matrix operations in MATLAB to compute equations 21-1–21-4

Command line                    Comments

» A = [1 7;4 10;6 14]           Enter the A matrix

A =                             Display the A matrix
     1     7
     4    10
     6    14

» c = [4;8;11]                  Enter the concentration vector c

c =                             Display the concentration vector c
     4
     8
    11

» b = inv(A'*A)*A'*c            Calculate the regression vector
                                [Note: The inverse applies only to (A'*A)]

b =                             Display the regression vector b
    0.7722
    0.4662

» A*b                           Predict the concentrations
ans =                           [Note: A residual (or difference) exists
    4.0356                      between the predicted and the actual
    7.7509                      concentrations 4, 8, and 11].
   11.1601
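For readers without MATLAB, the arithmetic of Table 21-1 can be cross-checked with a short script. The following Python sketch uses only the standard library; the helper functions are hypothetical names written for this 3 × 2 example and are not a general-purpose matrix library.

```python
def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Ordinary matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def inverse_2x2(m):
    """Inverse of a 2 x 2 matrix via its determinant."""
    (p, q), (r, s) = m
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]

A = [[1.0, 7.0], [4.0, 10.0], [6.0, 14.0]]   # data matrix, equation 21-1
c = [[4.0], [8.0], [11.0]]                   # concentration vector, equation 21-2

At = transpose(A)
b = matmul(inverse_2x2(matmul(At, A)), matmul(At, c))   # equation 21-3
c_pred = matmul(A, b)                                   # equation 21-4

print([round(v[0], 4) for v in b])        # [0.7722, 0.4662]
print([round(v[0], 4) for v in c_pred])   # [4.0356, 7.7509, 11.1601]
```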

REFERENCES
1. MATLAB software for Windows from The MathWorks, Inc., 24 Prime Park Way, Natick, Mass. 01760-1500. Internet: [email protected]
2. O’Haver, T.C., Chemometrics and Intelligent Laboratory Systems 6, 95 (1989).
3. Workman, J. and Mark, H., Spectroscopy 8(9), 16 (1993).
4. Workman, J. and Mark, H., Spectroscopy 9(1), 16 (1994).
5. Workman, J. and Mark, H., Spectroscopy 9(4), 18 (1994).
6. Mark, H. and Workman, J., Spectroscopy 9(5), 22 (1994).

22 Calculating the Solution for Regression Techniques: Part 2 – Principal Component(s) Regression Made Simple

For the next several chapters in this book we will illustrate the straightforward calculations used for multivariate regression. In each case we continue to perform all mathematical operations using MATLAB software [1, 2]. We have already discussed and shown the manual methods for calculating most of the matrix algebra used here in references [3–6]. You may wish to program these operations yourselves or use other software to routinely make these calculations.

As in Chapter 21, we begin by identifying a simple data matrix denoted by A. A is used to represent a set of absorbances for three samples as a rows × columns matrix. For our example each row represents a different sample spectrum, and each column a different data channel (frequency). We arbitrarily designate A for our example as

$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \\ A_{31} & A_{32} \end{bmatrix} = A_{I \times K} = \begin{bmatrix} 1 & 7 \\ 4 & 10 \\ 6 & 14 \end{bmatrix} \tag{22-1}$$

Thus, 1 and 7 represent the instrument signal for two data channels (frequencies 1 and 2) for sample spectrum #1; 4 and 10 represent the same data channel signals (e.g., frequencies 1 and 2) for sample spectrum #2, and so on.

We now have the data necessary to calculate the singular value decomposition (SVD) for matrix A. The operation performed in SVD is sometimes referred to as eigenanalysis, principal components analysis, or factor analysis. If we perform SVD on the A matrix, the result is three matrices, termed the left singular vectors (LSV) matrix or the U matrix; the singular values matrix (SVM) or the S matrix; and the right singular vectors (RSV) matrix or the V matrix.

We now have enough information to find our Scores matrix and Loadings matrix. First of all, the Loadings matrix is simply the right singular vectors matrix or the V matrix; this matrix is referred to as the P matrix in principal components analysis terminology. The Scores matrix is calculated as the data matrix A × the Loadings matrix V = the Scores matrix T:

$$A \times V = T \tag{22-2}$$

Note: the Scores matrix is referred to as the T matrix in principal components analysis terminology. Let us look at what we have completed so far by showing the SVD calculations in MATLAB as illustrated in Table 22-1.


Table 22-1 Matrix operations in MATLAB to compute the SVD of data matrix A

Command line                        Comments

» A = [1 7;4 10;6 14]               Enter the A matrix

A =                                 Display the A matrix
     1     7
     4    10
     6    14

» [U,S,V] = svd(A);                 Perform SVD on the A matrix

» U                                 Display the U matrix or the left
U =                                 singular vectors (LSV) matrix
    0.3468    0.9303    0.1193
    0.5417   -0.0949   -0.8352
    0.7656   -0.3543    0.5369

» S                                 Display the S matrix or the singular
S =                                 values (SV) matrix
   19.8785         0
         0    1.6865
         0         0

» V                                 Display the V matrix or the right
V =                                 singular vectors (RSV) matrix (Note: this
    0.3576   -0.9339                is also known as the P matrix or
    0.9339    0.3576                Loadings matrix)

» T = A*V                           Calculate the Scores matrix or the
T =                                 T matrix
    6.8948    1.5690
   10.7691   -0.1600
   15.2198   -0.5976

If we arbitrarily set our concentration c vector representing a single component to be a single column of numbers as

$$c_{r \times c} = \begin{bmatrix} c_{11} \\ c_{21} \\ c_{31} \end{bmatrix} = c_{I \times 1} = \begin{bmatrix} 4 \\ 8 \\ 11 \end{bmatrix} \tag{22-3}$$

we can now use S, V, and T to calculate the following. A reconstruction of the original data matrix A is computed by using the preselected number of principal components (i.e., columns in our T and V matrices) as

$$A_{\text{estimated}} = T \times V' \tag{22-4}$$

The set of regression coefficients (i.e., the regression vector) is calculated as

$$b = V \times S^{-1} \times U' \times c \tag{22-5}$$


Table 22-2 Matrix operations in MATLAB to compute equations 22-4–22-6

Command line                                  Comments

» Aest = T*V'                                 Estimate the A data matrix

Aest =                                        Display the estimate for A
    1.0000    7.0000
    4.0000   10.0000
    6.0000   14.0000

» b = V(:,1:2)*inv(S(1:2,1:2))*U(:,1:2)'*c;   Calculate the regression vector
                                              [Note: The inverse operation refers only to
                                              the singular values matrix S. The calculation
                                              to determine b can only be performed using
                                              two columns in each of the V, S, and U
                                              matrices; this number is equivalent to the
                                              number of latent variables (or principal
                                              components) used.]

b =                                           Display the regression vector
    0.7722
    0.4662

» cest = (T*V')*b                             Predict the concentrations [Note: This
                                              computation is equivalent to (Aest × b)].

cest =                                        Display the concentration vector
    4.0356                                    [Note: For this example of PCR a residual
    7.7509                                    (or difference) exists between the predicted
   11.1601                                    and the actual concentrations 4, 8, and 11].

The predicted or estimated values of c are computed as

$$c_{\text{estimated}} = T \times V' \times b \tag{22-6}$$

Now using the MATLAB command line software, we can easily demonstrate this solution (for the multivariate problem we have identified) using a series of matrix operations as shown in Table 22-2.
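As a cross-check on equations 22-5 and 22-6, the regression vector and the predicted concentrations can be recomputed from the rounded values printed in Table 22-1. This is a sketch in plain Python, not the authors' MATLAB session; the variable names `utc`, `w`, and `residuals` are ours, and small differences in the last printed digit are expected because the inputs are rounded.

```python
# Inputs: A and c from the text; U, S, V as printed in Table 22-1 (four decimals).
A = [[1, 7], [4, 10], [6, 14]]
c = [4, 8, 11]
U = [[0.3468, 0.9303], [0.5417, -0.0949], [0.7656, -0.3543]]  # first two columns of U
s = [19.8785, 1.6865]                                          # diagonal of S(1:2,1:2)
V = [[0.3576, -0.9339], [0.9339, 0.3576]]

# Equation 22-5, b = V * inv(S) * U' * c, evaluated right to left.
utc = [sum(U[i][k] * c[i] for i in range(3)) for k in range(2)]   # U' * c
w = [utc[k] / s[k] for k in range(2)]                              # inv(S) * (U' * c)
b = [sum(V[j][k] * w[k] for k in range(2)) for j in range(2)]     # V * (...)

# Equation 22-6 with both factors retained reduces to A * b.
c_est = [sum(A[i][j] * b[j] for j in range(2)) for i in range(3)]
residuals = [ci - ce for ci, ce in zip(c, c_est)]

print("b        =", [round(x, 4) for x in b])
print("c_est    =", [round(x, 4) for x in c_est])
print("residual =", [round(x, 4) for x in residuals])
```

The nonzero residuals reflect the fact that three concentrations are being fitted with only two regression coefficients, which is why Table 22-2 reports predictions that differ slightly from 4, 8, and 11.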

REFERENCES
1. MATLAB software from The MathWorks, Inc., 24 Prime Park Way, Natick, Mass. 01760-1500.
2. O’Haver, T.C., Chemometrics and Intelligent Laboratory Systems 6, 95 (1989).
3. Workman, J. and Mark, H., Spectroscopy 8(9), 16 (1993).
4. Workman, J. and Mark, H., Spectroscopy 9(1), 16 (1994).
5. Workman, J. and Mark, H., Spectroscopy 9(4), 18 (1994).
6. Mark, H. and Workman, J., Spectroscopy 9(5), 22 (1994).

23 Calculating the Solution for Regression Techniques: Part 3 – Partial Least Squares Regression Made Simple

In this and the preceding two chapters we describe the most basic calculations for MLR, PCR, and PLS. Our intent is to show basic computations for these regression methods while avoiding unnecessary complexity which could confuse rather than instruct. There are of course a number of difficulties in taking this simplistic approach; namely, the assumptions made for our simple cases do not always hold, and poorly behaved matrices are the rule rather than the exception. We have not yet discussed the concepts of rank, collinearity, scaling, or data conditioning. Issues of graphical representation and details of computational methods and assessing model performance are forthcoming. We ask that you abide with us over the next several chapters as we intend to delve much more deeply into the details and problems associated with regression methods.

For this chapter we will illustrate the straightforward calculations used for PLS regression utilizing singular value decomposition. For PLS a special case of SVD is used. You will notice that the PLS form of SVD includes the use of the concentration vector c as well as the data matrix A. The reader will note that the scores and loadings are determined using the concentration values for PLS SVD, whereas only the data matrix A is used to perform SVD for principal components analysis. SVD and PLS SVD will be the subject of several future chapters, so we will only introduce their use here and not their derivation. All mathematical operations are completed using MATLAB software [1, 2]. As previously discussed, the manual methods for calculating the matrix algebra used within these chapters are found in references [3–7]. You may wish to program these operations yourselves or use other software to routinely make the calculations.

As in our last installment we begin by identifying a simple data matrix denoted by A. A is used to represent a set of absorbances for three samples and three data channels, as a rows × columns matrix.
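For our example each row represents a different sample spectrum, and each column a different data channel, absorbance or frequency. We arbitrarily designate A for our example as

$$A_{r \times c} = \begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix} = A_{I \times K} = \begin{bmatrix} 1 & 7 & 9 \\ 4 & 10 & 12 \\ 6 & 14 & 16 \end{bmatrix} \tag{23-1}$$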

Thus, 1, 7, and 9 represent the instrument signal for three data channels (frequencies 1, 2, and 3) for sample spectrum #1; 4, 10, and 12 represent the same data channel signals (e.g., frequencies 1, 2, and 3) for sample spectrum #2, and so on.

If we arbitrarily set our concentration vector c, representing a single component, to be a single column of numbers as

$$c_{r \times c} = \begin{bmatrix} c_{11} \\ c_{21} \\ c_{31} \end{bmatrix} = c_{I \times 1} = \begin{bmatrix} 4 \\ 8 \\ 11 \end{bmatrix} \tag{23-2}$$

we now have both the data matrix A and the concentration vector c required to calculate PLS SVD. Both A and c are necessary to calculate the special case of PLS singular value decomposition (PLSSVD). The operation performed in PLSSVD is sometimes referred to as the PLS form of eigenanalysis, or factor analysis. If we perform PLSSVD on the A matrix and the c vector, the result is three matrices: the left singular vectors (LSV) matrix, or U matrix; the singular values matrix (SVM), or S matrix; and the right singular vectors (RSV) matrix, or V matrix. We now have enough information to find our PLS Scores matrix and PLS Loadings matrix. First of all, the PLS Loadings matrix is simply the right singular vectors matrix, or V matrix; this matrix is referred to as the P matrix in principal components analysis and partial least squares terminology. The PLS Scores matrix T is calculated as the data matrix A times the PLS Loadings matrix V:

$$A \times V = T \tag{23-3}$$

Note: the PLS Scores matrix is referred to as the T matrix in principal components analysis and partial least squares terminology. Let us look at what we have completed so far by showing the PLS SVD calculations in MATLAB as illustrated in Table 23-1. We can now use S, V, and T to calculate the following. A reconstruction of the original data matrix A is computed by using the preselected number of factors (i.e., columns in our T and V matrices) as

$$A_{\text{estimated}} = T \times V' \tag{23-4}$$

The set of regression coefficients (i.e., the regression vector) is calculated as

$$b = V \times S^{-1} \times U' \times c \tag{23-5}$$

The predicted or estimated values of c are computed as

$$c_{\text{estimated}} = T \times V' \times b \tag{23-6}$$

This expression is equivalent to

$$c_{\text{estimated}} = A_{\text{estimated}} \times b = A \times b \tag{23-7}$$

or can be used to predict the concentration for a single sample spectrum a using the expression

$$c_{\text{estimated}} = a_{\text{estimated}} \times b = a \times b \tag{23-8}$$

Now using the MATLAB command line software, we can easily demonstrate this solution (for the multivariate problem we have identified) using a series of matrix operations as shown in Table 23-2.

Table 23-1 Matrix operations in MATLAB to compute the PLS SVD calculations of data matrix A (see equations 23-1 through 23-3)

| Command line | Comments |
|---|---|
| `>> A = [1 7 9;4 10 12;6 14 16]` | Enter the A matrix |
| `A = [1 7 9; 4 10 12; 6 14 16]` | Display the A matrix |
| `>> c = [4;8;11]` | Enter the c vector |
| `c = [4; 8; 11]` | Display the c vector |
| `>> [U,S,V] = SVDPLS(A,c,3);` | Perform PLS SVD on the A matrix. This is a CPAC [7] version of the PLS SVD algorithm. |
| `>> U` `U = [0.3817 -0.9067 -0.1797; 0.5451 0.0638 0.8359; 0.7465 0.4170 -0.5186]` | Display the U matrix, or left singular vectors (LSV) matrix |
| `>> S` `S = [29.5796 -0.2076 0.0000; 0.0000 1.9904 -0.0367; 0.0000 0.0000 0.2038]` | Display the S matrix, or singular values (SV) matrix |
| `>> V` `V = [0.2446 0.9345 0.2588; 0.6283 0.0506 -0.7764; 0.7386 -0.3525 0.5747]` | Display the PLS V matrix, or right singular vectors (RSV) matrix (Note: this is also known as the P matrix or PLS Loadings matrix) |
| `>> T = A*V` `T = [11.2894 -1.8839 -0.0034; 16.1236 0.0138 0.1680; 22.0801 0.6750 -0.1210]` | Calculate the PLS Scores matrix, or T matrix |
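The printed values in Table 23-1 can be used to check equation 23-3 by hand. As printed, the PLS S matrix is not diagonal: it carries small entries just above the diagonal, so forming U × S mixes adjacent columns of U. The sketch below, in plain Python with helper names of our own choosing, compares both A × V and U × S against the tabulated T; agreement is limited by the four-decimal rounding of the table.

```python
# All matrices copied from Table 23-1 (rounded to four decimals, as printed).
A = [[1, 7, 9], [4, 10, 12], [6, 14, 16]]
U = [[0.3817, -0.9067, -0.1797], [0.5451, 0.0638, 0.8359], [0.7465, 0.4170, -0.5186]]
S = [[29.5796, -0.2076, 0.0], [0.0, 1.9904, -0.0367], [0.0, 0.0, 0.2038]]
V = [[0.2446, 0.9345, 0.2588], [0.6283, 0.0506, -0.7764], [0.7386, -0.3525, 0.5747]]
T_printed = [[11.2894, -1.8839, -0.0034], [16.1236, 0.0138, 0.1680], [22.0801, 0.6750, -0.1210]]


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def max_abs_diff(X, Y):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(x - y) for rx, ry in zip(X, Y) for x, y in zip(rx, ry))


T_from_AV = matmul(A, V)   # scores as data matrix times loadings (equation 23-3)
T_from_US = matmul(U, S)   # the same scores formed from U and the printed S

print("max |A V - T| =", max_abs_diff(T_from_AV, T_printed))
print("max |U S - T| =", max_abs_diff(T_from_US, T_printed))
```

Both routes reproduce the tabulated scores to about three decimal places, which suggests that the tabulated U, S, and V satisfy A = U × S × V′ in the same way as the ordinary SVD, even though S has a different structure.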

Table 23-2 Matrix operations in MATLAB to compute equations 23-4 through 23-8

| Command line | Comments |
|---|---|
| `>> Aest = T*V'` | Estimate the A data matrix |
| `Aest = [1.0000 7.0000 9.0000; 4.0000 10.0000 12.0000; 6.0000 14.0000 16.0000]` | Display the estimate for A |
| `>> b = V*inv(S)*U'*c` | Calculate the PLS regression vector [Note: The inverse operation refers only to the singular values matrix S. The calculation to determine b is performed using three columns in each of the V, S, and U matrices; this number is equivalent to the number of latent variables (or PLS factors) used.] |
| `b = [1.1667; -0.6667; 0.8333]` | Display the regression vector |
| `>> cest = T*V'*b` | Predict the concentrations [Note: This computation is equivalent to (Aest × b)]. |
| `cest = [4.0000; 8.0000; 11.0000]` | Display the concentration vector [Note: For this simple example of PLS no residual (or difference) exists between the predicted and the actual concentrations 4, 8, and 11]. |
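The absence of a residual in Table 23-2 follows from the dimensions of this example rather than from anything special about PLS. With three samples, three data channels, and all three factors retained, A is square and nonsingular, and if A = U × S × V′ with orthogonal U and V, then V × S⁻¹ × U′ is simply the inverse of A. The sketch below, which is our own illustration in plain Python and not the CPAC routine, solves A × b = c in exact rational arithmetic; the function name `solve` is hypothetical.

```python
from fractions import Fraction

A = [[1, 7, 9], [4, 10, 12], [6, 14, 16]]
c = [4, 8, 11]


def solve(M, y):
    """Solve M x = y by Gauss-Jordan elimination in exact rational arithmetic."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(M, y)]
    for col in range(n):
        # Pick the first row at or below the diagonal with a nonzero pivot.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


b = solve(A, c)                                               # regression vector
c_est = [sum(a * bj for a, bj in zip(row, b)) for row in A]   # A x b, as in equation 23-7

print("b     =", [str(x) for x in b])
print("b     ~", [round(float(x), 4) for x in b])
print("c_est =", [str(x) for x in c_est])
```

The exact solution is b = (7/6, −2/3, 5/6), which rounds to the tabulated 1.1667, −0.6667, and 0.8333, and A × b returns 4, 8, and 11 exactly. Retaining fewer than three factors, or adding a fourth sample, would in general leave a residual, as in the PCR example of Chapter 22.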

REFERENCES
1. MATLAB software Version 4.2 for Windows from The MathWorks, Inc., 24 Prime Park Way, Natick, Mass. 01760-1500.
2. O’Haver, T.C., Chemometrics and Intelligent Laboratory Systems 6, 95 (1989).
3. Workman, J. and Mark, H., Spectroscopy 8(9), 16 (1993).
4. Workman, J. and Mark, H., Spectroscopy 9(1), 16 (1994).
5. Workman, J. and Mark, H., Spectroscopy 9(4), 18 (1994).
6. Mark, H. and Workman, J., Spectroscopy 9(5), 22 (1994).
7. Center for Process Analytical Chemistry, University of Washington, Seattle, WA, m-script library, 1993 (Contact Mel Koch or Dave Veltkamp for current versions).

24 Looking Behind and Ahead: Interlude

We depart from discussion of our usual topics in this chapter. Over the years since beginning writing on this topic, there has been a spate of telephone calls where the callers, after introducing themselves, said something that could generically be rendered as: “By chance I came across a copy of one of your articles, and am interested in reading more about this subject. Are there any more articles like this, and what are they, and how can I get them?” After discussing this between ourselves, we decided that we have reached a point where it is worthwhile to present our readers with a complete set of the chemometrics writings published to date.

Those of you who have been reading our work for a long time will recall that the column series “Chemometrics in Spectroscopy” is a continuation of our previous column series, “Statistics in Spectroscopy”. Statistics in Spectroscopy was published from 1986 to 1992, with some preliminary articles in 1985. The columns from the earlier series, “Statistics in Spectroscopy”, have been collected and published in their entirety as a book (with minor editorial changes appropriate to the change in format from a series of columns to a book) of the same name, now in its second edition.

So much for the past; what about the discussion? The last few chapters have been presenting the “nuts and bolts” of some of the more common chemometric techniques for performing quantitative chemometric/spectroscopic calibration, even getting down to the level of a “cookbook” of actual code (written for the MATLAB Matrix Algebra multivariate analysis software). The following chapters will deal first with completing a discussion on the various chemometric techniques in current use, and then go “under the hood” with them to emphasize the underlying mathematical and theoretical framework that these methods rest upon. One upcoming topic will be a description of the so-called “statistical design of experiments” methodologies, emphasizing those techniques that tend to be obscure, but are more useful than their treatment in mainstream Chemometric discussions would suggest.

25 A Simple Question: The Meaning of Chemometrics Pondered

In a 1997 paper, Steve Brown and Barry Lavine state, “Chemometrics is not a subfield of Statistics. Although statistical methods are employed in Chemometrics, they are not the primary vehicles for data analysis” [1]. Parenthetically, we recommend this article as a very nice nonmathematical introduction for the average chemist as to what Chemometrics is, and how it can be used. As far as the quote is concerned, we have to both agree and disagree. On the one hand, we have to recognize the de facto truth that many users of Chemometric techniques are not aware of the Statistical backgrounds of the techniques, and indeed, we sometimes suspect that even the developers of those techniques may not be aware of the statistical considerations, or at least may not give them their proper weight. Having said that, we will issue some disclaimers a little further on, because there are some legitimate and justifiable reasons for the existence of this situation. However, ignoring the existence of this situation means that nobody is paying the attention that would eventually lead to the condition being corrected, which would result in a better theoretical understanding of the techniques themselves, with a concomitant improvement in their reliability and definition of their range of applicability.

This leads us to the other hand, which, it should be obvious, is that we feel that Chemometrics should be considered a subfield of Statistics, for the reasons given above. Questions currently plaguing us, such as “How many MLR/PCA/PLS factors should I use in my model?”, “Can I transfer my calibration model?” (or more importantly and fundamentally: “How can I tell if I can transfer my calibration model?”), may never be answered in a completely rigorous and satisfactory fashion, but certainly improvements in the current state of knowledge should be attainable, with attendant improvements in the answers to such questions.
New questions may arise which only fundamental statistical/probabilistic considerations may answer; one that has recently come to our attention is, “What is the best way to create a qualitative (i.e., identification) model, if there may be errors in the classifications of the samples used for training the algorithm?” Part of the problem, of course, is that the statistical questions involved are very difficult, and have not yet been solved completely and rigorously even by statisticians. Another part of the problem is that very few first-class statisticians are interested in, or perhaps even aware of, the existence of our subdiscipline or its problems. Thus of necessity we push on and muddle through in the face of not always having a completely firm, mathematically rigorous foundation on which to base our use of the techniques we deal with (here comes our disclaimer). So we use these techniques anyway because otherwise we would have nothing: if we waited for complete rigor before we did anything, we would likely be waiting a long, long time, maybe indefinitely, for a solution that might never appear, and in the meanwhile be helpless in the face of the real (and real-world) problems that confront us.

But that does not mean that we should not fight the good fight while we are trying to solve current problems, or let that effort distract us. This means two things. The first is to do as we have been doing, and use our imperfect tools and our imperfect understanding of them, to continue to solve problems as best we can. But the second thing we need to do is what we have not been doing, which is to improve our understanding of the tools we use. In this endeavor, more widespread and better understanding and application of the fundamental statistical/probabilistic basis of our chemometric algorithms is crucial. Maybe one of the things we need to accomplish this is to recruit more first-class statisticians into our ranks, so that they can pay proper attention to the fundamentals, and explain them to the rest of us. Also each of us should pay attention and put some effort into learning more about these fundamentals ourselves. Then we could ourselves better understand the phenomena we see occurring in our data and analyses thereof, and then maybe eventually learn how to deal with them properly. In order to appreciate how understanding new statistical concepts can help us, let us look at an example of where we can better apply known statistical concepts, to understand phenomena currently afflicting us. To this end, let us pose the seemingly innocuous question: “When doing quantitative calibration, why is it that we use the formulation of the problem that makes the constituent values the dependent (i.e., the Y ) variable, and make the spectroscopic data the X (or independent) variable, called the Inverse Beer’s Law formulation (sometimes called the P-matrix formulation)?” (For that matter, why is the formulation that we most commonly use called “Inverse Beer’s Law” instead of the direct “Beer’s Law”?) Now, we are sure that everybody reading this chapter thinks they know the answer. 
Now, if you are among those readers, then you are wrong already, because there are multiple answers to this question, all of them correct, and each of them incomplete. Let us dispose of the most common answer first. This answer is the one given in most of the discussions about the relative merits of the two formulations, e.g. [2], and is essentially a practical one: we use the Inverse Beer’s Law formulation because by doing so, we need only determine the concentration(s) of the analyte(s) of interest. In the Beer’s law formulation, you must determine the concentrations of all components in a mixture, whether they are of interest or not. Of course, there is benefit to that also; as Malinowski points out, you can determine the number of components in a mixture and their spectra, as well as their concentrations, by proper application of the techniques of factor analysis in such a case [3].

The second answer is similar, but even more simplistic. Figure 25-1 shows a graphical depiction of a two-wavelength calibration situation: the values on the two wavelength axes determine the point on the calibration plane from which to strike a line to the concentration axis. The situation, however, is symmetric; so why don’t we consider the possibility of using the value along one of the wavelength axes along with the concentration value to determine the value along the other wavelength axis? In theory this could be done, but the reason we do not do it is the same as the answer to the main question above: we do not care; this case is of no interest to us. As chemists, we are interested in determining quantities of chemical interest, and we use the spectroscopic values as a means of attaining this goal; the reverse calculation is of no interest to us as chemists.

Figure 25-1 Symbolic graphical depiction of a two-wavelength calibration. (Axes: WL 1, WL 2, and CONC; the calibration plane is marked.)

None of these answers deals with fundamentals. So finally we get to the substantive part of the discussion, the one that connects with our original diatribe concerning the goal and role of Statistics in Chemometric calculations, the one that will give us an answer to our original question that is based on fundamental considerations, and therefore the one that is the purpose of this whole discussion.

To fully appreciate the point we have to go back a bit and look at the historical development of spectroscopic quantitative analysis. Back when we were in school and taking academic courses in Analytical Chemistry, spectroscopy was only one of many techniques presented (and one of the “minor” ones, at that). Now, we cannot really compare our experiences with what is being done currently because we are somewhat out of touch with academia, but back then what we now call the Beer’s Law formulation (i.e., making the constituent concentration the X-variable) was the one presented and taught, and we were required to use it. Of course, as an academic exercise the system was simplified: there was only one analyte in a pure solvent, so in principle it would seem that we could have put either variable on the X-axis. Nowadays, standard practice would impel us to put the analyte concentration on the Y-axis even in this simplified situation (whether it belonged there or not).

What has changed between then and now? Well, in fact a considerable amount has changed, in both the nature of the situation surrounding the analysis and the instruments we use to do the measurements. Back in the days of our academic exercises, spectrometers were based on vacuum-tube technology (remember them? – or are we dating ourselves?), were noisy, drifted terribly, and were full of all manner of error sources. The samples we used to calibrate the instrument, on the other hand, were made synthetically, by weighing the analyte on an analytical balance and dissolving it in the fixed volume of a volumetric flask. Both of these items were considered to be the highest-precision, highest-accuracy measuring devices available.
Therefore, in those days, the accuracy of the spectroscopic measurements was considered to be far inferior to the accuracy of the training samples. In those days, Statistics was more highly regarded than it is now, and the analytical chemists then knew the fundamental requirements of doing calibration work. There are several; we need not go into all of them now, but the one that is pertinent to our current discussion is the one that states that, while the Y-variable may contain error, the X-variable must be known without error. Now, in the real world this is never true, since all quantities are the result of some measurement, which will therefore have error
associated with it. In practice, however, it is sometimes possible to reduce the error to a sufficiently small value that it approximates zero well enough for the calibration calculations to work. What happens if we do not manage to keep the X error “sufficiently small”? Let us examine a situation which is just complicated enough to show the effects; three sets of data are presented in Table 25-1, which we will use along with some of the statistics associated with calibration calculations based on those data. Graphical representations of the three data sets are displayed in Figures 25-2a through 25-2c, so that the respective models can be compared to the data. We present univariate data, since that shows the effects we wish to illustrate, and is the simplest example that will do so.

Table 25-1 Three sets of data illustrating the effect of errors in X and in Y on the results obtained by calibration

| Sample # | (A) No error: X | (A) No error: Y | (B) Error in Y: X | (B) Error in Y: Y | (C) Error in X: X | (C) Error in X: Y |
|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | −1 | −1 | 0 |
| 2 | 0 | 0 | 0 | 1 | 1 | 0 |
| 3 | 10 | 10 | 10 | 9 | 9 | 10 |
| 4 | 10 | 10 | 10 | 11 | 11 | 10 |

| Statistic | (A) No error | (B) Error in Y | (C) Error in X |
|---|---|---|---|
| Intercept | 0 | 0 | 0.19231 |
| Slope | 1 | 1 | 0.96154 |
| Correlation coeff | 1 | 0.98058 | 0.98058 |
| SEE | 0 | 1.4142 | 1.38675 |
| PRESS | 0 | 2.000 | 1.92018 |

Figure 25-2 Graphical representation of three regression situations. (a) No error. (b) Error in Y only. (c) Error in X only. See text for discussion. (Each panel plots Y against X with the line of the correct model; panel (c) also shows the calculated model.)

The biggest advantage to a scenario like this is that we know the “right” answer, because we can make it whatever we want it to be. In this case, the right answer is that the intercept is zero and the slope is 1 (unity). Table 25-1A represents this condition with four samples whose data follow that model without error. The data in Table 25-1A are the prototype data upon which we will build data containing error, and investigate the effects of errors in Y and in X. We use four data points, in coincident pairs, so that when we introduce error, we can retain certain important properties that will result in the same model being the correct one for the data. Along with the data, we show the results of doing the calibration calculations on the data. For Table 25-1A, the slope and the intercept are as we described, the error (which we measure as both the Standard Error of Estimate [SEE] and using cross-validation [the PRESS statistic, using the leave-one-out algorithm]) is zero (naturally), and the correlation coefficient is unity, a necessary concomitant of having zero error.

Now in Table 25-1B, we introduce error into the Y variable. We do so by adding +1 to one each of the high and low values, and −1 to each of the other high and low values. This maintains symmetry and keeps the average position of each pair of points the same, which guarantees that the correct model for the data does not change. This is in accordance with theory and is borne out when the calibration calculations are performed: the model is identical, even though the error (SEE) is no longer zero and the correlation coefficient is no longer unity. Go ahead: redo the calculations and check this out for yourself.

Now, the purists and the sharper-eyed among us may argue that another requirement of regression theory is that the errors follow a Normal (i.e., Gaussian) distribution and that these errors are not distributed properly. We counter this argument by pointing out that there is not enough data to tell the difference; there is no significance test that can be used to demonstrate that the data either do or do not follow any predetermined distribution.

Finally, and of most interest, are the data in Table 25-1C. Here we have taken the same errors as in Table 25-1B and applied them to the X variable rather than the Y variable. By symmetry arguments, we might expect that we should find the same results as in Table 25-1B. In fact, however, the results are different, in several notable ways. In the first place, we arrive at the wrong model. We know that this model is not correct because we know what the right model is, since we predetermined it. This is the first place where what the statisticians have told us about the results is seen. In statistical parlance, the presence of error in the X variable “biases the coefficient toward zero”, and so we find: the slope is decreased (always decreased) from the correct value (of unity, with these data) to 0.96+. So the first problem is that we obtain the wrong model.
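For readers who want to redo the calculations, the statistics in Table 25-1 can be reproduced with a few lines of plain Python. This is a sketch under one stated assumption: the text does not say how PRESS is scaled, but the tabulated values (2.000 and 1.92018) match the root-mean-square of the leave-one-out residuals, so that is what is computed here under the name `press_rms`. The function name `fit_stats` and the dictionary keys are ours.

```python
from math import sqrt


def fit_stats(x, y):
    """Ordinary least-squares line y = intercept + slope * x, with summary statistics."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    see = sqrt(sum(e * e for e in resid) / (n - 2))      # standard error of estimate
    r = sxy / sqrt(sxx * syy)                            # correlation coefficient
    # Leave-one-out residuals via the leverage identity e_i / (1 - h_i).
    loo = [e / (1 - (1 / n + (xi - mx) ** 2 / sxx)) for e, xi in zip(resid, x)]
    press_rms = sqrt(sum(e * e for e in loo) / n)        # assumed scaling, see lead-in
    return {"intercept": intercept, "slope": slope, "r": r,
            "see": see, "press_rms": press_rms, "sxx": sxx, "sxy": sxy}


data = {
    "A": ([0, 0, 10, 10], [0, 0, 10, 10]),      # no error
    "B": ([0, 0, 10, 10], [-1, 1, 9, 11]),      # error in Y
    "C": ([-1, 1, 9, 11], [0, 0, 10, 10]),      # error in X
}
results = {name: fit_stats(x, y) for name, (x, y) in data.items()}

for name, s in results.items():
    print(name, {k: round(v, 5) for k, v in s.items()})
```

The intermediate sums show where the bias in Table 25-1C comes from. The slope is the ratio Sxy/Sxx; moving the errors from Y to X leaves Sxy at 100 but inflates Sxx from 100 to 104, so the slope falls from 1 to 100/104 ≈ 0.96154. Error in X can only add to Sxx in this arrangement, which is the sense in which the coefficient is biased toward zero.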
The next item we will look at is the correlation coefficient. The correlation coefficient for Table 25-1C is identical to that in Table 25-1B. There is nothing particularly noteworthy about this, except that the correlation coefficient is useless as a means of distinguishing between the two cases: obviously, since we obtain the same result in both situations, we cannot tell from the value of the correlation coefficient which situation we are dealing with.

Now we come to the Standard Error of Estimate and the PRESS statistic, which show interesting behavior indeed. Compare the values of these statistics in Tables 25-1B and 25-1C. Note that the value in Table 25-1C is lower than the value in Table 25-1B. Thus, using either of these as a guide, an analyst would prefer the model of Table 25-1C to that of Table 25-1B. But we know a priori that the model in Table 25-1C is the wrong model. Therefore we come to the inescapable conclusion that in the presence of error in the X variable, the use of SEE, or even cross-validation as an indicator, is worse than useless, since it is actively misleading us as to the correct model to use to describe the data.

This is for univariate data; what happens in the case of multivariate (multiwavelength) spectroscopic analysis? The same thing, only worse. To calculate the effects rigorously and quantitatively is an extremely difficult exercise for the multivariate case, because not only are the errors themselves involved, but in addition the correlation structure of the data exacerbates the effects. Qualitatively we can note that, just as in the univariate case, the presence of error in the absorbance data will “bias the coefficient(s) toward zero”, to use the formal statistical description. In the multivariate case, however, each coefficient will be biased by different amounts, reflecting the different amounts of noise (or error, more generally) affecting the data at different wavelengths. As mentioned above, these
effects will be exacerbated by intercorrelation between the data at different wavelengths. The difficulty comes when you realize that it is not simply the correlations between pairs of wavelengths that are operative in this regard, but also the intercorrelation effects of the data when the wavelengths are taken 3, 4, ..., n at a time. This is what has made the problem so intractable.

Now, we are sure that there are some readers who will read this and say something along the lines of “well, all you need do is do a PCA/PLS analysis and get rid of all those effects”. Actually, there might be a germ of truth to that, if you can always do all your calibration modeling using only the first two or three PCA or PLS factors. Beyond that you will run into what we might almost call the Law of Conservation of Error (except for the fact that, as we all know, error is much easier to create than destroy!). In special cases, however, such as PCA and PLS, the total error really is constant, so that we quickly get into territory where the noise that you pushed out of the first couple of factors reappears, and affects the higher factors even more than the original noise affected the original data.

So in the long-gone days of our academic lives, the chemical measurements, being based on high-accuracy gravimetric and volumetric techniques, were indeed the proper ones to put on the X-axis. Contrast that with the current state of technology: instruments have improved enormously, and rather than making up training samples by simple gravimetric dilutions, we often obtain our training, or reference, values through complicated analytical methodologies, which are themselves fraught with so much error that even in favorable cases, the error can be 5–10% of the analytical value. In our current practice, therefore, the error in the reference lab values really is greater than the error in the absorbance data.
For this reason it is now appropriate to reverse the positions of the concentration and absorbance values relative to their place in the calculation schema. So it is the changing nature of the world and the types of analyses we do that dictate how we go about organizing the calculations we use to do them. This comes from fundamental considerations of the behavior of the modeling process, which the science of Statistics can tell us about.

REFERENCES
1. Lavine, B.K. and Brown, S., Today’s Chemist at Work 6(9), 29–37 (1997).
2. Brown, C.W., Spectroscopy 1(4), 32–37 (1986).
3. Malinowski, E.R., Factor Analysis in Chemistry, 2nd ed. (John Wiley & Sons, New York, 1991).

26 Calculating the Solution for Regression Techniques: Part 4 – Singular Value Decomposition

In Chapters 21–23 and in this chapter, we have described the most basic calculations for MLR, PCR, and PLS. To reiterate, our intention is to demonstrate these basic computations for each mathematical method presently, and then to delve into greater detail as the chapters progress; consider these articles linear algebra bytes. For this chapter we will illustrate the basic calculation and mathematical relationships of different matrices for the calculations of Singular Value Decomposition or SVD. You will note from previous chapters that SVD is used for modern computations of principal components regression (PCR) and partial least squares regression (PLSR), although slightly different forms of SVD are used for each set of computations. Recall that for PCR we simply used SVD, and for PLS we used a special case of SVD that we called PLS SVD. You will also recall that the PLS form of SVD includes the use of the concentration vector c as well as the data matrix A. The reader will note that the scores (T) and loadings (V) are determined using the concentration values for PLS SVD, whereas only the data matrix A is used to perform SVD for principal components analysis.

All mathematical operations used for this chapter are completed using MATLAB software for Windows [1]. As previously discussed, the manual methods for calculating the matrix algebra used within these chapters are found in references [2–5]. You may wish to program these operations yourselves or use other software to routinely make the calculations. As in previous installments we begin by identifying a simple data matrix denoted by A. A is used to represent a set of absorbances for three samples and three data channels, as a rows × columns matrix. For our example each row represents a different sample spectrum, and each column a different data channel, absorbance or frequency.
We arbitrarily designate A for our example as

$$A_{r \times c} = \begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix} = A_{I \times K} = \begin{bmatrix} 1 & 7 & 9 \\ 4 & 10 & 12 \\ 6 & 14 & 16 \end{bmatrix} \tag{26-1}$$

Thus, 1, 7, and 9 represent the instrument signal for three data channels (frequencies 1, 2, and 3) for sample spectrum #1; 4, 10, and 12 represent the same data channel signals (e.g., frequencies 1, 2, and 3) for sample spectrum #2, and so on. Given any data matrix A of arbitrary size (as rows × columns), the matrix A can be written or defined using the computation of Singular Value Decomposition [6–8] as follows, with V′ denoting the transpose of V:

$$A = USV' = U \times S \times V' \tag{26-2}$$

where U is the matrix of left singular vectors, V is the loadings matrix, and S is the diagonal matrix containing information on the variance described by each principal component (as the S matrix columns). It is important to note when reviewing the use of SVD in the literature that many references define the scores matrix (T) as U × S. Keep in mind that the scores can be calculated as

$$U \times S = A \times V = T \tag{26-3}$$

and it holds that the original data matrix A can be reconstructed as

$$U \times S \times V' = T \times V' = A \times V \times V' = A \times I = A \tag{26-4}$$

The step V × V′ = I holds because the columns of V are orthonormal.

We can demonstrate the interrelationships between the different matrices resulting from the SVD calculations by the use of MATLAB as shown in Table 26-1. By studying the relationships between the various matrices resulting from the computation of SVD, one can observe that there are several ways to compute the same final results, making it somewhat difficult to follow the literature. However, knowing these inner mathematical relationships can help clarify our understanding of the different nomenclature. We will compare and contrast some of the literature and the use of different terms in later installments; right now just tuck this information away for future reference.

Table 26-1 Simple SVD performed on matrix A using MATLAB; other matrix relationships are also shown (see equations 26-1 through 26-4)

Command line: A = [1 7 9;4 10 12;6 14 16]
Comments: Enter the A matrix

Command line:
A =
     1     7     9
     4    10    12
     6    14    16
Comments: Display the A matrix

Command line: [U,S,V] = svd(A)
Comments: Calculate the SVD of A

Command line:
U =
    0.3821    0.9061   -0.1814
    0.5451   -0.0624    0.8361
    0.7463   -0.4183   -0.5178
Comments: Display the U matrix, also known as the left singular vectors matrix, and rarely referred to as the scores matrix. The scores matrix is most often denoted as U × S or A × V, which as it turns out are exactly the same.

Command line:
S =
   29.5803         0         0
         0    1.9907         0
         0         0    0.2038
Comments: Display the S matrix or the singular values matrix. This diagonal matrix contains information on the variance described by each principal component. Note: the squares of the singular values are termed the eigenvalues.

Command line:
V =
    0.2380   -0.9312    0.2762
    0.6279   -0.0694   -0.7752
    0.7410    0.3579    0.5681
Comments: Display the V matrix or the right singular vectors matrix; this is also known as the loadings matrix. Note: the columns of this matrix are the eigenvectors corresponding to the positive eigenvalues.

Command line:
U*S*V'
ans =
    1.0000    7.0000    9.0000
    4.0000   10.0000   12.0000
    6.0000   14.0000   16.0000
Comments: U*S*V' is equivalent to the original data matrix A, here recovered from the matrices derived using the SVD computation.

Command line:
T = A*V
T =
   11.3024    1.8038   -0.0370
   16.1231   -0.1243    0.1704
   22.0748   -0.8328   -0.1055
Comments: The scores matrix (often designated as T) can be calculated as A × V.

Command line:
U*S
ans =
   11.3024    1.8038   -0.0370
   16.1231   -0.1243    0.1704
   22.0748   -0.8328   -0.1055
Comments: As mentioned in the text of the article, the scores matrix T can also be calculated as U × S.

Command line:
T*V'
ans =
    1.0000    7.0000    9.0000
    4.0000   10.0000   12.0000
    6.0000   14.0000   16.0000
Comments: As we have stated, the original data matrix A can be estimated as the scores matrix (T) × the transpose of the loadings matrix (V′), as shown.

Command line:
A*V*V'
ans =
    1.0000    7.0000    9.0000
    4.0000   10.0000   12.0000
    6.0000   14.0000   16.0000
Comments: Just another way to estimate the original data matrix A. In this case, V times the transpose of V is a diagonal matrix with ones along the diagonal, as shown here. Note: this matrix of ones along the diagonal is called an identity matrix, or I.
    1.0000    0.0000    0.0000
    0.0000    1.0000    0.0000
    0.0000    0.0000    1.0000
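The relationships in equations 26-3 and 26-4 can also be checked outside MATLAB. The following is a minimal sketch in plain Python, assuming the four-decimal U, S, and V values printed in Table 26-1. The helper names (`matmul`, `transpose`, `close`) are ours and purely illustrative, and the tolerance of 0.01 is an assumption chosen to absorb the rounding of the printed values.

```python
# Check equations 26-3 and 26-4 with the rounded values from Table 26-1.
# All helper names are illustrative; no SVD routine is called here.

A = [[1, 7, 9], [4, 10, 12], [6, 14, 16]]
U = [[0.3821, 0.9061, -0.1814],
     [0.5451, -0.0624, 0.8361],
     [0.7463, -0.4183, -0.5178]]
S = [[29.5803, 0, 0], [0, 1.9907, 0], [0, 0, 0.2038]]
V = [[0.2380, -0.9312, 0.2762],
     [0.6279, -0.0694, -0.7752],
     [0.7410, 0.3579, 0.5681]]


def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def transpose(X):
    """Swap rows and columns."""
    return [list(col) for col in zip(*X)]


def close(X, Y, tol=0.01):
    """True when every element of X is within tol of its partner in Y."""
    return all(abs(x - y) <= tol
               for rx, ry in zip(X, Y) for x, y in zip(rx, ry))


T_from_A = matmul(A, V)                      # scores as A x V
T_from_U = matmul(U, S)                      # scores as U x S
A_rebuilt = matmul(T_from_U, transpose(V))   # U x S x V'
VVt = matmul(V, transpose(V))                # should be close to I
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

print(close(T_from_A, T_from_U))   # equation 26-3
print(close(A_rebuilt, A))         # equation 26-4
print(close(VVt, identity))        # V x V' = I
```

Because the inputs are rounded to four decimals, the comparisons are only approximate; a full-precision SVD routine would agree to machine precision, possibly with the signs of matching columns of U and V flipped.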

REFERENCES
1. MATLAB software for Windows from The MathWorks, Inc., 24 Prime Park Way, Natick, Mass. 01760-1500. Internet: [email protected].
2. Workman, J. and Mark, H., Spectroscopy 8(9), 16 (1993).
3. Workman, J. and Mark, H., Spectroscopy 9(1), 16 (1994).
4. Workman, J. and Mark, H., Spectroscopy 9(4), 18 (1994).
5. Mark, H. and Workman, J., Spectroscopy 9(5), 22 (1994).
6. Mandel, J., American Statistician 36, 15 (1982).
7. Golub, G.H. and Van Loan, C.F., Matrix Computations, 2nd ed. (The Johns Hopkins University Press, Baltimore, MD, 1989), pp. 427, 431.
8. Searle, S.R., Matrix Algebra Useful for Statistics (John Wiley & Sons, New York, 1982), p. 316.


27 Linearity in Calibration

Those who know us know that we have always been proponents of the approach to calibration that uses a small number of selected wavelengths. The reasons for this are partly historical, since we became involved in Chemometrics through our involvement in near-infrared spectroscopy, back when wavelength-based calibration techniques were essentially the only ones available, and these methods did yeoman’s service for many years. When full-spectrum methods came on the scene (PCR, PLS) and became popular, we adopted them as another set of tools in our chemometric armamentarium, but always kept in mind our roots, and used wavelength-based techniques when necessary and appropriate. We always knew that they could sometimes perform better than the full-spectrum techniques under the proper conditions, despite all the hype of the proponents of the full-spectrum methods. Lately, various other workers have also noticed that eliminating “extra” wavelengths could improve the results, but nobody (including ourselves) could predict when this would happen, or explain or define the conditions that make it possible.

The advantages of the full-spectrum methods are obvious, and are promoted by the proponents of full-spectrum methods at every opportunity: the ability to reduce noise by averaging data over both wavelengths and spectra; noise rejection by rejecting the higher factors, into which the noise is preferentially placed; the advantages inherent in the use of orthogonal variables; and the avoidance of the time-consuming step of performing the wavelength selection process. The main problem was to define the conditions where wavelength selection was superior; we could never quite put our finger on what characteristics of spectra would allow the wavelength-based techniques to perform better than full-spectrum methods. Until recently.
What sparked our realization of (at least one of) the key characteristics was an on-line discussion in the NIR discussion group [1] dealing with a similar question, whereupon the ideas floating around in our heads congealed. At the time, the concept was proposed simply as a thought experiment, but afterward, the realization dawned that it was a relatively simple matter to convert the thought experiment into a computer simulation of the situation, and check it out in reality (or at least as near to reality as a simulation permits). The advantage of this approach is that simulation allows the experimenter to separate the effect under study from all other effects and investigate its behavior in isolation, something which cannot be done in the real world, especially when the subject is something as complicated as the calibration process based on real spectroscopic data.

The basic situation is illustrated in Figure 27-1. What we have here is a simulation of an ideal case: a transmission measurement using a perfectly noise-free spectrometer through a clear, non-absorbing solvent, with a single, completely soluble analyte dissolved in it. The X-axis represents the wavelength index, the Y-axis represents the measured absorbance. In our simulation there are six evenly spaced concentrations of analyte, with simulated “concentrations” ranging from 1 to 6 units, and a maximum simulated absorbance for the highest concentration sample of 1.5 absorbance units. Theoretically, this situation should be describable, and modeled, by a single wavelength, or a single factor. Therefore in our simulation we use only one wavelength (or factor) to study.

Figure 27-1 Six samples’ worth of spectra with two bands, without (left) and with (right) stray light. X-axis: wavelength index (1–301); Y-axis: absorbance. (see Color Plate 1)

For the purpose of our simulation, the solute is assumed to have two equal bands, both of which perfectly follow Beer’s law. What we want to study is the effect of nonlinearities on the calibration. Any nonlinearity would do, but in the interest of retaining some resemblance to reality, we created the nonlinearity by simulating the effect of stray light in the instrument, such that the spectra are measured with an instrument that exhibits 5% stray light at the higher wavelengths. Now, 5% might be considered an excessive amount of stray light, and certainly, most actual instruments can easily exhibit more than an order of magnitude better performance. However, this whole exercise is being done for pedagogical purposes, and for that reason, it is preferable for the effects to be large enough to be visible to the eye; 5% is about right for that purpose. Thus, the band at the lower wavelengths exhibits perfect linearity, but the one at the higher wavelengths does not. Therefore, even though the underlying spectra follow Beer’s law, the measured spectra not only show nonlinearity, they do so differently at different wavelengths. This is clearly shown in Figure 27-2, where absorbance versus concentration is plotted for the two peaks.

Now, what is interesting about this situation is that ordinary regression theory and the theory of PCA and PLS specify that the model generated must be linear in the coefficients. Nothing is specified about the nature of the data (except that it be noise-free, as our simulated data is); the data may be non-linear to any degree. Ordinarily this is not a problem because any data transform may be used to linearize the data, if that is desirable.
In this case, however, one band is linearly related to the concentrations and one is not; a transformation, blindly applied, that linearized the absorbance of the higher-wavelength band would cause the other band to become non-linear. So now, what is the effect of all this on the calibration results that would be obtained? Clearly, in a wavelength-based approach, a single wavelength (which would be theoretically correct), at the peak of the lower-wavelength band, would give a perfect fit to the absorbance data. On the other hand, a single wavelength at the higher-wavelength band would give errors due to the nonlinearity of the absorbance. The key question then becomes, how would a full-wavelength (factor-based) approach behave in this situation?
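Before turning to that question, the two single-wavelength cases can be sketched numerically. The following is a minimal sketch, not the authors' simulation code: the text does not state how stray light was modeled, so the formula T_measured = (T + s)/(1 + s) is our assumption, as are the function names. Concentration is regressed on the peak absorbance of each band, using the six concentrations and the 1.5 absorbance maximum given in the text.

```python
import math

STRAY = 0.05                             # 5% stray light, as in the text
concs = [1, 2, 3, 4, 5, 6]
true_abs = [0.25 * c for c in concs]     # Beer's law; 1.5 AU at concentration 6


def with_stray_light(a, s=STRAY):
    """Apparent absorbance when a fraction s of stray light reaches the detector.

    Assumed model (not stated in the text): stray light adds equally to the
    sample and reference readings, so T_measured = (T + s) / (1 + s).
    """
    t = 10.0 ** (-a)
    return -math.log10((t + s) / (1.0 + s))


def calibration_stats(x, y):
    """Fit y = b0 + b1*x by least squares; return (SEE, r, F)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    ss_res = max(syy - sxy ** 2 / sxx, 0.0)      # residual sum of squares
    see = math.sqrt(ss_res / (n - 2))
    r = sxy / math.sqrt(sxx * syy)
    # F is unbounded for a perfect fit, so report infinity in that case.
    f = float("inf") if ss_res < 1e-12 else (syy - ss_res) / (ss_res / (n - 2))
    return see, r, f


measured_abs = [with_stray_light(a) for a in true_abs]

# Concentration regressed on the absorbance at each band's peak wavelength.
see_lin, r_lin, f_lin = calibration_stats(true_abs, concs)
see_non, r_non, f_non = calibration_stats(measured_abs, concs)

print("linear band:     SEE=%.4f r=%.4f F=%s" % (see_lin, r_lin, f_lin))
print("non-linear band: SEE=%.4f r=%.4f F=%.0f" % (see_non, r_non, f_non))
```

Under these assumptions the linear band fits perfectly (SEE of zero, correlation of one), while the stray-light band shows a clearly nonzero SEE and a correlation slightly below one, comparable to the non-linear wavelength entries reported in Table 27-1. A different stray-light model would change the numbers but not the qualitative picture.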

Figure 27-2 Absorbance versus concentration, without (upper) and with (lower) stray light. X-axis: concentration (1–6); Y-axis: absorbance (0–1.6).

In the discussion group, it was conjectured that a single factor would split the difference; the factor would take on some character of both absorbance bands, and would adjust itself to give less error than the non-linear band alone, but still not be as good as using the linear band. Figure 27-3 shows the factor obtained from the PCA of this data. It seems to be essentially Gaussian in the region of the lower-wavelength band, and somewhat flattened in the region of the higher-wavelength band, conforming to the nature of the underlying absorbances in the two spectral regions.

Figure 27-3 First principal component from concentration spectra. X-axis: wavelength index; Y-axis: loading (0–0.2).

Because of the way the data was created, we can rely on the calibration statistics as an indicator of performance. There is no need to use a validation set of data here. Validation sets are required mainly to assess the effects of noise and intercorrelation. Our simulated data contains no noise. Furthermore, since we are using only one wavelength or one factor, intercorrelation effects are not operative, and can be ignored. Therefore the final test lies in the values obtained from the sets of calibration results, which are presented in Table 27-1.

Table 27-1 Calibration statistics obtained from the three calibration models discussed in the text

                Linear wavelength   Non-linear wavelength   Principal component
SEE             0                   0.237                   0.0575
Corr. Coeff.    1                   0.9935                  0.9996
F               –                   305                     5294

Those results seem to bear out our conjecture. The different calibration statistics all show the same effects: the full-wavelength approach does seem to sort of “split the difference” and accommodate some, but not all, of the non-linearities; the algorithm uses the data from the linear region to improve the model over what could be achieved from the non-linear region alone. On the other hand, it could not do so completely; it could not ignore the effect of the nonlinearity entirely to give the best model that this data was capable of achieving. Only the single-wavelength model using only the linear region of the spectrum was capable of that.

So we seem to have identified a key characteristic of chemometric modeling that influences the capabilities of the models that can be achieved: not nonlinearity per se, because simple nonlinearity could be accommodated by a suitable transformation of the data, but differential nonlinearity, which cannot be fixed that way. In those cases where this type of differential, or non-uniform, nonlinearity is an important characteristic of the data, then selecting those wavelengths and only those wavelengths where the data are most nearly linear will provide better models than the full-spectrum methods, which are forced to include the non-linear regions as well, are capable of.

Now, the following discussion does not really constitute a proof of this condition (in the mathematical sense), but this line of reasoning is fairly convincing that this must be so. If, in fact, a full-spectrum method is splitting the difference between spectral regions with different types and degrees of nonlinearity, then those regions, at different wavelengths, themselves must have different amounts of nonlinearity, so that some regions must be less nonlinear than others. Furthermore, since the full-spectrum method (e.g., PCR) has a nonlinearity that is, in some sense, between that of the lowest and highest, then the wavelengths of least nonlinearity must be more linear than the full-spectrum method and therefore give a more accurate model than the full-spectrum algorithm. All that is needed in such a case, then, is to find and use those wavelengths.
Thus, when this condition of differential nonlinearity exists in the data, modeling techniques based on searching through and selecting the “best” wavelengths (essentially we’re saying MLR) are capable of creating more accurate models than full-wavelength methods, since almost by definition this approach will find the wavelength(s) where the effects of nonlinearity are minimal, which the full-spectrum methods (PCA, PLS) cannot do.
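The “split the difference” behavior of a single factor can be explored with a small simulation. This is a sketch under several assumptions that are ours rather than the chapter's: Gaussian bands centered at wavelength indices 75 and 225 with a width of 20, the stray-light model T_measured = (T + s)/(1 + s), no mean centering, and power iteration standing in for a library PCA routine. It compares the standard error of estimate (SEE) of a one-factor model with the two single-wavelength models.

```python
import math

STRAY = 0.05          # 5% stray light, as in the text
N_WL = 301            # number of wavelength indices (assumed from Figure 27-1)
CONCS = [1, 2, 3, 4, 5, 6]


def band(wl, center, width=20.0):
    """Gaussian band of unit height (center and width are assumptions)."""
    return math.exp(-0.5 * ((wl - center) / width) ** 2)


def with_stray_light(a, s=STRAY):
    """Assumed stray-light model: T_measured = (T + s) / (1 + s)."""
    return -math.log10((10.0 ** (-a) + s) / (1.0 + s))


def spectrum(conc):
    """Linear band near index 75, stray-light-affected band near index 225."""
    a = 0.25 * conc
    return [a * band(w, 75) + with_stray_light(a * band(w, 225))
            for w in range(1, N_WL + 1)]


def first_factor_scores(X, iterations=200):
    """Scores on the first principal component (no mean centering), found by
    power iteration on the small samples-by-samples matrix X X'."""
    G = [[sum(p * q for p, q in zip(ri, rj)) for rj in X] for ri in X]
    u = [1.0] * len(X)
    for _ in range(iterations):
        v = [sum(g * x for g, x in zip(row, u)) for row in G]
        norm = math.sqrt(sum(x * x for x in v))
        u = [x / norm for x in v]
    eigenvalue = sum(ui * sum(g * x for g, x in zip(row, u))
                     for ui, row in zip(u, G))
    return [math.sqrt(eigenvalue) * ui for ui in u]


def see(x, y):
    """Standard error of estimate for the least-squares line y = b0 + b1*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return math.sqrt(max(syy - sxy ** 2 / sxx, 0.0) / (n - 2))


X = [spectrum(c) for c in CONCS]
see_linear = see([row[74] for row in X], CONCS)       # peak of the linear band
see_nonlinear = see([row[224] for row in X], CONCS)   # peak of the stray-light band
see_factor = see(first_factor_scores(X), CONCS)       # one-factor model

print("linear wavelength     SEE = %.4f" % see_linear)
print("non-linear wavelength SEE = %.4f" % see_nonlinear)
print("first factor          SEE = %.4f" % see_factor)
```

With these assumed band shapes the factor's SEE should fall between the two single-wavelength values. The exact figure depends on the assumed band widths and on whether the data are mean-centered, so it need not reproduce the principal component column of Table 27-1.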

REFERENCE 1. The moderator of this discussion group was Bruce Campbell. He can be reached for information, or to join the discussion group by sending a message to: [email protected]. New members are welcome.

28

Challenges: Unsolved Problems in Chemometrics

We term the issues we plan to discuss in this chapter as “unsolved” problems, but that may be incorrect. It may be, perhaps, more accurate to call them “Unaddressed Problems in Chemometrics”. Calling them “unsolved” implies that attempts have been made to solve them, but those attempts were unsuccessful, possibly because these problems are too difficult, or possibly because maybe we are not smart enough. Calling them “unaddressed” on the other hand, really gets to the heart of the matter: a number of problems have come to our attention that nobody seems to be paying any heed to. It may very well turn out that some of these problems are too difficult to solve at the current state of the art in Chemometrics, and maybe we are really not smart enough, but at this point we do not know, and we will never know if nobody tries. Our attention was drawn to these problems via various routes. Some arose from our own work on various projects. Some arose from discussions in the on-line discussion group. Some have been floating around in the backs of our minds for what seems like forever, but only recently crystallized into something concrete enough to write down in a coherent manner so that it could be explained to somebody else. Answers – we have none, only questions. We bring up these points to stir up some discussion, and maybe even a little controversy, and certainly with the hope that we can prod some of our compatriots “out there” to tackle some of these. Conspicuous by its absence is the question of calibration transfer, even though we consider it unsolved in the general sense, in that there is no single “recipe” or algorithm that is pretty much guaranteed to work in all (or at least a majority) of cases. Nevertheless, not only are many people working on the problem (so that it is hardly “unaddressed”), but there have been many specific solutions developed over the years, albeit for particular calibration models on particular instruments. 
So we do not need to beat up on this one by ourselves. So what are these problems?

1) The first one we mention is the question of the validity of a test set. We all know and agree (at least, we hope that we all do) that the best way to test a calibration model, whether it is a quantitative or a qualitative model, is to have some samples in reserve, that are not included among the ones on which the calibration calculations are based, and use those samples as “validation samples” (sometimes called “test samples” or “prediction samples” or “known” samples). The question is, how can we define a proper validation set? Alternatively, what criteria can we use to ascertain whether a given set of samples constitutes an adequate set for testing the calibration model at hand? A very limited version of this question does, in fact, sometimes appear, when the question arises of how many samples from a given calibration set to keep in reserve for


the validation process. Answers range from one (at a time, in the PRESS algorithm) to half the set, and there is no objective, scientific criterion given for any of the choices that indicates whether that amount is optimum. Each one is justified by a different heuristic criterion, and there is never any discussion of the failings of any particular approach. For example, while the PRESS algorithm is appealing, it does not even test the calibration model: if anything, for n samples it tests n different models, none of which is the one to be used, and so forth. Another shortcoming of PRESS is that if each sample was read multiple times, then a computer program that simply removes one reading at a time does not remove the effect of that sample from the data.

Even so, at best any of these answers treat only one aspect of the larger question, which includes not only how many samples, but which ones? A properly taken random sample is indeed representative of the population from which it comes. So one subquestion here is, how should we properly sample? The answer is “randomly”, but how many workers select their validation samples in a verifiably random manner? How can someone tell whether their test set is valid, and against what criteria?

Some of this goes back to the original question of obtaining a proper and valid set of calibration samples in the first place, but that is a different, although related problem. We can turn that question around in the same way: what are the criteria for telling if a calibration sample set is a valid set? Maybe both problems have the same solution, but we do not know because nobody is working on either one.

But to pose the question more directly: how can we tell if any set of samples constitutes a valid test set? Even if they were chosen in a proper random manner, are there any independent tests for their validity? What characteristics should the criteria for deciding be based on, and what are the criteria to use?
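The two PRESS shortcomings noted above can be made concrete with a small sketch. It assumes a simple one-variable least-squares calibration; the function names and the duplicate-reading data are hypothetical and exist only for illustration. The `groups` argument decides what is left out on each pass: a unique label per reading gives the usual leave-one-reading-out PRESS, while a shared label per physical sample removes all readings of that sample together.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope


def press(x, y, groups):
    """Prediction error sum of squares, leaving out one group at a time.

    groups[i] labels the physical sample that reading i belongs to.
    Returns (PRESS, number of separate models that were fitted).
    """
    total, n_models = 0.0, 0
    for g in sorted(set(groups)):
        keep = [i for i, gi in enumerate(groups) if gi != g]
        left_out = [i for i, gi in enumerate(groups) if gi == g]
        b0, b1 = fit_line([x[i] for i in keep], [y[i] for i in keep])
        total += sum((y[i] - (b0 + b1 * x[i])) ** 2 for i in left_out)
        n_models += 1      # each pass fits a different model
    return total, n_models


# Hypothetical data: five samples, each read twice (values are made up).
sample_ids = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
conc = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
absorbance = [0.27, 0.28, 0.49, 0.48, 0.78, 0.79, 0.98, 0.97, 1.27, 1.28]

# One reading removed at a time: the duplicate reading stays in the model.
press_by_reading, models_by_reading = press(
    absorbance, conc, list(range(len(conc))))

# One sample removed at a time: both readings of that sample are left out.
press_by_sample, models_by_sample = press(absorbance, conc, sample_ids)

print(models_by_reading, press_by_reading)
print(models_by_sample, press_by_sample)
```

The model count makes the first point visible: ten readings produce ten different fitted models, none of which is the model fitted to all the data. The second call shows one way to remove the effect of a whole sample rather than a single reading.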
2) The next problem we bring up for discussion is the definition of “validation”. Now, we are sure there are some who will complain that we are arguing terminology rather than substance. However, we think that agreement on what terms mean has substantive consequences, especially in modern times when standards-setting organizations (e.g., ASTM) and government agencies are taking an interest in what we do. As we will see below, there is the question of the time required to validate. On the one hand, we recognize that verifying the accuracy of a given model at the time that model is created may or may not be a sufficient test of its long-term behavior, and we may need to include long-term testing procedures. On the other hand, if government agencies create regulations for how models are to be validated, which presumably they are likely to do on the basis of what we ourselves decide is required, do we want to be constrained to not being able to declare that we have created a model until months or years have passed? Such questions involve much more than terminology, especially if the government decides that “validation” is, in fact, whatever we claim it is.

As we hinted above, the most common use of the term “validation” involves simply retaining some samples separately from the main set of calibration samples and using those as a more-or-less independent test of the accuracy of the calibration model obtained. However, this definition is not universally agreed to. When the subject came up in the on-line discussion group, the following comment was made by Richard Kramer of the discussion group [1]:

The issue Howard raises is an important one. However, I disagree with his characterization of validation and with the resulting conclusion. It all depends upon


what one means by the concept of validation. If validation means the ongoing validation of a plurality of alternative models (my preferred meaning), it DOES become the means of selecting one model over others. And importantly, it permits selection of models which exhibit the best performance with respect to time-related properties such as robustness. It is not uncommon to observe that the model which initially appears to be optimum is the one whose performance degrades most rapidly as time passes. Validation over time also provides a means of gaining insight into which portions of the data might contain more confusion than information and would be best discarded. In particular, it can be interesting to look at the data residuals over time. It is not uncommon to find that the residuals in some parts of the data space increase more rapidly, over time, than the residuals in other parts of the data space. Generally excluding (or de-weighting) the former from the model can improve the model’s performance, short term and long term.

Certainly Richard raises valid points, and you can hardly fault his prescription for monitoring and improving the results. However, is that considered, or should that be considered a requirement for validation, or even a necessary part of the validation process? The response comment to Richard at the time was as follows:

I think Rich & I agree more than we disagree. If you use his definition of validation then what he says follows. However, that definition is not the one in common use – the MUCH more common definition is simply the one that tells you to separate your calibration samples & keep some out of the calibration calculations, then use those to validate. Once you’ve gone to the trouble to collect data over time then your options expand greatly. Not only can you use that data for ongoing validation, you can also include those new readings in the calibration calculations.
There are at least two ways to do this:

1) As Richard implies, one way is to gradually replace the older data with the new as it becomes available. This has been standard practice for a long time, for example in the agricultural industry, where old samples will never be seen again. A grain elevator, e.g., will never again have to measure another sample from the 1989 crop year.

2) The other obvious extension, which is more useful for the case where you may still have to measure samples with the same characteristics as the old ones, is to simply keep adding to and expanding the calibration set as new samples become available. The new samples then not only allow you to test for robustness, but inclusion of such samples will actually make the calibration more robust. I think we all know this intuitively, but I have also been able to prove this mathematically.

So validation may not only involve the time frame required to perform it, it may also involve questions of the models (or at least the number of models) being tested. So there we have it: what exactly is “validation”?

3) The next unsolved problem we bring up is the question of error in the classification of training samples when calibrating an instrument to do identification. We mentioned


this briefly in a recent column, but it is worth some more discussion. The problem appears to arise primarily in medical applications, so as a non-proprietary example, let us imagine we are interested in identifying the degree of burn of a burn victim: that is, whether the subject has a 1st, 2nd or 3rd degree burn. The distinctions are medically important, and furthermore there are qualitative differences between them despite the fact that they arise out of the quantitative difference in the amount of heat involved. In these respects this typifies other medical situations. We could take spectra of the burned areas from subjects who have been burned, but there is a certain amount of subjectivity in assigning the degree of burn in a given case, and occasionally two physicians will disagree on the designation of the degree of burn in some cases. Clearly, if they disagree, they both cannot be correct, so if we use one or the other’s diagnosis, the training classification will also occasionally be in error.

While there is certainly a progression in the intensity and severity of the burn as we go from 1st to 3rd degree burns, we cannot simply use a quantitative scale, for a number of reasons: a quantitative scale of that sort is not agreed to by all physicians; it would be, at best, highly nonlinear; and most importantly, there are real qualitative differences between tissue subjected to the different extents of damage, besides the potential quantitative ones. Because of this, a straightforward quantitative approach would not suffice, even if one could be developed. We need methods to deal with the existence of errors in the training classifications when training instruments to do automated identification.

4) The final problem we bring up is based on the question of modeling based on individual wavelengths versus full-spectrum methods and the modern variations on those themes. Basically the question can be put: “How far should we go in eliminating wavelengths?”
As we discussed in a recent column, as well as in times past, our backgrounds are from the days of pre-PCA/PLS/PCR/NN calibration modeling, and we there learned the value of wavelength-based models (principally MLR, or P-matrix as it’s sometimes called), which we only recently crystallized into something concrete enough to write down in a coherent manner so that it could be explained to somebody else. (Does that sound familiar?) The full-spectrum methods (PLS, PCR, K-matrix, etc.) have their advantages and, as we recently discussed, so do the individual-wavelength methods. The users of the full-spectrum approaches have in recent years taken an empirical, ad hoc approach to the question of wavelength elimination, finding that there was benefit to it, even if there were no explanations of the reasons for that benefit. Our initial reaction was something on the order of: why not go the whole way and eliminate all the wavelengths except those few that are needed to do the analysis (i.e., go to the limit of wavelength elimination, which essentially brings it back to MLR)? However, now that we know what the benefit of MLR-type modeling is, it is clear that eliminating all those wavelengths is counterproductive, because it throws the baby out with the bathwater, so to speak. Ideally, we should like to devise criteria for determining how many wavelengths, and which wavelengths, to keep and which to eliminate, to obtain the optimum balance between the noise-reduction capabilities of the full-spectrum methods and the linearity-maximization capabilities of the individual-wavelength approaches.


Well, there we have it: our list of current unsolved/unaddressed problems. Hop to it, readers!!!

REFERENCE
1. Chemometrics discussion group moderated by Bruce Campbell. He can be reached for information, or to join the discussion group, by sending a message to: [email protected]. New members are welcome.


29

Linearity in Calibration: Act II Scene I

When we first published our chapter “Linearity in Calibration” as an article in Spectroscopy magazine [1] we did not quite realize what a firestorm we were going to ignite, although, truth be told, we did not expect everybody to agree with us, either. But if so many actually took the trouble to send their criticisms to us, then there must also be a large “silent majority” out there that are upset, perhaps angry, and almost certainly misunderstanding what we said. We prepared responses to these criticisms, but they became so lengthy that we could not print them all in a single published column, and thus the topic is included in several smaller chapters.

At this point in our discussion, let us raise the question of the linearity of spectroscopic data as a general topic. There are a number of causes of nonlinearity that most chemists and spectroscopists are familiar with. Let us define our terms. When speaking of “linearity” the meaning of the term depends on your point of view, and your interests. An engineer is concerned, perhaps, with the linearity of detector response as a function of incident radiant energy. To a chemist or spectroscopist, the interest is in the linearity of an instrument’s readings as a function of the concentration of an analyte in a set of samples. In practice, this is generally interpreted to mean that when measuring a transparent, non-scattering sample, the response of the instrument can be calculated as some constant times the concentration of the analyte (or at least some function of the instrument response can be calculated as a constant times some other function of the concentration). In spectroscopic usage, that is normally interpreted as meaning the condition described theoretically by Beer’s Law, that is, the instrument response function is the negative exponential of the concentration:

$$I = k I_0 e^{-bC} \tag{29-1}$$

where
$I$ = the radiation passing through the sample
$k$ = the multiplying constant
$I_0$ = the radiation incident on the sample
$b$ = the product of the pathlength and absorptivity
$C$ = the concentration of the analyte.

When other types of samples are measured, the resulting data is usually known to be nonlinear (except possibly in a few special cases), so those measurements are of no interest to us here. Thus, in practice, the invocation of “linearity” implies the assumption that Beer’s Law holds, therefore discussions of nonlinearity are essentially about those phenomena that cause departures from Beer’s law.
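The parenthetical remark above, that some function of the instrument response is a constant times a function of the concentration, can be illustrated directly from equation 29-1. This is a minimal sketch with hypothetical values for k, I0, and b; the function names are ours. It uses the natural logarithm to match the exponential in equation 29-1, whereas conventional absorbance uses the base-10 logarithm, which differs only by a constant factor.

```python
import math

# Hypothetical constants, chosen only for illustration.
k, i0, b = 1.0, 100.0, 0.5


def transmitted(c):
    """Equation 29-1: I = k * I0 * exp(-b * C)."""
    return k * i0 * math.exp(-b * c)


def linearized(i):
    """Negative natural log of the transmitted fraction, which equals b * C."""
    return -math.log(i / (k * i0))


concs = [0, 1, 2, 3, 4]
signal = [transmitted(c) for c in concs]     # decays exponentially with C
response = [linearized(i) for i in signal]   # rises linearly with C

for c, i, a in zip(concs, signal, response):
    print("C=%d  I=%8.3f  -ln(I/(k*I0))=%.3f" % (c, i, a))
```

The raw signal I is not proportional to concentration, but its negative logarithm is, with slope b. Departures from that straight line are what the rest of the discussion treats as nonlinearity.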


These include

1) Chemical causes
   a) Hydrogen bonding
   b) Self-polymerization or condensation
   c) Interaction with solvent
   d) Self-interaction
2) Instrumental causes
   a) Nonlinear detector
   b) Nonlinear electronics
   c) Instrument bandwidth broad compared to absorbance band
   d) Stray light
   e) Noncollimated radiation
   f) Excessive signal levels (saturation).

Most chemists and spectroscopists expect that in the absence of these distinct phenomena causing nonlinearity, Beer’s Law provides an exact description of the relationship between the absorbance and the analyte concentration. Unfortunately the world is not so simple, and Beer’s Law never holds exactly, EVEN IN PRINCIPLE. The reason for this arises from thermodynamics.

Optical designers and specialists in heat transfer calculations in the chemical engineering and mechanical engineering sciences are familiar with the mathematical construct known as The Equation of Radiative Transfer, although most chemists and spectroscopists are not. The Equation of Radiative Transfer states that, disregarding absorbance and scattering, in a lossless optical system

$$dE = I(\lambda)\, d\lambda\, d\omega\, da\, dt \tag{29-2}$$

where
$dE$ = the differential energy transferred in differential time $dt$
$I(\lambda)$ = the optical intensity as a function of wavelength (i.e., the “spectrum”)
$d\lambda$ = the differential wavelength increment
$d\omega$ = the differential optical solid angle the beam encompasses
$da$ = the differential area occupied by the beam.

For a static (i.e., unvarying with time) system, we can recast equation 29-2 as:

$$dE/dt = I(\lambda)\, d\lambda\, d\omega\, da \tag{29-3}$$

where $dE/dt$ is the power in the beam. The application of these equations to heat transfer problems is obvious, since by knowing the radiation characteristics of a source and the geometry of the system, these equations allow an engineer, by integrating over the differential terms of equation 29-2 or equation 29-3, to calculate the amount of energy transferred by electromagnetic radiation from one place to another.

Furthermore, the first law of thermodynamics assures us that $dE/dt$ will be constant anywhere along the optical beam, since any change would require that the energy in the beam be either increased or decreased, which would require that energy be either created or destroyed, respectively.

Less obviously, perhaps, the second law of thermodynamics assures us that the intensity, $I$, is also constant along the beam, for if this were not the case, then it would be possible to focus all the radiation from a hot body onto a part of itself, increasing the radiation flux onto that portion and raising the temperature of that portion without doing work – a violation of the second law.

The constancy of beam energy and intensity has other consequences, some of which are familiar to most of us. If we solve equation 29-3 for the product ($d\omega\, da$) we get:

$$d\omega\, da = \frac{dE/dt}{I(\lambda)\, d\lambda} \tag{29-4}$$

All the terms on the right-hand side of equation 29-4 are constants; therefore, for any given wavelength and source characteristics, the product ($d\omega\, da$) is a constant, and in an optical system one can be traded off for the other. We are all familiar with this characteristic of optical systems, in the magnification and demagnification of images described by geometric optics. Whenever light is brought to a small focus (i.e., $da$ becomes small) the light converges on the focal point through a large range of angles (i.e., $d\omega$ becomes large) and vice versa. This trade-off of parameters is more obvious to us when seen through the paradigm of geometric optics, but now we see that this is a manifestation of the thermodynamics underlying it all.

We are also familiar with this effect in another context: in the fact that we cannot focus light to an arbitrarily small focal point, but are limited to what we usually call the “diffraction limit” of the radiation in the beam. This effect also comes out of equation 29-4, since there is a physical (or perhaps a geometrical) limit to $d\omega$: $d\omega$ cannot become arbitrarily large, therefore $da$ cannot become arbitrarily small. Again, we are familiar with this effect by coming across it in another context, but we see that it is another manifestation of the underlying thermodynamic reality.

Getting back to our main line of discussion, we can see from equation 29-2 (or equation 29-3) that the differential terms must all have finite values. If any of the terms $d\lambda$, $d\omega$, or $da$ were zero, then zero energy would pass through the system and we could not make any measurements. One thing this tells us, of interest to us as spectroscopists, is that we can never build an instrument with perfect resolution. The mechanistic fundamentals (quantum broadening, Doppler broadening, etc.) have been extensively discussed by one of our colleagues [2].
This effect also manifests itself in the fact that every technology has an “instrument function” that is convolved with the sample spectrum, and each instrument function is explained by the paradigms of the associated technology, but since “perfect” resolution means that $d\lambda = 0$, we see again that this is another result of the same underlying thermodynamics.

More to the point of our discussion regarding nonlinearity, however, is the fact that $d\omega$ cannot be zero. $d\omega$ is related to the concept of “collimation”: for a “perfectly collimated” beam, $d\omega = 0$. But as we have just seen, such a beam can transfer zero energy; so, just as when $d\lambda$ or $da$ is zero, a perfectly collimated beam has no energy.

Beer’s law, on the other hand, is based on the assumption that there is a single pathlength (normally represented by the variable $b$ in the equation $A = abc$) for all rays through the sample. In a real, physical, measurement system, this assumption is always false, because of the fact that $d\omega$ cannot be zero. As Figure 29-1 shows, the actual rays have pathlengths that range from $b$ (for those rays that travel “straight through”, i.e., normal to the sample surfaces) to $b/\cos(\theta_{max})$ (for the rays at the most extreme angles). We noted this effect above as item 2e in our list of sources of nonlinearity, and here we see the reason that there is a fundamental limitation.

Figure 29-1 Diagram showing the pathlength in a sample for a ray going straight through (to $I_1$) and those going at an angle (to $I_2$).

Mechanistically, the nonlinearity is caused by the fact that the absorbance for the rays traveling normally is $abc$, while for the extreme rays it is $abc/\cos(\theta_{max})$. Thus the non-normal rays suffer higher absorbance than the normal ones do, and the discrepancy (which equals $abc\,(1 - 1/\cos\theta)$) increases with increasing concentration. When the medium is completely nonabsorbing, then the difference in pathlength does not affect the measurement. When the sample has absorbance, however, it is clear that ray $I_2$ will have its intensity reduced more than ray $I_1$, due to the longer pathlength. Thus not all rays are reduced by the same amount and this leads to the nonlinearity of the measurement.

Mathematically, this can be expressed by noting that the intensity measured when a beam with a finite range of angles passes through a sample is

$$I = I_o \int_0^{\theta_{max}} e^{-bC/\cos\theta}\, d\theta \tag{29-5}$$

rather than the simpler form shown in equation 29-1 (which, we remind the reader, only holds true for “perfectly collimated” beams, which have zero energy). In practice, of course, this effect is very small, normally much smaller than any of the other sources of nonlinear behavior, and we are ordinarily safe in ignoring it, and calling Beer’s law behavior “linear” in the absence of any of the other known sources of nonlinear behavior. However, the point here is that this completes the demonstration of our statement above, that Beer’s law never exactly holds IN PRINCIPLE and that as spectroscopists we never ever really work with perfectly linear data.
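
The size of this effect can be explored numerically. The sketch below evaluates the integral of equation 29-5 by the midpoint rule and converts the result to an apparent base-e absorbance. Two things are assumptions made here for illustration and are not part of the original text: the integral is divided by the angular range so that a non-absorbing sample reads zero absorbance, and the 10 degree half-angle and the values of the product bC are arbitrary. The function name `apparent_absorbance` is likewise hypothetical.

```python
import numpy as np

def apparent_absorbance(bc, theta_max, n=2000):
    """Base-e absorbance for rays spread uniformly in angle over [0, theta_max].

    Midpoint-rule evaluation of the integral in equation 29-5, divided by
    theta_max (an assumed normalization) so that bc = 0 gives zero absorbance.
    """
    theta = (np.arange(n) + 0.5) * (theta_max / n)   # midpoints of n slices
    mean_transmittance = np.mean(np.exp(-bc / np.cos(theta)))
    return float(-np.log(mean_transmittance))

theta_max = np.radians(10.0)          # hypothetical beam half-angle
bc_values = (0.5, 1.0, 2.0, 4.0)      # hypothetical values of the product b*C

ratios = []
for bc in bc_values:
    a = apparent_absorbance(bc, theta_max)
    ratios.append(a / bc)
    print(f"bC = {bc:3.1f}   apparent A = {a:.6f}   A / bC = {a / bc:.6f}")
```

For a perfectly linear response the ratio A / bC would be the same at every concentration. Under the assumptions above it is slightly greater than one and drifts downward as bC increases, which is the small departure from Beer’s law behavior that the argument predicts.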

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 13(6), 19–21 (1998).
2. Ball, D.W., Spectroscopy 11(1), 29–30 (1996).

30 Linearity in Calibration: Act II Scene II – Reader’s Comments ...

Some time ago we wrote an article entitled “Linearity in Calibration” [1], in which we presented some unexpected results when comparing a calibration model using MLR with the model found using PCR. That column generated an active response, so we are discussing the subject in some detail, spread over several columns. The first part of these discussions has been published [2]; this chapter is the continuation of that one. In this chapter we present the responses we received to the original published article [1], in order of receipt; we will comment on them in subsequent chapters.

The first set of comments we received was from Richard Kramer:

[Howard & Jerry],

I’m afraid that this month’s Spectroscopy Column is badly off the mark (pun intended (with apologies)). The errors are two-fold with the most serious error so significant that the other error is moot.

1) If I understand the column correctly, a 1-factor model was used. Well, a single linear factor can never be sufficient to properly model a non-linear system. A minimum of 2 factors are required. The synthetic data did NOT demonstrate the advantage of a single linear wavelength over a multiple wavelength model, it merely illustrated the fact that a single linear factor is not sufficient to model non-linear data. We could stop here, but, for the sake of completeness ...

2) The second problem is that we never have the luxury of working with noise-free data. Thus, the column did not ask the right question(s). The proper question to ask is “In what ways and under which circumstances do the signal averaging advantages of multiple-wavelength models outperform or underperform with respect to a single (or n wavelength, where n is a small integer) wavelength calibration when noise is present?” The answer will depend upon the levels of noise and non-linearity and the number of wavelengths in each model.

Regards,
Richard

We went back and forth a couple of times, but rather than list each of our conversations individually, we will reserve comments until we have looked at all the comments, and then we will summarize our responses to all four respondents together, since several of these comments say the same things, to some extent.


Second, we received comments from Patrick Wiegand:

Gents,

I have always looked forward to reading your articles on Chemometrics in Spectroscopy. They are truly a valuable resource – I usually cut them out and save them for future reference. However, I think your article “Linearity in Calibration” in the June 1998 issue of Spectroscopy leads the reader to an erroneous conclusion. This conclusion results largely because of the assumptions you make about the application of PLS and PCR.

I know of no experienced practitioner of chemometrics who would blindly use the “full spectrum” when applying PLS or PCR. In the book “Chemometrics” by Beebe, Pell and Seasholtz, the first step they suggest is to “examine the data.” Likewise, Kramer in his new book has two essential conditions: The data must have information content and the information in the data must have some relationship with the property or properties which we are trying to predict. Likewise, in the course I teach at Union Carbide, I begin by saying that “no modeling technique, no matter how complex, can produce good predictions from bad data.”

In your article, you appear to be creating an artificial set of circumstances:

1) You start with a “perfectly noise-free spectrum”
2) You create an excessively high degree of non-linearity which would never be tolerated by an experienced spectroscopist.
3) You assume the spectroscopist will use the entire spectrum blindly when applying PLS or PCR, even though some parts of the spectrum clearly have no information and other parts are clearly nonlinear.
4) You limit the number of factors for PLS/PCR to 1, even though the number of latent variables must be greater, due to the nonlinearity.

In regards to number 1, by using a perfectly noise-free spectrum, you have eliminated the main advantage of PLS/PCR. That is, the whole point of using these techniques is that they have better ability to reject noise than MLR. To come to an adequate conclusion as to the best performer, you should at least add an amount of random noise an order of magnitude greater than normal, since the amount of nonlinearity you use is an order of magnitude greater than normal.

Number 2 – I understand that you wanted to use a high degree of nonlinearity so that the absorbance vs. concentration plot will be nonlinear to the naked eye, but you can’t really expect to use this degree of nonlinearity to make a judgmental comparison between two techniques if it is not realistic that it will ever occur in real life.

Number 3 – There are many well-established techniques for choosing which wavelength regions to use when modeling with PLS/PCR. First, I advise people to make sure that the pure component spectrum actually has a band in the location being modeled. If this is not possible, at least only include regions that look like
valid bands – no sense in trying to include low s/n baseline regions. Plots of a linear correlation coefficient vs. wavelength for the property of interest are also useful in choosing the right regions to include in the model. Finally, if the initial model is built using the full-spectrum, an examination of factor plots would reveal areas in which there is no activity.

Number 4 – In cases where there is no choice but to deal with nonlinearity in the spectra, then it will be necessary to use more factors than the number of chemical species in the system. Once again, an experienced practitioner will use other ways of choosing the right number of factors, like a PRESS plot, etc.

Thus your conclusion – that MLR is more capable of producing accurate models than PLS/PCR – is based on a contrived set of circumstances that would not occur in reality, especially when the chemometrician/spectroscopist is experienced.

It would be very interesting also, since the performance of the models presented are so similar, to see how the performance would be affected by noise, drift, etc. which are always present in actuality. I would not be surprised if PLS/PCR outperformed MLR under those circumstances.

All of the above would seem to indicate that I am totally against using MLR. This is not the case. In my practice, I always try the simplest approach first. This means first trying MLR. If that does not work, then I use PLS. If that does not work – well, some people may use neural networks, but I have not yet found a need to do so. I think you are right in saying that there has been a lot of hype over PLS (although not as much as there has been over neural nets!) In many cases MLR works great, and I will continue to use it. To paraphrase Einstein, “Always use the simplest approach that works – but no simpler.”

The third set of comments we received was from Fred Cahn:

I read your article in Spectroscopy (13(6), June 1998) with interest. However, I don’t agree with the conclusions and the way your simulation was carried out and/or presented.

While I am no longer working in this field, and cannot easily do simulations, I think that a 2 factor PCR or PLS model would fully model the simulated spectra. At any wavelength in your simulation, a second degree power series applies, which is linear in coefficients, and the coefficients of a 2 factor PCR or PLS model will be a linear function of the coefficients of the power series. (This assumes an adequate number of calibration spectra, that is, at least as many spectra as factors and a sufficient number of wavelengths, which the full spectrum method assures.) The PCR or PLS regression should find the linear combination of these PCR/PLS coefficients that is linear in concentration.

See my publication: Cahn, F. and S. Compton, “Multivariate Calibration of Infrared Spectra for Quantitative Analysis Using Designed Experiments”, Applied Spectroscopy, 42:865–872 (July, 1988).


Fred supplied a copy of the cited paper, and we read it. Again, the comments about it will be included among the general comments.

And finally, the fourth set of comments we received was from Paul Chabot:

Hello,

I recently read your column in the Spectroscopy issue of June 1998, which was dealing with “Linearity in Calibration”. First, I have to tell you that I really like your monthly column. You do a good job at explaining the basics and more of many topics related to chemometrics, and “demystify” the subjects.

As an avid user of PLS, I was concerned when you were comparing MLR to PLS and PCR on your synthetic data set. Even though I agree with you that in some cases, MLR is a much better approach than PLS or PCR, sometimes the use of a full spectrum technique is essential. In this particular case, I do not doubt your results showing that MLR outperforms the full spectrum techniques because the data set was designed to do so. But out of the full spectrum techniques, I would expect PLS to outperform PCR, and the loading of the first principal component to be mostly located around the lower wavelength peak for PLS. Did you notice any difference between PCR and PLS on this data set? I would appreciate it if you could let me know if you tried both approaches and the results you obtained so I don’t have to regenerate the data.

Thank you very much, and keep up the good work,
Paul Chabot

To summarize the comments (including ones presented during subsequent discussions, and therefore not included above):

1) Richard Kramer, Patrick Wiegand, and Fred Cahn felt that we should have tried two factors.
2) Richard Kramer and Patrick Wiegand thought we should have added simulated noise to the data.
3) All four responders indicated that we should have tried PLS.
4) Richard Kramer, Patrick Wiegand, and Paul Chabot indicated that one PLS factor might do as well as one wavelength.
5) Richard Kramer and Patrick Wiegand thought that our conclusion was that MLR is better than PCA.

As stated in the introduction to this chapter, we present our responses in chapters to follow.

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 13(6), 19–21 (1998).
2. Mark, H. and Workman, J., Spectroscopy 13(11), 18–21 (1998).

31 Linearity in Calibration: Act II Scene III

In Chapter 27, we discussed a previously published paper entitled “Linearity in Calibration” [1]. In that chapter and the original paper we presented some unexpected results when comparing a calibration model using MLR with the model found using PCR. That chapter, when first published as an article, generated a rather active response, so we are discussing the subject and responding to the comments received in some detail, spread over several chapters. The first two parts of our response were included as Chapters 29 and 30, which refer to the papers published as [2, 3]; this Chapter 31 is the continuation of those.

We ended Chapter 30 with a summary of the comments received regarding the previous “Linearity in Calibration” paper. We therefore pick up where we left off by starting this chapter with that same summary (naturally, anyone who wishes to read the full text of the comments will have to go back and reread Chapter 30, derived from reference [3]):

1) Richard Kramer, Patrick Wiegand, and Fred Cahn felt that we should have tried two factors.
2) Richard Kramer and Patrick Wiegand thought we should have added simulated noise to the data.
3) All four responders indicated that we should have tried PLS.
4) Richard Kramer, Patrick Wiegand, and Paul Chabot indicated that one PLS factor might do as well as one wavelength.
5) Richard Kramer and Patrick Wiegand thought that our conclusion was that MLR is better than PCA.

In addition, each of the responders had some individual comments of their own; we discuss all these below. We now continue with our responses and discussion of these comments.

It may surprise some to hear this, especially in light of some of the comments we make below, but we agree with the responders more than we disagree. We also believe, for example, in pre-screening the data, at least as strongly as Patrick Wiegand does, and we accept his comments regarding the way all (or at least, let’s hope all) experienced chemometricians approach a problem. Indeed, fully half the book that one of us authored [4] was spent on just that point: how to “look at the data”.

However, our experience in the “real world” (as some like to call it) of instrument manufacturers has given us a somewhat different slant on the reality of what actually happens when users get hold of a new super-whiz-bang calculation package.

In many years of experience in the NIR applications department at Technicon Instruments, there was about an hour and a half available to teach both theory and practice of calibration to each group of new users; the rest of the training time was spent teaching the students how to set the instrument up, prepare samples, take reproducible readings,
and learn the rest of the mechanics needed to run the instrument, take readings, and collect the data. How much attention do you think could be paid to the finer points? This seems to be typical of what happens in the majority of cases involving novice users, and it is rare that there is anyone “back at the plant” who can pick up the ball and take them any further.

Even experienced practitioners can be misled, however. As was pointed out, real data contains various types and amounts of variations in both the X and Y variables. Furthermore, in the usual case, neither the constituent values nor the optical readings are spaced at nice, even, uniform intervals. Under such circumstances, it is extremely difficult to pick out the various effects that are operative at the different wavelengths, and even when the data analyst does examine the data, it may not always be clear which phenomena are affecting the spectra at each particular wavelength.

Now we will respond to the various comments, and make some more observations of our own. We will re-quote the pertinent parts of the communications from the responders, collecting together those on a similar topic, and comment on them collectively. Note that some of these quotes were from later messages than those quoted in our previous column, because they were generated during subsequent discussions, and so may not have appeared previously.

We hope nobody takes our reply comments personally. Both some of the comments and some of our responses are energetic, because we seem to have touched on a subject that turned out to be somewhat controversial. So we do not take the responders’ comments personally, but we do enter with zest and gusto into what looks like something turning into a rather lively debate, and we sincerely hope that everybody can take our own comments in that same spirit.
The format of this column is as follows: each numbered section starts with the comments from the various responders dealing with a given aspect of the subject, followed by our response to them collectively. So now let us consider the various points raised, starting with the use of noise-free data:

1) “You start with a ‘perfectly noise-free spectrum’ ” (Patrick Wiegand)

“In regards to number 1, by using a perfectly noise-free spectrum, you have eliminated the main advantage of PLS/PCR. That is, the whole point of using these techniques is that they have better ability to reject noise than MLR. To come to an adequate conclusion as to the best performer, you should at least add an amount of random noise an order of magnitude greater than normal, since the amount of nonlinearity you use is an order of magnitude greater than normal.” (Patrick Wiegand)

“The second problem is that we never have the luxury of working with noise-free data. Thus, the column did not ask the right question(s). The proper question to ask is ‘In what ways and under which circumstances do the signal averaging advantages of multiple-wavelength models outperform or underperform with respect to a single (or n wavelength, where n is a small integer) wavelength calibration when noise is present?’ The answer will depend upon the levels of noise and nonlinearity and the number of wavelengths in each model.” (Richard Kramer)

“It isn’t a case of ‘extreme difficulty’. It is a situation where, in one case you use a factor which happens to be based upon an explicit model (i.e. linearity) which is correct
for the data while stacking the deck against the second case by denying any opportunity to be correct.” (Richard Kramer)

Response: Of course we used noise-free data. Otherwise we could not be sure that the effects we see are due to the characteristics we impose on the data, rather than the random effects of the noise. When anyone does an actual, physical experiment and takes real readings, the noise level or the signal-to-noise ratio is a consideration of paramount importance, and any experimenter normally takes great pains to reduce the noise as much as possible, for just that reason. Why shouldn’t we do the same in a computer experiment?

On the other hand, PCA and PLS are both known to perform better than MLR when the data is noisy because of the inherent averaging that they include. In this we agree fully; indeed, we also mentioned this characteristic in Chapter 27, as well as in the original column. Richard Kramer hit the nail on the head with his question “In what ways ...?” The important question, then, that needs to be asked (and answered) is, at what point does one phenomenon or the other become dominant, so as to control or determine which algorithm will provide a better model? The next important question is, how can we tell which phenomenon is dominant in any particular case?

Rich Kramer also had the insight to go to the next step, and realized that the only way to determine whether the nonlinearity is “small” or “large” is by having something to compare to, and the natural characteristic to compare it to is the noise. On this score we also agree with Richard and Patrick fully, and this is one place where much research is needed (there are others; and we will get to them in due course): How do you compare the systematic behavior of nonlinearity with the random behavior of noise?
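
As a purely illustrative aside, one crude way to put nonlinearity and noise on the same scale is to compare the RMS departure of a band from its best straight-line fit with the standard deviation of the noise. The sketch below does this for a single wavelength. The stray-light model, the 5% stray-light fraction, the concentrations and the noise level are all assumptions chosen for illustration, the function name `nonlinearity_rms` is hypothetical, and the resulting ratio is not offered as an answer to the research question just raised.

```python
import numpy as np

def nonlinearity_rms(conc, absorbance):
    """RMS residual of the best straight-line fit of absorbance on concentration."""
    slope, intercept = np.polyfit(conc, absorbance, 1)
    residual = absorbance - (slope * conc + intercept)
    return float(np.sqrt(np.mean(residual ** 2)))

# Hypothetical calibration set at one wavelength.
conc = np.linspace(1.0, 10.0, 10)
true_abs = 0.1 * conc                      # ideal Beer's-law absorbance

# Assumed stray-light model: a fraction `stray` of the incident light
# reaches the detector without passing through the sample.
stray = 0.05
observed = -np.log10((10.0 ** -true_abs + stray) / (1.0 + stray))

noise_sd = 1e-4                            # assumed absorbance noise level

systematic = nonlinearity_rms(conc, observed)
ratio = systematic / noise_sd
print(f"RMS departure from linearity: {systematic:.5f}")
print(f"ratio to assumed noise level: {ratio:.1f}")
```

A ratio well above one suggests that, for these assumed values, the systematic curvature would dominate the random error; a ratio well below one suggests the opposite.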
The standard application of the science of Statistics provides us with tools to detect systematic effects, but how do we go to the next step and ascertain their relative effects on calibration models? These are among the fundamental behavioral properties of calibrations that are not being investigated, but need to be.

There are important theoretical reasons to reduce the spectral noise when doing calibrations. Nevertheless, if the main advantage of PLS is its behavior in the presence of noisy data (as Patrick Wiegand states), that is poor praise indeed. Noise levels of modern instruments are far below those of the past. In some cases, and NIR instruments come to mind here, the noise levels are so low that they are tantamount to having “zero noise” to start with. This improvement in instrumentation is a good thing, and we sincerely doubt that anybody would recommend using a noisy instrument for the sole purpose of justifying a more sophisticated algorithm.

In any case, even if all the above statements are 100% true, they do not affect our discussion because they are beside the point. The behavior of calibration algorithms in the face of noisy data is an important topic and perhaps should be studied in depth, but it was not at issue in the “Linearity in Calibration” column.

2) “You create an excessively high degree of nonlinearity which would never be tolerated by an experienced spectroscopist.” (Patrick Wiegand)

Response: In the absence of random variation, ANY amount of nonlinearity would give the same results, and if we used less, any differences from the results we presented would be only of degree, not of kind. Any amount of nonlinearity is infinitely greater
than zero. As we explained in the original column, we deliberately chose an unrealistically large amount of nonlinearity for pedagogical purposes; what would be the point of comparing different calibration lines that the naked eye saw as equally straight? The fact that it is “unrealistically” large is immaterial.

3) “You assume the spectroscopist will use the entire spectrum blindly when applying PLS or PCR, even though some parts of the spectrum clearly have no information and other parts are clearly nonlinear.” (Patrick Wiegand)

Response: Above, we described the situation as we see it, regarding the traps that both experienced and novice users of these very sophisticated algorithms can fall into. Keep in mind the pedagogy involved as well as the chemometrics: by suitable choice of values for the “constituent”, the peaks at the nonlinear wavelengths could have been made to appear equally spaced, and the linear wavelengths appear stretched out at the higher values. The “clarity” of the nonlinearity is due to the presentation, not to any fundamental property of the data, and this clarity does not normally exist in real data. How is someone to detect this, especially if not looking for it? Attempts to address this issue have been made in the past (see [5]) with results that in our opinion are mixed, at best. And that simulated data was also noise-free.

With real data, a more scientifically valid approach would be to correct the nonlinearity from physical theory. In the current case, for example, a scientifically valid approach would be to convert the data to transmission mode, subtract the stray light and reconvert to absorbance: the nonlinear wavelengths would have become linear again.
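
The correction just described can be sketched in a few lines. The stray-light model used here, in which a fraction `stray` of the incident light reaches the detector in both the sample and reference measurements, is an assumption made for illustration and is not necessarily the model used to generate the original data; the function names and numerical values are likewise hypothetical.

```python
import numpy as np

def add_stray_light(absorbance, stray):
    """Distort ideal absorbance values with an assumed stray-light fraction."""
    transmittance = 10.0 ** (-np.asarray(absorbance, dtype=float))
    return -np.log10((transmittance + stray) / (1.0 + stray))

def remove_stray_light(observed_absorbance, stray):
    """Convert to transmittance, subtract the stray light, reconvert to absorbance."""
    observed_t = 10.0 ** (-np.asarray(observed_absorbance, dtype=float))
    corrected_t = observed_t * (1.0 + stray) - stray
    return -np.log10(corrected_t)

conc = np.linspace(1.0, 10.0, 10)     # hypothetical constituent values
ideal = 0.1 * conc                    # absorbance that obeys Beer's law
stray = 0.05                          # assumed (and here, known) stray-light fraction

observed = add_stray_light(ideal, stray)
corrected = remove_stray_light(observed, stray)

print("largest distortion before correction:", np.max(np.abs(observed - ideal)))
print("largest error after correction:      ", np.max(np.abs(corrected - ideal)))
```

The round trip recovers the linear values only because the same stray-light fraction is used in both directions, that is, because the source and size of the nonlinearity are known in advance.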
There are, of course, several things wrong with this procedure, all of them stemming from the fact that this data was created in a specific way for a specific purpose, not necessarily to be representative of real data:

a) You would have to know a priori that only certain wavelengths (and which ones) were subject to the “stray light” or whatever source of nonlinearity was present.

b) One of the problems of current chemometric practice is the “numbers game” aspect. No matter how soundly based in physical theory a procedure is, if the numbers it produces are not as good (whatever that might mean in a specific case) as a different, more empirical, procedure, the second procedure will be used, no matter how empirical its basis. The counter-argument to that, of course, is something on the order of “Well, we have to get as good results as we can for the user” and there is a certain amount of legitimacy to this statement. However, we know of no other field of scientific study where a situation of this sort is tolerated. Certainly, every field has areas of unknown effects where not all the fundamental physical theory is available, but in all fields other than chemometrics, there are workers investigating these dark areas, to try to fill in the missing knowledge. In chemometrics, on the other hand, for at least the 22 years we have been involved with the field, all we have seen the workers in the field doing is building bigger and higher and more fanciful mathematical superstructures on foundations that few, if any of them, seem to be aware of. We will have more to say about this below.

c) Sometimes the nature of the correct physical theory to use is simply unknown.

d) Finally, the real reason we presented these results the way we did was that the whole purpose of the exercise was to study the effect of this type of variation of
the data, so that simply removing it would not only be trivial, it would also be a counterproductive procedure.

4) “If I understand the column correctly, a 1-factor model was used. Well, a single linear factor can never be sufficient to properly model a non-linear system. A minimum of 2 factors are required.” (Richard Kramer)

“PLS should have, in principle, rejected a portion of the non-linear variance resulting in a better, although not completely exact, fit to the data with just 1 factor. The PLS does tend to reject (exclude) those portions of the x-data which do not correlate linearly to the y-block.” (Richard Kramer)

“You limit the number of factors for PLS/PCR to 1, even though the number of latent variables must be greater, due to the nonlinearity.” (Patrick Wiegand)

“In principle, in the absence of noise, the PLS factor should completely reject the non-linear data by rotating the first factor into orthogonality with the dimensions of the x-data space which are ‘spawned’ by the nonlinearity. The PLS algorithm is supposed to find the (first) factor which maximizes the linear relationship between the x-block scores and the y-block scores. So clearly, in the absence of noise, a good implementation of PLS should completely reject all of the nonlinearity and return a factor which is exactly linearly related to the y-block variances.” (Richard Kramer)

“While I am no longer working in this field, and cannot easily do simulations, I think that a 2 factor PCR or PLS model would fully model the simulated spectra.” (Fred Cahn)

“My “objection” is that you did not seem to look at the 2nd factor, which I think is needed to accurately model the spectra after the background is added.” (Fred Cahn)

“I would expect PLS to outperform PCR, and the loading of the first principal component to be mostly located around the lower wavelength peak for PLS.” (Paul Chabot)

Response: Yes, but: the point is that, as our conclusions indicate, this is one case where the use of latent variables is not the best approach. The fact remains that with data such as this, one wavelength can model the constituent concentration exactly, with zero error – precisely because it can avoid the regions of nonlinearity, which the PCA/PLS methods cannot do. It is not possible to model the “constituent” better than that, and even if PLS could model it just as well (a point we are not yet convinced of since it has not yet been tried – it should work for a polynomial nonlinearity but this nonlinearity is logarithmic) with one or even two factors, you still wind up with a more complicated model, something that there is no benefit to.

Richard Kramer suggested that we use two wavelengths (with the MLR approach) to see what happens. Well, here’s what happens: if the second wavelength is also on the linear absorbance band, you get a “divide by zero” error upon performing the matrix inversion due to the perfect collinearity between the data at the two wavelengths.
If the second wavelength is on the nonlinear band, the regression coefficient calculated for it is exactly zero (at least to 16 digits, where the computer truncation error becomes important), since it plays exactly no role in the modeling. In other words, not only is it unnecessary to add a second wavelength to the model, it is impossible to do so if you try; when the model is perfectly correct you can’t force a second wavelength into that model even if you want to. Richard Kramer, Patrick Wiegand, and Paul Chabot suggested that a one-factor PLS model should reject the data from the nonlinear wavelength and therefore also provide a perfect fit to the “constituent”. I offered to provide the data as an EXCEL spreadsheet to these responders; Paul accepted the offer, and I e-mailed the data to him. We will see the results at an appropriate stage. 5) “There are many well-established techniques for choosing which wavelength regions to use when modeling with PLS/PCR. First, I advise people to make sure that the pure component spectrum actually has a band in the location being modeled.” (Patrick Wiegand) Response: That indeed is a good procedure when you can do it (keeping in mind our earlier discussion regarding users’ reactions to the case of a conflict between theoretical correctness and the experimental “numbers game”), and we also make the same recommendation when appropriate. If anything, proper wavelength choice is even more important when using MLR than with either PCA or PLS. But what do you do when the “constituent” is a physical property, with no distinct absorbance band? This consideration becomes particularly pernicious when that property is not itself being calibrated for, but is a variation superimposed on the data that needs a factor (or wavelength) to compensate for it, yet has no absorbance band of its own. The prototype example of this is the “repack” effect found when the measurements are made by diffuse reflectance: “Repack” does not have an absorbance band. Other situations arise where that approach fails: when the chemistry is unknown or too complicated (octane rating in gasoline, for example).
Here again, even though a fair amount is known about the chemistry behind octane rating, there is no absorbance band for “octane value”. Another case is where the chemistry is known, but the spectroscopy is unknown, because the pure material is not available. Protein, for example, cannot be extracted from wheat (or at least not while still remaining protein), so the spectrum of “pure” protein as it exists in wheat is unknown. Even simpler molecules are subject to this effect: we can measure the spectrum of pure water easily enough, for example, but that is not the same spectrum as water has when it is present as an intimate mixture in a natural product – the changes in the hydrogen bonding completely change the nature of the spectrum. And these examples are ones we know about! 6) “Finally, the calibration statistics presented in Table 27-1 show a correlation coefficient of 0.9996 for PCR, even when an obviously nonlinear region is used! I am not sure if this is significantly different from the one shown for MLR using only the linear region. To me either model would be acceptable at the stage of method development where the article ended. Besides, it is unlikely that someone would be able to know a priori that the linear region was the better region to use for MLR.” (Patrick Wiegand) Response: As a purely practical matter, we agree with that interpretation. However, we hope that by now we have convinced you that we are trying to do more than that – we are trying to find out what really goes on inside the “black boxes” of chemometric
calculations. The fact that the value of the PCR correlation coefficient differs significantly from unity becomes clear when you look at the other term of the ANOVA equation: in the MLR case the sum-squared error is zero, in the PCR case it is “infinitely” greater than that. Don’t forget that “significance”, at least in the statistical sense, is defined only when dealing with random variables. This also relates to the earlier comment regarding how to find ways to compare the relative effects of noise and nonlinearity on calibration models. 7) “It would be very interesting also, since the performance of the models presented are so similar, to see how the performance would be affected by noise, drift, etc. which are always present in actuality. I would not be surprised if PLS/PCR outperformed MLR under those circumstances.” (Patrick Wiegand) Response: Yes, it certainly would be most interesting to investigate this question. This is closely related to the previous discussion concerning the relationship between noise and nonlinearity, so I would modify the statement of the problem to “At what point does one or another effect dominate the behavior of the calibration?” that is, where is the crossover point? Investigating questions of this sort is called “research”, and a more fundamental question arises: why isn’t anybody doing such investigations? Other, related, questions are also important: Having determined this in isolation, how does the data analyst determine this in real data, where unknown amounts of several effects may be present? There is a similarity here to Richard’s earlier point regarding the relationship between the amount of noise and the amount of nonlinearity. Here are more fertile areas for research into the behavior of calibration models. 
8) “At any wavelength in your simulation, a second degree power series applies, which is linear in coefficients, and the coefficients of a 2 factor PCR or PLS model will be a linear function of the coefficients of the power series. (This assumes an adequate number of calibration spectra, that is, at least as many spectra as factors and a sufficient number of wavelength, which the full spectrum method assures.) The PCR or PLS regression should find the linear combination of these PCR/PLS coefficients that is linear in concentration.” (Fred Cahn) Response: We have read the indicated section of that paper [6], and scanned the rest of it. We agree with much of what it says, both in the paper and in Fred Cahn’s messages, but we are not sure we see the relevance to the column. Certainly, nonlinearities in real data can have several possible causes, both chemical (e.g., interactions that make the true concentration of a given species different from what would be expected, or calculated, solely from what was introduced into a sample; interactions can also change the underlying absorbance bands, to boot) and physical (such as the stray light that we simulated). Approximating these nonlinearities with a Taylor expansion is a risky procedure unless you know a priori what the error bound of the approximation is, but in any case it remains an approximation, not an exact solution. In the case of our simulated data, the nonlinearity was logarithmic, thus even a second-order Taylor expansion would be of limited accuracy. Alternative methods, such as correcting the nonlinearity through the application of an appropriate physical theory as we described above, may do as well as or even better than a Taylor series approximation, but a rigorous theory is not always available. Even in cases where a theory exists, often the physical conditions for which the theory is valid cannot be achieved; we demonstrated this in the discussion in Chapters 29 and 30 of the fundamental impossibility of truly achieving “Beer’s Law linearity”. Thus we are left with a situation where even in the best cases we can achieve, there can be residual non-linearities in the data. The purpose of our column was to investigate the behavior of different modeling methods in the face of nonlinearity. 9) “Thus, my interest in 2 or more factor chemometric models of your simulation is in line with this view of chemometrics. I agree with the need for better physical understanding of instrument responses as well as of the spectra themselves. I would not choose PCR/PLS or MLR to construct such physical models, however.” (Fred Cahn) Response: We were not trying to use the chemometric techniques to create a physical model in the column. We also agree that physical models should be created in the traditional manner, based on the study of the physical considerations of a situation. Ideally you would start from a fundamental physical law and derive, through logic and mathematics, the behavior of a particular system: this is how all other fields of science work. A chemometric technique then would be used only to ascertain the value (from a series of physical measurements) of an unknown parameter that the mathematical derivation created. What we were trying to do in the column was to ascertain the behavior of a mathematical (not physical!) system in the face of a certain type of (simulated) physical behavior. There is nothing wrong with trying to come up with empirical methods for improving the practical performance of chemometric calibration, but one of the philosophical problems with the current state of chemometrics is that nobody is trying to do anything else, that is, to determine the fundamental behavior of these mathematical systems.
10) “The synthetic data did NOT demonstrate the advantage of a single linear wavelength over a multiple wavelength [sic] model ” (Richard Kramer) “ in one case you use a factor which happens to be based upon an explicit model (i.e. linearity) which is correct for the data while stacking the deck against the second case by denying any opportunity to be correct.” (Richard Kramer) “In your article, you appear to be creating an artificial set of circumstances: ” (Patrick Wiegand) “Thus your conclusion – that MLR is more capable of producing accurate models than PLS/PCR – is based on a contrived set of circumstances that would not occur in reality, especially when the chemometrician/spectroscopist is experienced.” (Patrick Wiegand) Response: Artificial? Contrived? Only insofar as any experimental study is based on a “contrived” set of circumstances – contrived to enable the experimenter to separate the phenomenon of interest and study its effects, with “everything else the same”. But that is a minor matter. Richard and Patrick (and how many others, who didn’t respond?) believe that we concluded that “MLR is better than PCA/PLS”. The really critical point here is that that is NOT our conclusion, and anyone who thinks this has misunderstood us. We put the fault for this on ourselves, since the one thing that is clear is that we did not explain ourselves sufficiently.

Therefore let us clarify the point here and now: we are not fighting a “holy war” against PCA/PLS etc. The purpose of the exercise was NOT to “prove that MLR with wavelength selection is better”, but to investigate and explain conditions that cause that to be so, when it happens (which it does, sometimes). As we discussed in the original column, more and more discussions about calibration processes, both oral and in the literature, describe situations where wavelength selection improved the results (in PCR and PLS as well as MLR), but there has previously been no explanation for this phenomenon. Therefore we decided to investigate nonlinearity since we suspected that to be a major consideration, and so it turned out to be. We continue our discussion in the following chapters.
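The two-wavelength MLR experiment from the response to comment 4, and the MLR-versus-PCR error comparison from the response to comment 6, can be sketched numerically. The block below is a minimal sketch under stated assumptions and does not use the authors' data set. The triangular band shapes, the concentration values, the 5% stray-light level, and all function and variable names are hypothetical, chosen only so that one band is exactly linear in concentration while the other is distorted logarithmically.

```python
import numpy as np

# Everything below is an illustrative assumption, not the authors' data set:
# band shapes, concentrations, and the 5% stray-light level are made up.
idx = np.arange(100)                      # wavelength index
conc = np.linspace(0.1, 1.0, 10)          # "constituent" values

def band(center, half_width=10):
    """Triangular band of unit height, exactly zero outside its window."""
    return np.clip(1.0 - np.abs(idx - center) / half_width, 0.0, None)

stray = 0.05
linear_part = np.outer(conc, band(30))                    # obeys Beer's law
ideal = np.outer(conc, 1.5 * band(70))                    # what band 2 should read
nonlinear_part = -np.log10((10.0 ** (-ideal) + stray) / (1.0 + stray))
X = linear_part + nonlinear_part                          # one spectrum per row

def fit(design, y):
    """Least squares with an intercept; returns coefficients, fitted values, SSE."""
    A = np.column_stack([np.ones(len(y)), design])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    fitted = A @ coef
    return coef, fitted, float(np.sum((y - fitted) ** 2))

# MLR, one wavelength at the peak of the linear band.
_, _, sse_mlr = fit(X[:, 30], conc)

# One-factor PCR: regress on the first principal-component score.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
_, fitted_pcr, sse_pcr = fit(U[:, 0] * s[0], conc)
r_pcr = float(np.corrcoef(fitted_pcr, conc)[0, 1])

# MLR, two wavelengths on the linear band: the design matrix loses rank.
A_collinear = np.column_stack([np.ones(len(conc)), X[:, [30, 35]]])
rank_collinear = int(np.linalg.matrix_rank(A_collinear))

# MLR, second wavelength on the nonlinear band: its coefficient vanishes.
coef_mixed, _, _ = fit(X[:, [30, 70]], conc)

print(f"MLR SSE (1 wavelength): {sse_mlr:.3e}")
print(f"PCR SSE (1 factor):     {sse_pcr:.3e}, r = {r_pcr:.6f}")
print(f"rank with 2 linear wavelengths: {rank_collinear} of 3 columns")
print(f"coefficient on nonlinear wavelength: {coef_mixed[2]:.3e}")
```

In this sketch the rank deficiency is detected with `matrix_rank` rather than by attempting the matrix inversion, because a floating-point inversion of an exactly collinear system may fail or may silently return meaningless numbers, depending on rounding. The one-factor PCR correlation can look excellent while its sum-squared error remains many orders of magnitude above the MLR value, which is the comparison made in the response to comment 6.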

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 13(6), 19–21 (1998).
2. Mark, H. and Workman, J., Spectroscopy 13(11), 18–21 (1998).
3. Mark, H. and Workman, J., Spectroscopy 14(1), 16–17 (1999).
4. Mark, H., Principles and Practice of Spectroscopic Calibration (John Wiley & Sons, New York, 1991).
5. Mark, H., Applied Spectroscopy 42(5), 832–844 (1988).
6. Cahn, F. and Compton, S., Applied Spectroscopy 42, 865–872 (1988).

32

Linearity in Calibration: Act II Scene IV

This chapter continues the discussion started by the responses we received to Chapter 27 when it was first published as a paper entitled “Linearity in Calibration” [1]. So far our discussion has extended over three previous chapters (29 through 31) whose original published citations are given in references [2–4]. In Chapter 31, originally referenced as [4], we stated, “we are not fighting a ‘holy war’ against PCA/PLS etc.” and then went on to discuss what our original column was really about. However, if there is a “holy war” being fought at all, then from our point of view it is against the practice of simply accepting the results of the computer’s cogitations without attempting to understand the underlying phenomena that affect the behavior of the calibration models, regardless of the algorithm used. This has been our “fight” since the beginning – which can be verified by going back and rereading our very first column ever [5]. The authors do not always agree, but we do agree on the following: it is incomprehensible how a person calling himself a scientist can fail to wonder WHY calibration models behave the way they do, and fail to try to relate their behavior to the properties of the data giving rise to them. There are reasons for everything that happens, whether we know what those reasons are or not, and the goal of science is to determine what those underlying reasons or principles are. At least that is the goal of every other field of scientific endeavor that we are aware of – why is Chemometrics exempt? Real data, as we have seen, is far too complicated to work with to try to obtain fundamental understanding, just as the physical world is often too complicated to study directly in toto.
Therefore work such as was presented in the “Linearity in Calibration” chapter is needed, creating a simplified system where the characteristic of interest can be isolated and studied – just as physical experiments often work with a simplified portion of the physical world for the same reason. This might be categorized as “Experimental Chemometrics”, controlling the nature of the data in a way that allows us to relate the properties of the data to the behavior of the model. Does this mimic the “real world”? No, but it does provide a window into the inner workings of the calibration calculations, and we need as many such windows as we can get. We will go so far as to make an analogy with Chemistry itself. The alchemists of old had an enormous empirical knowledge base, and from that could do all manner of useful things. But we do not consider alchemy a science, and it did not become a science until the underlying principles and phenomena were discovered and codified in a way that all could use. The current state of Chemometrics is more nearly akin to alchemy than Chemistry: we can do all manner of useful things with it, but it is all empirical and there are still many areas where even the most expert and prominent practitioners treat it as a “black box” and make no attempt to understand the inner workings of that black box.

Empiricism is important and even necessary, but hardly sufficient. The ultimate test of whether something is scientific is its ability to predict – and that does NOT mean SEP!! The irony of the situation is that a good deal of basic knowledge is available. The field of Chemometrics bypasses all the Statistical basics and jumps right into the heavy-duty sophisticated algorithms: everybody just wants to start running before they can even crawl. We commented on this situation in earlier Chapters 29–31 and previous publications [6], and what response we received was on the order of “Why was so much space wasted before getting to the important part?” It is certainly unfortunate that the portion of the discussion that was perceived as “wasted space” was the important part, but was not recognized as such. The early foundations of Statistics go back to the 1600s or so, to the time when probability theory was recognized as a distinct branch of mathematics. The current problem is that nobody seems to apply the knowledge gained over the intervening span of time, or to be interested in applying that knowledge, or to do fundamental investigations at all. The chemometric community completely ignores the previous mathematical basis underlying its structure. The science of Statistics does, in fact, form a firm foundation that Chemometrics is built on. It is almost shameful that the modern Chemometrics community seems to be content to build ever higher and fancier superstructures on a foundation that is solid enough, but to which it is hardly connected. Worse, there seems to be an active antipathy to such investigations: just look at the firestorm we aroused by publishing a very small and innocuous study of the fundamental behavior of a particular data system!
In fact, from the response, you would almost think we committed heresy or attacked religious beliefs, in daring to suggest that PCR/PLS was not always the best way to go, much less do some serious research on the subject. Everybody gives lip service to the concept of “fundamental research is good for the long run”, but nobody seems interested in putting that concept into practice, even with the possibility of fairly short-term returns. Let us look at a couple of examples. In reference [7] we found the following passage: “But, it would be dangerous to assume that we can routinely get away with extrapolation of that kind. Sometimes it can be done, sometimes it can’t. There is no simple rule that can tell us which situation we might be facing.” (see p. 129 in [7]). And that passage seems to sum up the current state of affairs. Theoretically, a good straight line should be extrapolatable almost indefinitely, yet we all know how risky it is to extrapolate even a little bit beyond the range of our data. Why does practice not conform to theory? The obvious answer is that something is nonlinear. But why can we not detect this? As Rich says, we do not have any simple rules. Well, OK, so we do not have simple rules. Maybe no simple rules exist. But then, why do we not at least have complicated rules to help us make such important decisions? At least then we would have a way to predict (in the scientific sense) something that is worthwhile knowing. As it stands we have nothing, and nobody seems interested in finding out why. Maybe a new approach is needed. Maybe this is where Fred Cahn’s work is pertinent: if you can approximate the nonlinearity with a Taylor series, then maybe the quality of the fit can provide a diagnostic to form the foundation of a rule on which to base a decision. Maybe something else will work. We do not know, but it is a possible starting point. Fred, you are in the ideal position to pursue this, how about it – will you accept this challenge? The above example, of course, is relatively abstract and “academic”, and as such perhaps not of too much interest to the majority. Another example, with more practical application, is transfer of calibration models from one instrument to another. This is an endeavor of enormous current practical importance. Witness that hardly a month passes without at least one article on that topic in one or more of the analytical or spectroscopic journals. Yet all those reports are the same: “Effect of Data Treatment ABC Combined with Algorithm XYZ Compared to Algorithm UVW” or some such; they are all completely empirical studies. In themselves there is nothing wrong with that. The problem is that there is nothing else. There are no critical reviews summarizing all this work and extracting those aspects that are common and beneficial (or common and harmful, for that matter). Even worse, there are no fundamental studies dealing with the relationship of the algorithm’s behavior to the underlying physics, chemistry, mathematics, or instrumental effects. It is not difficult to see that the calibration transfer problem breaks down into two pieces: a) the effect of instrumental variation on the data; b) the effect of variations of the data on the model. Studying the effects of instrumental performance should be the province of the manufacturers. Unfortunately, the perception is that it is to their benefit to release such results only if they turn out to be “good”, and there is little incentive for them to perform studies whose only purpose is to increase scientific knowledge. Thus it is up to academia to pick up this particular ball, if there is any interest in it at all. Fundamental studies in those areas will eventually give rise to real knowledge about how and when calibrations can be transferred, and provide us with trustworthy recipes for doing the transfer.
Such knowledge will also provide us with the confidence of knowing that the underlying science is sound, and thus take us beyond the “my algorithm is better than your algorithm” stage that we are now at. Furthermore, true fundamental understanding could also be applied in reverse. Then instrument manufacturers could concentrate on those aspects of construction and operation that affect the transferability situation, and be able to verify their capabilities in an unambiguous, scientifically valid and agreed-on manner. This is just one other example of a current problem that COULD be attacked with fundamental studies, with both short- and long-term benefits that are obvious to all. Connecting to the statistical foundations, as described above, can have other benefits. For example, computing an SEP on a validation set of data is considered the be-all and end-all of calibration diagnostics. This is an important calculation, to be sure, but it has its limitations, as well. For example, the SEP alone has no diagnostic capability: it tells you nothing about what you need to do in order to improve a calibration model. For another, even when you compare SEPs from different models and choose the model with the smallest SEP, that does not necessarily mean you are choosing the best model. We often see “robustness” bandied about in discussions of calibration models, but what diagnostics do we have to quantify “robustness”? Without such a diagnostic, how can we expect to evaluate “robustness” either in isolation or to compare with SEP?

By focusing all our attention on the SEP we have also lost the ability to evaluate calibrations on their own. When calibrating spectrometers to do quantitative analysis, where samples are cheap and easy to come by, this loss is not too serious, but what do you do when a project requires calibration runs that cost a million (or ten million) dollars per run, and minimizing the number of runs is the absolute top priority? In such a case, you will not only not have validation data, you will likely not even have enough calibration data to do a leave-one-out calculation, and then being able to evaluate models from calibration diagnostics alone will be critical. Statisticians have, in fact, developed diagnostic tests that provide information about such characteristics, but the Chemometric community, in our arrogance, think we know better and ignore all this prior work. The statistical community has also developed many local and semi-local diagnostic tools to help understand and improve calibration models; we really need to get back to the roots on this, as well. There are innumerable unsolved problems in Chemometrics that need to be addressed: real, scientific problems, not just new ways to throw numbers around.
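As one conceivable starting point for the extrapolation question raised earlier in this chapter, the quality of a low-order polynomial fit can be compared with that of a straight line using calibration data alone. The sketch below is only an illustration under assumed conditions and is not a diagnostic proposed by the authors or by Fred Cahn: a single wavelength, a hypothetical 1% stray-light level, noise-free data, and an extra-sum-of-squares F ratio chosen here as one familiar statistical tool. All names are hypothetical.

```python
import numpy as np

# Illustrative assumptions only: a single wavelength whose absorbance is
# distorted by 1% stray light, with no noise added.
stray = 0.01

def measured_absorbance(conc):
    """Beer's-law transmittance plus a constant stray-light contribution."""
    return -np.log10((10.0 ** (-conc) + stray) / (1.0 + stray))

conc_cal = np.linspace(0.1, 1.0, 10)      # calibration range
a_cal = measured_absorbance(conc_cal)

def poly_fit(degree):
    """Fit conc as a polynomial in absorbance; return coefficients and SSE."""
    coef = np.polyfit(a_cal, conc_cal, degree)
    resid = conc_cal - np.polyval(coef, a_cal)
    return coef, float(np.sum(resid ** 2))

line, sse_line = poly_fit(1)
quad, sse_quad = poly_fit(2)

# Extra-sum-of-squares F ratio for adding the second-order term.
n = len(conc_cal)
f_ratio = (sse_line - sse_quad) / (sse_quad / (n - 3))

# Largest error inside the calibration range versus one extrapolated sample.
max_inside = float(np.max(np.abs(conc_cal - np.polyval(line, a_cal))))
conc_new = 2.0
a_new = measured_absorbance(conc_new)
err_line = float(abs(conc_new - np.polyval(line, a_new)))
err_quad = float(abs(conc_new - np.polyval(quad, a_new)))

print(f"SSE straight line: {sse_line:.3e}   SSE quadratic: {sse_quad:.3e}")
print(f"F ratio for the quadratic term: {f_ratio:.1f}")
print(f"largest error inside the range: {max_inside:.4f}")
print(f"extrapolation error at conc = {conc_new}: line {err_line:.4f}, quadratic {err_quad:.4f}")
```

Under these assumptions the straight line looks acceptable inside the calibration range yet fails badly when extrapolated, while the quadratic term flags the curvature without ever seeing a validation sample. The quadratic fit still leaves a nonzero residual, consistent with the earlier remark that a second-order expansion of a logarithmic nonlinearity remains an approximation.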

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 13(6), 19–21 (1998).
2. Mark, H. and Workman, J., Spectroscopy 14(2), 16–27 (1999).
3. Mark, H. and Workman, J., Spectroscopy 14(1), 16–17 (1999).
4. Mark, H. and Workman, J., Spectroscopy 13(11), 18–21 (1998).
5. Mark, H. and Workman, J., Spectroscopy 2(1), 38–39 (1987).
6. Mark, H. and Workman, J., Spectroscopy 13(4), 26–29 (1998).
7. Kramer, R., Chemometric Techniques for Quantitative Analysis (Marcel Dekker, New York, 1998).

33 Linearity in Calibration: Act II Scene V

This chapter is still a continuation of the discussion started by the responses received to Chapter 27 from our initial publication of “Linearity in Calibration” [1]. Up until now our discussion has extended over Chapters 29–32, which appeared as original paper publications ([2–5], respectively). At this point, however, we are finally getting toward the end of our obsession with considerations of linearity – at least until we receive another set of comments from our readers. Incidentally, we welcome such feedback, even those comments that disagree with us or with which we disagree, so please keep it coming. Indeed, it seems that we do not get much feedback unless our readers disagree with us, and feel it strongly enough to feel the need to say so. That is great – there is nothing like a little controversy to keep a book like this interesting: who said chemometrics and statistics and mathematics were dry subjects, anyway?! In our original column on this topic [1] we had only done a principal component analysis to compare with the MLR results. One of the comments made, and it was made by all the responders, was to ask why we did not also do a PLS analysis of the synthetic linearity data. There were a number of reasons, and we offered to send the data to any or all of the responders who would care to do the PLS analysis and report the results. Of the original responders, Paul Chabot took us up on our offer. In addition, at the 1998 International Diffuse Reflectance Conference (the “Chambersburg” meeting), Susan Foulk also offered to do the PLS analysis of this data. Gratifyingly, when Paul and Susan reported their PLS loadings they were identical, even though they used different software packages to do the PLS calculations (PLSIQ and Unscrambler). We are certainly glad we do not have to worry about sorting out differences in software packages (due to different convergence criteria, etc., that sometimes creep into results such as these) on top of the Chemometric issues we want to address.
Figure 33-1 presents the plot of the PLS loadings. Paul and Susan each computed both loadings. Note that the first loading is indistinguishable to the eye from the first PCA loading (see our original column on this topic [1]). Paul and Susan each also computed the two calibration models and performance statistics for both models. Although the various programs did not compute the same sets of performance statistics (and in one case a different computation seemed to be given the same label, SEE), the ones that were reported by both programs had identical values. As expected by all responders, and by your hosts as well, when two-factor models (either PCR or PLS) were computed, the fit of the model to the synthetic data was perfect. Table 33-1 presents a summary of the numerical results obtained, for one-factor calibration models. Interestingly, when comparing the calibration results we find that the reported correlation coefficients agree among the different programs using the same algorithm, but the SEE values differ appreciably; it would seem that not all programs use the same

Figure 33-1 PLS loadings from the synthetic data used to test the fit of models to nonlinearity. Plot title: PLS Loadings; vertical axis: Loading (about –0.3 to 0.2); horizontal axis: Index (0 to 300). (see Colour Plate 2)

Table 33-1 Summary of results obtained from synthetic linearity data using one PCA or PLS factor. We present only those performance results listed by the data analyst as Correlation Coefficient and Standard Error of Estimate

Data analyst    Type of analysis    Corr. Coeff.    SEE
Column          PCR                 0.999622439     0.057472
Chabot          PCR                 0.999622411     0.01434417
Chabot          PLS                 0.999623691     0.01436852
Foulk           PLS                 0.999624        0.051319

definition of SEE. This leaves in question, for example, whether the value reported for SEE from PLS by Susan Foulk is really as large an improvement over the SEE for PCR reported by your columnists, or if it is due to a difference in the computation used. Since Paul Chabot reported SEE for both algorithms and his values are more nearly the same, even though his computation seems to differ from both the others, the tentative conclusion is that there is a difference in the computation. Indeed, we find that if we multiply our own value for SEE by the square root of 4/5, we obtain a value of 0.0514045, a value that compares to the SEE obtained by Susan Foulk in more nearly the same way that Paul Chabot’s values compare to each other, indicating a possibility that there is a discrepancy in the determination of degrees of freedom that are used in the two algorithms. Based on the values of the correlation coefficients, then, we can find the following comparisons between the two algorithms: as several of the responders indicated, the PLS model did provide improved results over the PCR model. On the other hand, the degree of improvement was not the major effect that at least some of the responders expected. As Richard Kramer expected,

“PLS should have, in principle, rejected a portion of the non-linear variance resulting in a better, although not completely exact, fit to the data with just 1 factor.” Some of this variance was indeed rejected by the PLS algorithm, but the amount, compared to the Principal Component algorithm, seems to have been rather minuscule, rather than providing a nearly exact fit. Nonlinearity is, to say the least, not extensively discussed as a specific topic in the multivariate calibration literature. Textbooks routinely cover nonlinearity in connection with multiple linear regression, but do not cover the issue for “full-spectrum” methods such as PCR and PLS. Some discussion does exist relative to multiple linear regression, for example in Chemometrics: A Textbook by D.L. Massart et al. [6]; see Section 2.1, “Linear Regression” (pp. 167–175) and Section 2.2, “Non-linear Regression” (pp. 175–181). The authors state, “In general, a much larger number of parameters [wavelengths, frequencies, or factors] needs to be calculated in overlapping peak systems [some spectra or chromatograms] than in the linear regression problems.” (p. 176) The authors describe the use of a Taylor expansion to negate the second and the higher order terms under specific mathematical conditions in order to make “any function” (i.e., our regression model) first-order (or linear). They introduce the use of the Jacobian matrix for solving nonlinear regression problems and describe the matrix mathematics in some detail (pp. 178–181). There are also forms of nonlinear PCR and PLS where the linear PCR or PLS factors are subjected to a nonlinear transformation during singular value decomposition; the nonlinear transformation function can be varied with the nonlinearity expected within the data. These forms of PCR/PLS utilize a polynomial inner relation as spline fit functions or neural networks. References for these methods are found in [7].
A mathematical description of the nonlinear decomposition steps in PLS is found in [8]. These methods can be used to empirically fit data for building calibration models in nonlinear systems.

The interesting point is that there are cases, such as the one demonstrated in the Linearity in Calibration chapter, where nonlinearity is the dominant phenomenon and where MLR will fit the data more closely with fewer terms than either PCR or PLS. One could imagine a real case where an analyte would have a minor absorption band such that the magnitude of the spectral band is within a linear region of the measuring instrument. One could also imagine that the major absorption band of this analyte is somewhat nonlinear at the higher concentration ranges. In this special case MLR would provide a closer fit with fewer terms than either PLS or PCR, unless the minor band was isolated prior to model development using PCR or PLS. This points to a continuing need for spectral band selection algorithms that can automatically search for the optimum spectral information and linear fit prior to the calibration modeling step. But all other things being equal, cases remain where MLR with an automatic channel selection feature will provide a better fit than either PCR or PLS. Surprising indeed, to some people!

In their day, Principal Components and Partial Least Squares were each considered almost as “the magic answer to all calibration problems”. It took a long time for the realization to dawn that they contain no “magic” and are subject to most of the


same problems as the algorithm previously available (at that time, what we now call MLR). Now we see a surge in other new algorithms: wavelets, neural networks, genetic algorithms, as well as the combining of techniques (e.g., selecting wavelengths before performing a PCA or PLS calculation). While some of the veterans of the “PC wars” (not “political correctness”, by the way) realize that these methods can be overfit just as MLR calibrations can, and have become wary of the problem and more cautious with new algorithms, there is some evidence that a large number, perhaps the majority, of users are not nearly so careful, and are still looking for their “magic answer”.

There is a generic caution that needs to be promoted, and of which all users should be made aware, when dealing with these more sophisticated methods. That is the simple fact that every new parameter that can be introduced into a calibration procedure is another way to overfit and to hide the fact that it is happening. Worse, the more sophisticated the algorithm, the harder it is to see and recognize that this is going on. With PCR and PLS we introduced the extra parameter of the number of factors: one extra parameter. With wavelets we introduce the order and the locality of each wavelet: two extra parameters. With neural nets, we have the number of nodes in each layer: n extra parameters, and then there is even a metaparameter: the number of layers. No wonder reports of overfitting abound (and don’t forget: those are only the ones that are recognized)! And nary a diagnostic in sight.

In a perfect world, a new algorithm would not be introduced until a corresponding set of diagnostic methods had been developed to inform the user how the algorithm was behaving. As long as we are dreaming, let us have those diagnostics be informative, in the sense that if the algorithm was misbehaving, they would point the user in the proper direction to fix it.

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 13(6), 19–21 (1998).
2. Mark, H. and Workman, J., Spectroscopy 13(11), 18–21 (1998).
3. Mark, H. and Workman, J., Spectroscopy 14(1), 16–17 (1999).
4. Mark, H. and Workman, J., Spectroscopy 14(2), 16–27 (1999).
5. Mark, H. and Workman, J., Spectroscopy 14(5), 12–14 (1999).
6. Massart, D.L., Vandeginste, B.G.M., Deming, S.N., Michotte, Y. and Kaufman, L., Chemometrics: A Textbook (Elsevier Science Publishers, Amsterdam, 1988).
7. Wold, S., Kettaneh-Wold, N. and Skagerberg, B., Chemometrics and Intelligent Laboratory Systems 7, 53–65 (1989).
8. Wold, S., Chemometrics and Intelligent Laboratory Systems 14 (1992).

34

Collaborative Laboratory Studies: Part 1 – A Blueprint

We will begin by taking a look at the detailed aspects of a basic problem that confronts most analytical laboratories. This is the problem of comparing two quantitative methods performed by different operators or at different locations. This is an area that is not restricted to spectroscopic analysis; many of the concepts we describe here can be applied to evaluating the results from any form of chemical analysis. In our case we will examine a comparison of two standard methods to determine precision, accuracy, and systematic errors (bias) for each of the methods and laboratories involved in an analytical test. In the case we use for our example, one of the analytical methods is spectroscopic and the other is an HPLC method.

As it happens, a particularly opportune event occurred recently, almost simultaneously with our writing these next few chapters: an article [1] appeared in LC-GC, a sister magazine to Spectroscopy, that also takes concepts that we discussed and described in some of our early chapters and applies them to a real-life situation (or at least a simulation of a real-life situation; the main difference is that the experiment described deals with macroscopic objects while the “real world” deals in atoms and molecules). In past chapters [2, 3] we also described how probabilistic phenomena give rise to distributions and even included computer programs to allow simulations of this, but given the constraints of time and text space, we were not able to link that to the actual behavior of the physical world nearly as well as Hinshaw does. In the case described, given the venue, the interest is in the chromatography, and for that reason we will not dwell on their application. However, we do strongly urge our readers to obtain a copy of this article and read it for its description of the basis and generation of the distributions that arise from the effects of the random behavior of the physical world.
The probabilistic and statistical experiments described are superb examples of how concepts such as these can be illustrated and brought to life.

The statistical tools we describe in the next few chapters, and use for this demonstration, are ones that we have previously described. These tools include statistical hypothesis testing and ANOVA. Our previous descriptions of these topics were generic and rather general; at that time we were interested in presenting the theoretical background and reasoning behind the development of these statistical techniques. Now we will use them in a practical situation, to show how these methods can be used to evaluate various characteristics relating to the precision and accuracy of analytical methods, applying them to real data to simultaneously demonstrate how to use them and the nature of the results that can be obtained. We will use ANOVA to evaluate potential bias in reported results inherent in the analytical methods themselves, or due to the operators (i.e., location of laboratory) performing the methods. For the next series of articles all computations were completed using MathCad Worksheets [4] written by the authors. The objective of this next set of articles is to determine the precision, accuracy, and bias due to choice of analytical


method and/or operator for the determination of an analyte within a set of hypothetical production samples and spiked recovery samples (samples of gravimetrically known composition). The discussion will occupy Chapters 34–39.

EXPERIMENTAL DESIGN

The experimental design used for this hypothetical study is based on a relatively simple factorial model where individual samples are measured as shown in Figure 34-1 and Table 34-1. We have previously discussed factorial designs [5] although, as was the case with ANOVA, our previous discussion was simplified and primarily theoretical, to demonstrate the principles involved, while in the current discussion, we apply these concepts to a more realistic practical situation. For this hypothetical test, samples consist of three production run samples (i.e., Nos. 1–3) with a target analyte value of 3.60 units (percent, grams, pounds, etc.). In addition, three spiked recovery samples with target analyte levels of 3.40, 3.61, and 3.80%, respectively, are represented by Nos. 4–6. This experimental model allows the methods and locations (labs or operators) to be compared for precision, accuracy, and systematic errors. We will use the designation Lab 1 and Lab 2 to indicate different locations and/or operators performing the identical procedures for METHODS A and B (or I and II).

Before considering the design and the analysis of it in detail, let us take a look at the factors that are being included in the design, and their impact on the experimental design and the analysis of this design: we have six samples, two methods of analysis for the constituent of interest, two laboratories, two chemists in each laboratory and five repeat readings of the constituents of each sample by each chemist. Statistical hypothesis

[Figure 34-1: tree diagram. Levels, top to bottom: Sample (each sample, n = 6); Location (Lab 1, Lab 2); Method (Method I and Method II within each lab); Replicates (r1 to r5 for each lab and method).]

Figure 34-1 A simple factorial design for collaborative data collection. Each sample analyzed (in this hypothetical case n = 6) requires multiple labs, or operators, using both methods of analysis and replicating each measurement a number of times (r = 5 for this hypothetical case).
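The layout in Figure 34-1 can be enumerated directly, which is a quick way to confirm how many readings the full design calls for. The following is a minimal sketch in Python, not part of the original MathCad worksheets; the names `SAMPLES`, `LABS`, `METHODS`, `REPLICATES` and `design_cells` are hypothetical and chosen only for illustration. It follows the four levels shown in the figure (sample, location, method, replicate) and does not model the chemists within each laboratory.

```python
from itertools import product

# Levels of the factorial design as drawn in Figure 34-1 (illustrative names).
SAMPLES = range(1, 7)                 # six samples (n = 6)
LABS = ("Lab 1", "Lab 2")             # locations / operators
METHODS = ("Method I", "Method II")   # the two analytical methods
REPLICATES = range(1, 6)              # r = 5 repeat readings


def design_cells():
    """Return every (sample, lab, method, replicate) combination."""
    return list(product(SAMPLES, LABS, METHODS, REPLICATES))


cells = design_cells()
print(len(cells))   # number of readings in the complete design
print(cells[0])     # first cell: sample 1, Lab 1, Method I, replicate 1
```

A complete design has 6 × 2 × 2 × 5 readings; Table 34-1 reports slightly fewer because two replicates are missing for one sample.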


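Table 34-1 “As reported” analytical data* for collaborative study

| Sample no. – Replicate no. | Lab 1 – Method B | Lab 2 – Method B | Lab 1 – Method A | Lab 2 – Method A |
|---|---|---|---|---|
| 1-1 | 3.507 | 3.507 | 3.462 | 3.460 |
| 1-2 | 3.463 | 3.497 | 3.442 | 3.443 |
| 1-3 | 3.467 | 3.503 | 3.460 | 3.447 |
| 1-4 | 3.501 | 3.473 | 3.517 | – |
| 1-5 | 3.489 | 3.447 | 3.460 | – |
| Mean | 3.485 | 3.485 | 3.468 | 3.450 |
| 2-1 | 3.479 | 3.497 | 3.446 | 3.460 |
| 2-2 | 3.453 | 3.660 | 3.448 | 3.470 |
| 2-3 | 3.459 | 3.473 | 3.455 | 3.450 |
| 2-4 | 3.461 | 3.447 | 3.456 | 3.460 |
| 2-5 | 3.481 | 3.453 | 3.455 | 3.460 |
| Mean | 3.467 | 3.506 | 3.452 | 3.460 |
| 3-1 | 3.366 | 3.370 | 3.318 | 3.337 |
| 3-2 | 3.362 | 3.327 | 3.330 | 3.317 |
| 3-3 | 3.351 | 3.387 | 3.328 | 3.337 |
| 3-4 | 3.353 | 3.430 | 3.322 | 3.330 |
| 3-5 | 3.347 | 3.383 | 3.323 | 3.330 |
| Mean | 3.356 | 3.379 | 3.324 | 3.330 |
| 4-1 | 3.421 | 3.407 | 3.366 | 3.380 |
| 4-2 | 3.377 | 3.400 | 3.360 | 3.380 |
| 4-3 | 3.399 | 3.417 | 3.361 | 3.380 |
| 4-4 | 3.379 | 3.353 | 3.362 | 3.380 |
| 4-5 | 3.379 | 3.380 | 3.370 | 3.380 |
| Mean | 3.391 | 3.391 | 3.364 | 3.380 |
| 5-1 | 3.565 | 3.540 | 3.538 | 3.560 |
| 5-2 | 3.568 | 3.550 | 3.539 | 3.580 |
| 5-3 | 3.561 | 3.573 | 3.544 | 3.590 |
| 5-4 | 3.576 | 3.533 | 3.540 | 3.580 |
| 5-5 | 3.587 | 3.543 | 3.543 | 3.560 |
| Mean | 3.571 | 3.548 | 3.541 | 3.570 |
| 6-1 | 3.764 | 3.860 | 3.741 | 3.740 |
| 6-2 | 3.742 | 3.833 | 3.740 | 3.760 |
| 6-3 | 3.775 | 3.933 | 3.739 | 3.730 |
| 6-4 | 3.767 | 3.870 | 3.742 | 3.770 |
| 6-5 | 3.766 | 3.810 | 3.744 | 3.750 |
| Mean | 3.763 | 3.881 | 3.741 | 3.740 |

* Note: For this hypothetical exercise, Samples 1–3 have a target value of 3.60% absolute, whereas Samples 4–6 are Spiked Recovery Samples with target values of 3.40 (No. 4), 3.61 (No. 5), and 3.80 (No. 6).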


testing provides us with an objective method of determining whether or not a given difference in conditions (i.e., factor) has an effect on the readings. We have the following a priori expectations for the behavior of these several factors:

a) Since we know that the samples are of different composition, we expect the measurements of the constituent value to reflect this genuine difference in composition, and therefore the differences between samples to be systematic and constant across all other factors. Any departure from constant differences (beyond the amount expected from random variation due to unavoidable random error of the analysis, of course) can be attributed to an effect of the corresponding factor, or to blunders such as improper mixing or sampling of the material.

b) There may be an effect due to the use of two different laboratories. This effect may or may not be the same for the two different methods of analysis. This can be examined by comparing the results of measurements on the same sample by the same method in each of the two different laboratories. Under the proper circumstances, results from multiple samples may be combined to achieve a more definitive test. Before doing so, the existence of the appropriate circumstances must first be determined.

c) There may be an effect due to the use of two different methods of analysis. This effect may or may not be the same in the two different laboratories. There may or may not be a difference between the two chemists in each laboratory. This can be examined by comparing the results of measurements on the same sample by the two different methods of analysis. Under the proper circumstances, results from multiple samples may be combined to achieve a more definitive test; if circumstances are appropriate, results from the two chemists in each laboratory and the results from the two laboratories may also be combined. Before doing so, the existence of the appropriate circumstances must first be determined.
d) There may or may not be a difference between the two chemists’ readings of the constituent values in a given laboratory. If we arbitrarily label the chemists in each laboratory as “Chemist #1” and “Chemist #2”, we would not expect a systematic difference between the corresponding chemists in the two different laboratories. This can, however, happen by coincidence. This can be examined by comparing the results of measurements on the same sample by the two different chemists in each laboratory. Under the proper circumstances, results from multiple samples may be combined to achieve a more definitive test. Before doing so, the existence of the appropriate circumstances must first be determined. Many of these aspects will be presented over the next several chapters.

e) We do not expect any systematic effects among the five repeat readings of each sample by each chemist in each laboratory. We do expect random variations, reflecting unavoidable random errors of measurement. These unavoidable random errors of measurement are quantified by the terms “precision” and “accuracy”.

f) We expect the precision and accuracy for each method to be the same at both laboratories. This can be examined by comparing the precision and accuracy of each method in each laboratory, combining results from multiple samples when appropriate. Before doing so, the existence of the appropriate circumstances must first be determined.

g) We do not expect the precision and accuracy to be the same for the two methods except by coincidence.


h) We expect the precision and accuracy to be the same for all four chemists for each method, unless we find a difference in precision and/or accuracy between laboratories. This can be examined by comparing the precision and accuracy of each method as performed by each chemist, combining results from multiple samples when appropriate. Before doing so, the existence of the appropriate circumstances must first be determined.

The use of the statistical tools of ANOVA and statistical hypothesis testing, described previously in these chapters and whose application is described in further detail below, allows separation of the effects due to the various factors and objective verification as to which ones are statistically significant. In the absence of any systematic effects due to one or more of the factors, our a priori expectation is that any differences seen are due to the effects of unavoidable random errors only, and will therefore be non-significant. Therefore, any statistically significant effect found due to differences between sets of readings indicates that the corresponding factor has a real, systematic effect on the readings. By posing the scientific questions about the effects of the factors in the formalism of statistical hypothesis tests [6], we obtain the handle we need to extract that information from the mass of data produced by this simple-seeming, but (as we see) actually very complicated experimental design.

Data analysis for this series was performed using MathCad, and the statistical methods used are described in greater detail in Youden’s monograph [7] and in Mark and Workman [8]. We use the MathCad worksheets both to illustrate how the theoretical concepts can be put to actual use and also to demonstrate how to perform the calculations we describe.
The worksheets will be printed along with the chapters in which they are first used. At a later date we are planning to enable you to go to the Spectroscopy home page (http://www.spectroscopymag.com) and find them. If, and when, the actual URLs for the worksheets become available, we will let you know.

The primary goal of this series of chapters is to describe the statistical tests required to determine the magnitude of the random (i.e., precision and accuracy) and systematic (i.e., bias) error contributions due to choosing Analytical METHODS A or B, and/or the location/operator where each standard method is performed. The statistical analysis for this series of articles consists of five main parts:

Part 1: Overall comparison of both locations and analytical methods for precision and accuracy;
Part 2: Analysis of Variance testing for both locations and analytical methods to determine if an overall bias exists for location or analytical method;
Part 3: Testing for systematic error in each method by performing a comparison test for a set of measurements versus the known True Value;
Part 4: Performing a ranking test to determine if either analytical method or location affects the results as a systematic error (bias); and
Part 5: Computing the “efficient comparison of two methods” as described by Youden and Steiner in reference [7].

The analyst may use one or more of these statistical test methods to compare analytical results depending upon individual requirements. It is recommended that the easiest


and most fruitful test for the effort expended would be the test method described in Chapter 38. This simple set of tests statistically compares precision, accuracy, and systematic error for two methods with the minimum quantity of analytical effort. Chapter 38 is recommended above Chapters 34–37, but it is useful to work through the earlier chapters before proceeding to Chapter 38. The basic experimental design required for the statistical methods in Chapters 34–37 is demonstrated in Figure 34-1 and the data are presented in Table 34-1. The basic experimental design required for the Chapter 38 statistical methods is given in Figure 34-2 and the corresponding data in Table 34-2. Thus, if you would like to follow along by performing these tests on your own real data, the basic designs are demonstrated here to allow you to collect data before proceeding through the statistical methods described within the next 6 chapters.

[Figure 34-2: tree diagram. Levels: Method (Method A, Method B); Sample (Sample X and Sample Y under each method); Replicates (r1 to r5 for each sample).]

Figure 34-2 Simple experimental design for Youden/Steiner comparison of two Methods (data shown in Table 34-2).

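Table 34-2 Analytical data entry for comparison of two methods tests

| | Method A – Sample X | Method A – Sample Y | Method B – Sample X | Method B – Sample Y |
|---|---|---|---|---|
| | 3.366 | 3.741 | 3.421 | 3.764 |
| | 3.380 | 3.740 | 3.407 | 3.860 |
| | 3.360 | 3.740 | 3.377 | 3.742 |
| | 3.380 | 3.760 | 3.400 | 3.833 |
| Mean | 3.372 | 3.745 | 3.401 | 3.800 |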


ANALYTICAL METHODS

Sample collection and handling

Let us say the first three samples tested were collected by Lab 2 from their production facility. These samples were retained from actual production lots. An aliquot from each retained jar was removed and shipped to Lab 1 in appropriate sealed containers. METHOD B testing was started at both laboratories the day following receipt of the samples to rule out any possible aging effects. METHOD A testing was performed in Lab 1 on the following day, while the METHOD A testing in Lab 2 occurred a week later. The second three samples were spiked samples, produced at Lab 2 using the pure analyte reagent and Control material. An aliquot of each sample was shipped to Lab 1 in appropriate sealed containers. Once again, the METHOD B testing was performed on the same day at both locations. METHOD A testing was done at both sites within a 2-day time period.

METHOD A and B analysis

All six samples at both sites were prepared the same way. Five separate aliquots from each sample were separately sampled and prepared for testing. Each aliquot was then measured three times. Conditions and standard operating procedures for METHODS A and B were carefully specified for both Labs 1 and 2.

RESULTS AND DATA ANALYSIS

Comparing all laboratories and all methods for precision and accuracy

COMPARISON OF PRECISION AND ACCURACY FOR METHODS AND LABORATORIES USING THE GRAND MEAN FOR SAMPLES No. 1–3 (Collabor_GM Worksheet), OR BY USING A SPIKED RECOVERY STUDY FOR SAMPLES No. 4–6 (Collabor_TV Worksheet)

To compute the results shown in Tables 34-3 and 34-4, the precision of each set of replicates for each sample, method, and location is individually calculated using the root mean square deviation equation as shown (Equations 34-1 and 34-2) in standard symbolic and MathCad notation, respectively. Thus the standard deviation of each set of sample replicates yields an estimate of the precision for each sample, for each method, and for each location. The precision is calculated where each $y_{ij}$ is an individual replicate (j) measurement for the ith sample; $\bar{y}_i$ is the average of the replicate measurements for the ith sample, for each method, at each location; and N is the number of replicates for each sample, method, and location. The results of these computations for these data are found in Tables 34-3 and 34-4, representing samples 1–3 (hypothetical production samples) and 4–6 (hypothetical spiked samples), respectively.

$$S = \sqrt{\frac{\sum_{j=1}^{N}\left(y_{ij}-\bar{y}_i\right)^2}{N-1}} \tag{34-1}$$

$$S = \sqrt{\frac{\sum \overrightarrow{\left(Y-\mathrm{mean}(Y)\right)^2}}{N-1}} \tag{34-2}$$

Table 34-3 Individual sample analysis precision for hypothetical production samples

| Sample no. | METHOD B – Lab 1 | METHOD B – Lab 2 | METHOD A – Lab 1 | METHOD A – Lab 2 |
|---|---|---|---|---|
| Sample 1 | 0.020 | 0.025 | 0.0089 | 0.0089 |
| Sample 2 | 0.013 | 0.088 | 0.0066 | 0.010 |
| Sample 3 | 0.0079 | 0.037 | 0.0068 | 0.012 |
| Pooled | 0.015 | 0.057 | 0.008 | 0.010 |

Table 34-4 Individual sample analysis precision for hypothetical spiked recovery samples

| Sample no. | METHOD B – Lab 1 | METHOD B – Lab 2 | METHOD A – Lab 1 | METHOD A – Lab 2 |
|---|---|---|---|---|
| Sample 4 | 0.019 | 0.025 | 0.0041 | 0.000 |
| Sample 5 | 0.010 | 0.015 | 0.0026 | 0.013 |
| Sample 6 | 0.012 | 0.047 | 0.0019 | 0.016 |
| Pooled | 0.014 | 0.032 | 0.0030 | 0.012 |
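Equation 34-1 can be checked numerically against the precision tables. The sketch below is a plain-Python stand-in for the MathCad expression of Equation 34-2, not the authors' worksheet; the function name `replicate_sd` is hypothetical. The readings are the Sample 1 replicates of Table 34-1, taken in percent (3.507, 3.463, and so on).

```python
import math


def replicate_sd(values):
    """Equation 34-1: sqrt(sum((y - mean)^2) / (N - 1)) for one replicate set."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((y - mean) ** 2 for y in values) / (n - 1))


# Sample 1 replicates from Table 34-1, in percent.
method_b_lab1 = [3.507, 3.463, 3.467, 3.501, 3.489]
method_a_lab2 = [3.460, 3.443, 3.447]   # only three readings were reported

print(round(replicate_sd(method_b_lab1), 3))   # compare with Table 34-3, METHOD B, Lab 1
print(round(replicate_sd(method_a_lab2), 4))   # compare with Table 34-3, METHOD A, Lab 2
```

Both results agree with the corresponding Sample 1 entries of Table 34-3 to the precision reported there.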

The pooled precision and accuracy for each sample for both analytical methods and locations are calculated using Equations 34-3 and 34-4, representing standard symbolic and MathCad notation, respectively. The pooled precision is calculated where each $y_{ij}$ is an individual replicate measurement for an individual sample; $\bar{y}_i$ is the average of the replicate measurements for each sample, each method, each location; and $N_i$ is the number of replicates for an individual (ith) sample, method, and location. The results of these computations for these data are found in the Pooled rows of Tables 34-3 and 34-4, representing samples 1–3 and 4–6, respectively. The results from Tables 34-3 and 34-4 indicate no trend in error with respect to concentration.

$$P_s = \sqrt{\frac{\sum_{j=1}^{N_1}\left(y_{1j}-\bar{y}_1\right)^2 + \sum_{j=1}^{N_2}\left(y_{2j}-\bar{y}_2\right)^2 + \sum_{j=1}^{N_3}\left(y_{3j}-\bar{y}_3\right)^2 + \sum_{j=1}^{N_4}\left(y_{4j}-\bar{y}_4\right)^2}{N_1+N_2+N_3+N_4-4}} \tag{34-3}$$
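The pooling in Equation 34-3 can be sketched in a few lines. The version below is an illustration only: Equation 34-3 is written for four replicate sets (hence the "− 4" in the denominator), whereas the hypothetical helper `pooled_sd` accepts any number of sets and subtracts one degree of freedom per set, which reduces to the same expression for four sets. The data are the METHOD A, Lab 2 readings for Samples 4–6 from Table 34-1, in percent.

```python
import math


def pooled_sd(groups):
    """Pooled standard deviation: sqrt(total within-set SS / total (N_i - 1))."""
    sum_sq = 0.0
    dof = 0
    for g in groups:
        mean = sum(g) / len(g)
        sum_sq += sum((y - mean) ** 2 for y in g)   # within-set sum of squares
        dof += len(g) - 1                           # one degree of freedom lost per set
    return math.sqrt(sum_sq / dof)


# METHOD A, Lab 2, spiked samples 4-6 (Table 34-1), in percent.
method_a_lab2 = [
    [3.380, 3.380, 3.380, 3.380, 3.380],   # Sample 4
    [3.560, 3.580, 3.590, 3.580, 3.560],   # Sample 5
    [3.740, 3.760, 3.730, 3.770, 3.750],   # Sample 6
]

print(round(pooled_sd(method_a_lab2), 3))   # compare with the Pooled row of Table 34-4
```

For this column the result agrees with the Pooled entry of Table 34-4 (METHOD A, Lab 2).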

$$P_s = \sqrt{\frac{\sum\overrightarrow{\left(Y1-\mathrm{mean}(Y1)\right)^2} + \sum\overrightarrow{\left(Y2-\mathrm{mean}(Y2)\right)^2} + \sum\overrightarrow{\left(Y3-\mathrm{mean}(Y3)\right)^2} + \sum\overrightarrow{\left(Y4-\mathrm{mean}(Y4)\right)^2}}{N1+N2+N3+N4-4}} \tag{34-4}$$

Table 34-5 Individual sample analysis estimated accuracy using grand mean calculation

| Sample no. | METHOD B – Lab 1 | METHOD B – Lab 2 | METHOD A – Lab 1 | METHOD A – Lab 2 |
|---|---|---|---|---|
| Sample 1 | 0.025 | 0.029 | 0.029 | 0.029 |
| Sample 2 | 0.014 | 0.096 | 0.031 | 0.017 |
| Sample 3 | 0.012 | 0.051 | 0.037 | 0.024 |
| Pooled | 0.018 | 0.065 | 0.033 | 0.024 |

To compute the results shown in Table 34-5 for production samples, the accuracy of each set of replicates for each sample, method, and location was individually calculated using the root mean square deviation equation as shown in Equations 34-5 and 34-6 in standard symbolic and MathCad notation, respectively. The root mean square deviation of each set of sample replicates from the Grand Mean yields an estimate of the accuracy for each sample, for each method, and for each location. The accuracy is calculated where each $y_{ij}$ is an individual replicate measurement; GM is the Grand Mean of the replicate measurements for each sample, both methods, both locations; and N is the number of replicates for each sample, method, and location. The results found in Table 34-5 represent samples 1–3. Note: Each sample had a Grand Mean computed by taking the mean for all measurements made for each of the samples 1–3.

$$S_i = \sqrt{\frac{\sum_{j=1}^{N}\left(y_{ij}-GM_i\right)^2}{N-1}} \tag{34-5}$$

$$S = \sqrt{\frac{\sum\overrightarrow{\left(Y-GM\right)^2}}{N-1}} \tag{34-6}$$

To compute the results shown in Table 34-6 for the Spiked Recovery samples, the accuracy of each set of replicates for each sample, method, and location can be individually calculated using the root mean square deviation equation as shown in Equations 34-5 and 34-6 in standard symbolic and MathCad 7.0 notation, respectively, with the Spiked or true values (TV) substituted for GM. The root mean square deviation of each set of sample replicates from the true value yields an estimate of the accuracy for each sample, for each method, and for each location. Each $y_{ij}$ is an individual replicate measurement, and N is the number of replicates for each sample, method, and location. The results found in Table 34-6 represent samples 4 through 6. Note: Each sample had a True Value given by a known analyte spike into the sample.
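The accuracy calculation of Equation 34-5 differs from the precision calculation only in the reference point: deviations are taken from the Grand Mean or the true value rather than from the replicate mean. The following is a minimal sketch under that reading, with a hypothetical function name `rms_about_reference`; it is not the Collabor_GM or Collabor_TV worksheet. The example uses the Sample 4 readings of Table 34-1 (in percent) with the spiked value 3.40 as the reference.

```python
import math


def rms_about_reference(values, reference):
    """Equation 34-5 with `reference` standing in for GM (or the true value TV)."""
    n = len(values)
    return math.sqrt(sum((y - reference) ** 2 for y in values) / (n - 1))


TRUE_VALUE_SAMPLE_4 = 3.40   # spiked level for Sample 4

# Sample 4 replicates from Table 34-1, in percent.
method_b_lab1 = [3.421, 3.377, 3.399, 3.379, 3.379]
method_a_lab2 = [3.380, 3.380, 3.380, 3.380, 3.380]

print(round(rms_about_reference(method_b_lab1, TRUE_VALUE_SAMPLE_4), 3))
print(round(rms_about_reference(method_a_lab2, TRUE_VALUE_SAMPLE_4), 3))
```

The METHOD A, Lab 2 case is instructive: the five readings are identical, so the precision is zero, yet every reading is 0.02 below the spiked value, and the accuracy figure reflects that offset. Both results agree with the Sample 4 entries of Table 34-6.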


Table 34-6 Individual sample analysis accuracy using Spiked Recovery study

| Sample no. | METHOD B – Lab 1 | METHOD B – Lab 2 | METHOD A – Lab 1 | METHOD A – Lab 2 |
|---|---|---|---|---|
| Sample 4 | 0.022 | 0.027 | 0.041 | 0.022 |
| Sample 5 | 0.044 | 0.071 | 0.077 | 0.042 |
| Sample 6 | 0.043 | 0.083 | 0.066 | 0.058 |
| Pooled | 0.038 | 0.065 | 0.063 | 0.043 |

Table 34-7 Individual sample precision and accuracy for combined Methods A and B and Labs 1 and 2 – Production samples

| No. | GM | Precision | Accuracy |
|---|---|---|---|
| Sample 1 | 3.472 | 0.0231 | 0.0278 |
| Sample 2 | 3.471 | 0.0479 | 0.0538 |
| Sample 3 | 3.347 | 0.021 | 0.033 |
| Pooled | 3.430 | 0.033 | 0.040 |

Table 34-8 Individual sample precision and accuracy for combined Methods A and B and Labs 1 and 2 – Spiked Recovery samples

| No. | TR | Precision | Accuracy |
|---|---|---|---|
| Sample 4 | 3.40 | 0.016 | 0.029 |
| Sample 5 | 3.61 | 0.011 | 0.061 |
| Sample 6 | 3.80 | 0.025 | 0.064 |
| Pooled | 3.603 | 0.018 | 0.054 |

The analytical results for each sample can again be pooled into a table of precision and accuracy estimates for all values reported for any individual sample. The pooled results for Tables 34-7 and 34-8 are calculated using equations 34-1 and 34-2 where precision is the root mean square deviation of all replicate analyses for any particular sample, and where accuracy is determined as the root mean square deviation between individual results and the Grand Mean of all the individual sample results (Table 34-7) or as the root mean square deviation between individual results and the True (Spiked) value for all the individual sample results (Table 34-8). The use of spiked samples allows a better comparison of precision to accuracy, as the spiked samples include the effects of systematic errors, whereas use of the Grand Mean averages the systematic errors across methods and shifts the apparent true value to include the systematic error. Table 34-8 yields a better estimate of the true precision and accuracy for the methods tested.

A simple statistical test for the presence of systematic errors can be computed using data collected as in the experimental design shown in Figure 34-2. (This method is demonstrated in the Measuring Precision without Duplicates sections of the MathCad Worksheets Collabor_GM and Collabor_TV found in Chapter 39.) The results of this test are shown in Tables 34-9 and 34-10. A systematic error is indicated by the test using


Table 34-9 Statistical test for the presence of systematic errors (using samples 1 and 2 only)

| F-test for bias | F-critical for bias |
|---|---|
| 16.53 | 9.27 |

Table 34-10 Statistical test for the presence of systematic errors (using samples 4 and 5 only)

| F-test for bias | F-critical for bias |
|---|---|
| 2.261 | 9.277 |

Samples 1 and 2, but not for Samples 4 and 5. This indicates that the difference between precision and accuracy is large enough to indicate a bias inherent within the analytical method(s). Since these are the same methods and locations tested, further evaluation is required to determine if a bias actually exists.

REFERENCES

1. Hinshaw, J.V., LC-GC 17(7), 616–625 (1999).
2. Mark, H. and Workman, J., Spectroscopy 2(2), 60–64 (1987).
3. Workman, J. and Mark, H., Spectroscopy 2(6), 58–60 (1987).
4. MathCad; MathSoft, Inc.: 101 Main Street, Cambridge, MA 02142; Vol. v. 7.0; (1997).
5. Mark, H. and Workman, J., Spectroscopy 10(1), 17–20 (1995).
6. Mark, H. and Workman, J., Spectroscopy 4(7), 53–54 (1989).
7. Youden, W. J. and Steiner, E. H., Statistical Manual of the AOAC, 1st ed. (Association of Official Analytical Chemists, Washington, DC, 1975).
8. Mark, H. and Workman, J., Statistics in Spectroscopy, 1st ed. (Academic Press, New York, 1991).


35

Collaborative Laboratory Studies: Part 2 – using ANOVA

In this chapter the use of ANOVA will be described for use in collaborative study work.

ANOVA TEST COMPARISONS FOR LABORATORIES AND METHODS (ANOVA_s4 WORKSHEET)

Analysis of Variance (ANOVA) is a useful tool for comparing sets of analytical results to determine if there is a statistically meaningful difference between results for a sample analyzed by different methods or at different locations by different analysts. The reader is referred to reference [1] and other basic books on statistical methods for discussions of the theory and applications of ANOVA; examples of such texts are [2, 3].

Table 35-1 illustrates the ANOVA results for each individual sample in our hypothetical study. This test indicates whether any of the reported results from the analytical methods or locations is significantly different from the others. From the table it can be observed that statistically significant variation in the reported analytical results is to be expected based on these data. However, there is no apparent pattern in the method or location most often varying from the others. Thus, this statistical test is inconclusive and further investigation is warranted.

Table 35-1 ANOVA: comparing methods and laboratories

| No. | F-test for bias | F-critical for bias | Difference | Bias |
|---|---|---|---|---|
| Sample 1 | 1.81 | 3.34 | – | No |
| Sample 2 | 1.21 | 3.34 | – | No |
| Sample 3 | 6.89 | 3.34 | METHOD B-LAB 1 + METHOD B-LAB 2 vs. METHOD A-LAB 1 + METHOD A-LAB 2 | Yes |
| Sample 4 | 3.28 | 3.24 | METHOD A-LAB 1 | Yes |
| Sample 5 | 10.52 | 3.24 | METHOD B-LAB 1 + METHOD A-LAB 2 vs. METHOD B-LAB 2 + METHOD A-LAB 1 | Yes |
| Sample 6 | 24.10 | 3.24 | METHOD B-LAB 2 | Yes |
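The F statistic behind a table such as Table 35-1 is the ratio of the between-group mean square to the within-group mean square. The sketch below is a generic one-way ANOVA written in plain Python; it is an assumption that this matches what the ANOVA_s4 worksheet computes, and the function name `one_way_anova_f` is hypothetical. The four groups are the Sample 6 readings of Table 34-1 (in percent), one group per method and laboratory combination, and the critical value is simply the one tabulated in Table 35-1.

```python
def one_way_anova_f(*groups):
    """Return F = (between-group mean square) / (within-group mean square)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((y - m) ** 2 for y in g) for g, m in zip(groups, means))
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Sample 6 readings from Table 34-1, in percent.
method_b_lab1 = [3.764, 3.742, 3.775, 3.767, 3.766]
method_b_lab2 = [3.860, 3.833, 3.933, 3.870, 3.810]
method_a_lab1 = [3.741, 3.740, 3.739, 3.742, 3.744]
method_a_lab2 = [3.740, 3.760, 3.730, 3.770, 3.750]

F_CRITICAL = 3.24   # value listed for Sample 6 in Table 35-1, not computed here

f_value = one_way_anova_f(method_b_lab1, method_b_lab2, method_a_lab1, method_a_lab2)
print(round(f_value, 2), f_value > F_CRITICAL)
```

Called with only two groups, the same function gives the laboratory-versus-laboratory or method-versus-method comparisons; for two groups F is the square of the two-sample t statistic.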


ANOVA test comparisons (using ANOVA_s2 worksheet)

Table 35-2 shows the ANOVA results comparing laboratories (i.e., different locations) performing the same METHOD B analytical procedure for analysis. This statistical test indicates that for the higher concentration spiked samples (i.e., 5 and 6 at 3.61 and 3.80% levels, respectively) a significant difference in reported average values occurred. However, Lab 1 was higher for Sample No. 5 and lower for Sample No. 6, so there is no apparent trend in the analytical results reported by the two labs and no indication of a systematic difference between labs using METHOD B.

Table 35-3 illustrates the ANOVA results comparing laboratories (i.e., different locations) performing the same METHOD A for analysis. This statistical test indicates that for the lower and mid-level concentration spiked samples (i.e., 4 and 5 at 3.40 and 3.61% levels, respectively) a significant difference in reported average values occurred. However, this trend did not continue for the highest concentration sample (i.e., Sample No. 6) with a concentration of 3.80%. Lab 1 was slightly lower in reported value for Samples 4 and 5. There is no significant systematic error observed between laboratories using METHOD A.

Table 35-4 reports ANOVA comparing the METHOD B procedure to the METHOD A procedure for combined laboratories. Thus the combined METHOD B analyses for each sample were compared to the combined METHOD A analyses for the same sample. This statistical test indicates whether there is a significant bias in the reported results for each method, irrespective of operator or location. An apparent trend is indicated using this statistical analysis, that trend being a positive bias for METHOD B as compared to

Table 35-2 ANOVA: comparing laboratories for METHOD B (Lab 1 vs. Lab 2) No. Sample Sample Sample Sample Sample Sample

Method 1 2 3 4 5 6

METHOD METHOD METHOD METHOD METHOD METHOD

B B B B B B

F -test for bias

F -critical for bias

Difference

Bias

0 098 199 00008 814 2091

532 532 532 532 532 532

— — — — 0.024 −0098

No

No

No

No

Yes

Yes

Table 35-3 ANOVA: comparing laboratories for METHOD A spectrophotometry (Lab 1 vs. Lab 2) No. Sample Sample Sample Sample Sample Sample

Method 1 2 3 4 5 6

METHOD METHOD METHOD METHOD METHOD METHOD

A A A A A A

F -test for bias

F -critical for bias

Difference

Bias

110 252 118 763 2952 153

5.99 5.99 5.99 5.32 5.32 5.32

— — — −0016 −0029 —

No

No

No

Yes

Yes

No

Collaborative Laboratory Studies: Part 2

181

Table 35-4 ANOVA: comparing methods for combined laboratories and operators, all Method B vs. all Method A No.

Method comparison

Sample 1

METHOD B vs. METHOD A

Sample 2

METHOD B vs. METHOD A

Sample 3

METHOD B vs. METHOD A

Sample 4

METHOD B vs. METHOD A

Sample 5 Sample 6

F -test for bias

F -critical for bias

Difference

Bias

505

4.49

0.024

Yes

193

4.49

—

No

4.49

0.041

Yes

706

4.41

0.019

Yes

METHOD B vs. METHOD A

007

4.41

—

No

METHOD B vs. METHOD A

1144

4.41

0.066

Yes

159

METHOD A. Thus METHOD B would be expected to report a higher level of analyte than METHOD A.
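As a rough illustration of the kind of comparison summarized in Table 35-3, the sketch below computes a one-way ANOVA F statistic in Python for the Sample 4 METHOD A replicates. It is not the authors' ANOVA_s2 worksheet. The replicate values are the ones listed under D2 and D3 in the ANOVA_s4 worksheet of Chapter 39, assumed here to be Lab 1 and Lab 2; the function name `one_way_anova` is hypothetical; and the critical value 5.32 is copied from the table rather than computed.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (F, df_factor, df_error) for a one-way ANOVA on a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Between-group (factor) sum of squares
    ss_factor = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group (error) sum of squares
    ss_error = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_factor = len(groups) - 1
    df_error = len(all_values) - len(groups)
    f_stat = (ss_factor / df_factor) / (ss_error / df_error)
    return f_stat, df_factor, df_error

# Sample 4 replicates as listed in the ANOVA_s4 worksheet (assumed lab assignment)
method_a_lab1 = [3.366, 3.360, 3.361, 3.362, 3.370]
method_a_lab2 = [3.380, 3.380, 3.380, 3.380, 3.380]

F_CRITICAL = 5.32  # tabulated value quoted in Table 35-3, not computed here

f_stat, df1, df2 = one_way_anova([method_a_lab1, method_a_lab2])
significant = f_stat > F_CRITICAL
print(f"F = {f_stat:.1f} (df {df1}, {df2}); significant: {significant}")
```

With these replicates the statistic comes out near 76, well above the critical value, in line with the "Yes" entered for Sample 4 in Table 35-3.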

REFERENCES

1. Mark, H. and Workman, J., Statistics in Spectroscopy, 1st ed. (Academic Press, New York, 1991).
2. Draper, N. and Smith, H., Applied Regression Analysis (John Wiley & Sons, New York, 1981).
3. Zar, J.H., Biostatistical Analysis (Prentice Hall, Englewood Cliffs, NJ, 1974).


36 Collaborative Laboratory Studies: Part 3 – Testing for Systematic Error

TESTING FOR SYSTEMATIC ERROR IN A METHOD: COMPARISON TEST FOR A SET OF MEASUREMENTS VERSUS TRUE VALUE – SPIKED RECOVERY METHOD (COMPARET WORKSHEET)

The Student’s (W.S. Gosset) t-test is useful for comparisons of the means and standard deviations of different analytical test methods. Descriptions of the theory and use of this statistic are readily available in standard statistical texts, including those in the references [1–6]. Use of this test will indicate whether the difference between a set of measurements and the true (known) value for those measurements is statistically meaningful.

In Table 36-1 the METHOD B test results for each of the locations are compared to the known spiked analyte value for each sample. This statistical test indicates that METHOD B results are lower than the known analyte values for Sample No. 5 (Lab 1 and Lab 2) and Sample No. 6 (Lab 1). The METHOD B reported value is higher for Sample No. 6 (Lab 2). Average results for this test indicate that METHOD B may result in analytical values trending lower than actual values.

In Table 36-2 the METHOD A results for each of the locations are compared to the known spiked analyte value for each sample. This statistical test indicates that METHOD A results are lower than the known analyte values for Sample Nos. 4–6 for both Lab 1 and Lab 2. Average results for this test indicate that METHOD A is consistently lower than actual values.

Table 36-1 Comparison of METHOD B test results to true value

Sample     Method–Location   t-test for bias   t-critical for bias   Difference   Bias
Sample 4   METHOD B–LAB 1    1.06              2.776                 -            No
Sample 4   METHOD B–LAB 2    0.76              2.776                 -            No
Sample 5   METHOD B–LAB 1    8.37              2.776                 −0.038       Yes
Sample 5   METHOD B–LAB 2    9.06              2.776                 −0.062       Yes
Sample 6   METHOD B–LAB 1    6.73              2.776                 −0.037       Yes
Sample 6   METHOD B–LAB 2    2.94              2.776                 0.061        Yes


Table 36-2 Comparison of METHOD A results to true value

Sample     Method–Location   t-test for bias   t-critical for bias   Difference   Bias
Sample 4   METHOD A–LAB 1    19.52             2.776                 −0.036       Yes
Sample 4   METHOD A–LAB 2    9.0               2.776                 −0.018       Yes
Sample 5   METHOD A–LAB 1    59.8              2.776                 −0.069       Yes
Sample 5   METHOD A–LAB 2    6.0               2.776                 −0.036       Yes
Sample 6   METHOD A–LAB 1    68.4              2.776                 −0.058       Yes
Sample 6   METHOD A–LAB 2    7.07              2.776                 −0.050       Yes
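The comparison against a known spiked value can be sketched as a one-sample t-test. The Python below is an illustrative sketch only, not the CompareT worksheet: the function name `t_vs_true_value` is hypothetical, the Sample 4 replicates are taken from the ANOVA_s4 data listing in Chapter 39 (the assignment of those columns to METHOD B–LAB 1 and METHOD A–LAB 1 is an assumption), and the critical value 2.776 is copied from the tables rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

def t_vs_true_value(replicates, true_value):
    """Return (difference, t) for a one-sample t-test against a known value."""
    difference = mean(replicates) - true_value
    standard_error = stdev(replicates) / sqrt(len(replicates))
    return difference, difference / standard_error

TRUE_VALUE = 3.40   # spiked level for Sample 4
T_CRITICAL = 2.776  # tabulated two-tail 5% value for 4 degrees of freedom

method_b_lab1 = [3.421, 3.377, 3.399, 3.379, 3.379]  # assumed METHOD B-LAB 1
method_a_lab1 = [3.366, 3.360, 3.361, 3.362, 3.370]  # assumed METHOD A-LAB 1

for name, data in [("METHOD B-LAB 1", method_b_lab1),
                   ("METHOD A-LAB 1", method_a_lab1)]:
    diff, t = t_vs_true_value(data, TRUE_VALUE)
    biased = abs(t) > T_CRITICAL
    print(f"{name}: difference = {diff:+.3f}, |t| = {abs(t):.2f}, biased: {biased}")
```

For these replicates |t| is roughly 1.06 for METHOD B–LAB 1 and roughly 19.5 for METHOD A–LAB 1, which agrees with the "No" and "Yes" bias entries for Sample 4.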

REFERENCES

1. MathCad; MathSoft, Inc.: 101 Main Street, Cambridge, MA 02142; Vol. v. 7.0 (1997).
2. Youden, W.J. and Steiner, E.H., Statistical Manual of the AOAC, 1st ed. (Association of Official Analytical Chemists, Washington, DC, 1975).
3. Mark, H. and Workman, J., Statistics in Spectroscopy, 1st ed. (Academic Press, New York, 1991).
4. Draper, N. and Smith, H., Applied Regression Analysis (John Wiley & Sons, New York, 1981).
5. Zar, J.H., Biostatistical Analysis (Prentice Hall, Englewood Cliffs, NJ, 1974).
6. Owen, D.B., Handbook of Statistical Tables (Addison-Wesley Publishing Co., Inc., Reading, MA, 1962).

37 Collaborative Laboratory Studies: Part 4 – Ranking Test

RANKING TEST FOR LABORATORIES AND METHODS (MANUAL COMPUTATIONS)

The ranking test for laboratories provides for the calculation of individual ranks for each laboratory or method using the averaged results collected for all replicates and all methods/locations. The summary of averaged analytical results discussed in this series is shown in Table 37-1a. These compiled results are assigned ranks by column from the largest to the smallest reported analytical values. The largest analytical result in each column receives a score of 1, whereas the smallest result receives the largest number. When two results in a column are identical, 0.5 is added to the rank number, and the subsequent number is not used. Note column 1 in Table 37-1a: both row 1 and row 2 have the identical value of 3.485 and are assigned 1.5 as rank score values. Rank 2 is not used because of the tie, and the lower analytical results are given ranks 3 and 4, respectively. The rows are summed, resulting in a rank score as column #8 of Table 37-1b.

Table 37-1a Results table for ranking test

                      Sample 1   Sample 2   Sample 3   Sample 4   Sample 5   Sample 6
L1: METHOD B–LAB 1    3.485      3.467      3.356      3.391      3.571      3.763
L2: METHOD B–LAB 2    3.485      3.506      3.379      3.391      3.548      3.861
L3: METHOD A–LAB 1    3.468      3.542      3.324      3.364      3.541      3.741
L4: METHOD A–LAB 2    3.450      3.460      3.330      3.380      3.570      3.740

Table 37-1b Ranked results table

                      Sample 1   Sample 2   Sample 3   Sample 4   Sample 5   Sample 6   Score∗
L1: METHOD B–LAB 1    1.5        2          2          1.5        1          2          10
L2: METHOD B–LAB 2    1.5        1          1          1.5        3          1          9
L3: METHOD A–LAB 1    3          3          4          4          4          3          21
L4: METHOD A–LAB 2    4          4          3          3          2          4          20

∗ If an individual laboratory score is equal to or outside of the limit boundaries, then we conclude that there is a pronounced systematic error present between the laboratory, or laboratories, with the extreme score. In this particular case the limits are 8–22.


Table 37-1c Approximate 5% two-tail limits for laboratory ranking scores (from Ref. [1])

No. of             Number of samples
locations/tests    3       4       5       6       7       8       9       10
3                  -       4–12    5–15    7–17    8–20    10–22   12–24   13–27
4                  -       4–16    6–19    8–22    10–25   12–28   14–31   16–34
5                  -       5–19    7–23    9–27    11–31   13–35   16–38   18–42
6                  3–18    5–23    7–28    10–32   12–37   15–41   18–45   21–49
7                  3–21    5–27    8–32    11–37   14–42   17–47   20–52   23–57
8                  3–24    6–30    9–36    12–42   15–48   18–54   22–59   25–65
9                  3–27    6–34    9–41    13–47   16–54   20–60   24–66   27–73
10                 4–29    7–37    10–45   14–52   17–60   21–67   26–73   30–80

The score values are compared to a statistical table of values found in reference [1]. This table is partially reproduced as Table 37-1c. If an individual laboratory score is equal to or outside of the limit boundaries, then we conclude that there is a pronounced systematic error for the laboratory, or laboratories, with the extreme score. In this particular case (four method–location combinations and six samples) the limits are 8 to 22; because every score in Table 37-1b lies inside these limits, there is no significant systematic error in the methods as determined using this test.
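The manual ranking procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `rank_descending` and `flag_extreme_scores` are hypothetical, ties are handled by averaging the tied rank positions (which reproduces the 1.5, 1.5, 3, 4 pattern described above), and only the Sample 1 and Sample 5 columns of Table 37-1a are ranked here.

```python
def rank_descending(values):
    """Rank values so the largest gets rank 1; tied values share the average rank."""
    ranks = []
    for v in values:
        larger = sum(1 for other in values if other > v)
        equal = sum(1 for other in values if other == v)  # includes v itself
        ranks.append(larger + (equal + 1) / 2)
    return ranks

def flag_extreme_scores(scores, lower, upper):
    """Return the labels whose score is equal to or outside the tabulated limits."""
    return [label for label, s in scores.items() if s <= lower or s >= upper]

labels = ["L1", "L2", "L3", "L4"]
sample_1 = [3.485, 3.485, 3.468, 3.450]  # Sample 1 column of Table 37-1a
sample_5 = [3.571, 3.548, 3.541, 3.570]  # Sample 5 column of Table 37-1a
print(dict(zip(labels, rank_descending(sample_1))))
print(dict(zip(labels, rank_descending(sample_5))))

# Score column of Table 37-1b, checked against the 8-22 limits of Table 37-1c
scores = {"L1": 10, "L2": 9, "L3": 21, "L4": 20}
print(flag_extreme_scores(scores, lower=8, upper=22))
```

The last call returns an empty list for the scores of Table 37-1b, matching the conclusion that no laboratory or method falls on or outside the limits.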

REFERENCE

1. Youden, W.J. and Steiner, E.H., Statistical Manual of the AOAC, 1st ed. (Association of Official Analytical Chemists, Washington, DC, 1975).

38 Collaborative Laboratory Studies: Part 5 – Efficient Comparison of Two Methods

COMPUTATIONS FOR EFFICIENT COMPARISON OF TWO METHODS (COMP_METH WORKSHEET)

The section following shows a statistical test (text for the Comp_Meth MathCad Worksheet) for the efficient comparison of two analytical methods. This test requires that replicate measurements be made on two different samples using two different analytical methods. The test will determine whether there is a significant difference in the precision and accuracy for the two methods. It will also determine whether there is significant systematic error between the methods, and calculate the magnitude of that error (as bias). This efficient statistical test requires the minimum data collection and analysis for the comparison of two methods.

The experimental design for data collection has been shown graphically in Chapter 35 (Figure 35-2), with the numerical data for this test given in Table 38-1. Two methods are used to analyze two different samples, with approximately five replicate measurements per sample as shown graphically in the previously mentioned figure (Table 38-1 lists four replicates per sample, followed by their mean).

Table 38-1 Analytical data entry for comparison of two methods tests

        METHOD A                METHOD B
        Sample X    Sample Y    Sample X    Sample Y
        3.366       3.741       3.421       3.764
        3.380       3.740       3.407       3.860
        3.360       3.740       3.377       3.742
        3.380       3.760       3.400       3.833
Mean    3.372       3.745       3.401       3.800

The analytical results can immediately be plotted using the Youden/Steiner two-sample graphic shown in Figure 38-1. This graphic gives a rapid method for visually determining if the reported analytical values contain systematic error. The presence of systematic error is indicated by two-sample plot points that fall in the lower left and upper right quadrants of the charts. Points in these quadrants indicate that high analyte value samples are biased to the high end, and low analyte containing samples are biased to the low end. Analytical methods not exhibiting systematic (bias) errors should have two-sample plot points randomly distributed throughout all the quadrants of the chart. Figure 38-1 gives an indication that METHOD A has a negative bias, and METHOD B is more random. However, the range of the axes is much lower for Method A, indicating that the overall bias is quite small, and significantly less than for Method B.

[Figure 38-1: two panels, METHOD A (AY plotted against AX, with mean(AX) and mean(AY) marked) and METHOD B (BY plotted against BX, with mean(BX) and mean(BY) marked).]

Figure 38-1 Two-sample charts illustrating systematic errors for Methods A vs. B.

The calculations for the efficient two-method comparison are shown in Table 38-2 and the equations that follow. The mathematical expressions are given in MathCad symbolic notation, showing that the difference is taken for each replicate set of X and Y and the mean is computed. Then the sum for each replicate set of X and Y is calculated and the mean is computed. The difference in the sums is computed (as d) and the differences are summed and reported as an absolute value (as Σd). The mean difference is calculated as mean(d).

Table 38-2 Calculations for comparison tests

METHOD A:
ADxy := |AX − AY|       mean(ADxy) = 0.374
ATxy := (AX + AY)       mean(ATxy) = 7.117

METHOD B:
BDxy := |BX − BY|       mean(BDxy) = 0.399
BTxy := (BX + BY)       mean(BTxy) = 7.201

d := |ATxy − BTxy|      Σd = 0.337
Mean Difference: mean(d) = 0.084
d2 := BTxy − ATxy

Each X and Y result contains the systematic error of the analytical method for its respective laboratory, noting that the systematic error is assumed to be identical for X and Y for each method. When the difference between X and Y is calculated (ADxy and BDxy in Table 38-2) the systematic error drops out, so that this difference contains no systematic errors, only random errors. We then estimate the precision by using the difference quantities. The difference between the true analyte concentrations of X and Y represents the true analyte difference between X and Y without the systematic error, but with the random errors.

The relative precision between the two methods is calculated using Table 38-2 and equations 38-1 and 38-2. The F-statistic used to compare the sizes of the Method A vs. Method B precision values is given by equation 38-5 and is compared to the F-statistic table value (equation 38-7). The null (Ho) hypothesis states that there is no difference in the precision of the two methods, whereas the alternate hypothesis (Ha) indicates that there is a difference in the precision. For the methods compared in this study there is a significantly larger precision value (i.e., poorer precision) for METHOD B as compared to METHOD A. Method A precision is 0.007, whereas Method B precision is 0.037, representing a 5.3 factor increase.

When summing the X and Y values, the systematic contribution is found twice. The two used in the denominator is indicative of the error contribution from each independent set of results (i.e., X and Y). Given independent random errors only, the standard deviation of the sum of two measurements X and Y would be identical to the standard deviation of the differences between the two measurements X and Y. In the absence of any systematic error, Sr² and Sd² estimate the same standard deviation. In the presence of systematic error, Sd² is large compared to Sr². The larger the Sd², the greater is the systematic error contribution.

The relative systematic error between the two methods is calculated using Table 38-2 and equations 38-3 and 38-4. The F-statistic used to compare the sizes of the Method A vs. Method B systematic error values is given by equation 38-6 and is compared to the F-statistic table value (equation 38-7). The null (Ho) hypothesis states that there is no difference in the systematic error found in the two methods, whereas the alternate hypothesis (Ha) indicates that there is a difference in the size of the systematic error. For the methods compared in this study there is a significantly larger systematic error for METHOD B as compared to METHOD A.

The test to determine whether the bias is significant incorporates the Student’s t-test. The method for calculating the t-test statistic is shown in equation 38-10 using MathCad symbolic notation. Equation 38-8 is used to calculate the standard deviation of the differences between the sums of X and Y for the two analytical methods A and B, whereas equation 38-9 is used to calculate the standard deviation of the mean. The t-table statistic for comparison with the test statistic is given in equations 38-11 and 38-12. The F-statistic and t-statistic tables can be found in standard statistical texts such as references [1–3]. The null hypothesis (Ho) states that there is no systematic difference between the two methods, whereas the alternate hypothesis (Ha) states that there is a significant systematic difference between the methods. It can be seen from these results that the bias is significant between these two methods and that METHOD B has results biased by 0.084 above the results obtained by METHOD A. The estimated bias is given by the Mean Difference calculation.

Measuring the Precision and Standard Deviation of the Methods (Youden/Steiner)

Note that for the calculations of precision and standard deviation (equations 38-1 through 38-4), the denominator expression is given as 2(n − 1). This expression is used because of the twofold error contribution from the independent errors found in each independent set (i.e., X and Y) of results.


Precision (Sr)

$$\mathrm{ASr} := \sqrt{\frac{1}{2\,(nY - 1)} \sum \left(\mathrm{ADxy} - \mathrm{mean}(\mathrm{ADxy})\right)^{2}}$$ (38-1)

ASr = 6.692658 · 10⁻³

$$\mathrm{BSr} := \sqrt{\frac{1}{2\,(nY - 1)} \sum \left(\mathrm{BDxy} - \mathrm{mean}(\mathrm{BDxy})\right)^{2}}$$ (38-2)

BSr = 0.037334

Standard deviation (Sd)

$$\mathrm{ASd} := \sqrt{\frac{1}{2\,(nY - 1)} \sum \left(\mathrm{ATxy} - \mathrm{mean}(\mathrm{ATxy})\right)^{2}}$$ (38-3)

ASd = 0.012428

$$\mathrm{BSd} := \sqrt{\frac{1}{2\,(nY - 1)} \sum \left(\mathrm{BTxy} - \mathrm{mean}(\mathrm{BTxy})\right)^{2}}$$ (38-4)

BSd = 0.045387

F-statistic calculation (Fs) for precision ratio

Sr² Ratio:

$$\mathrm{PFs} := \frac{\mathrm{BSr}^{2}}{\mathrm{ASr}^{2}}$$ (38-5)

PFs = 31.118

Ho: If Fs is less than or equal to Ft, then there is NO DIFFERENCE in Precision estimation.
Ha: If Fs is greater than Ft, then there is a DIFFERENCE in Precision estimation.

F-statistic calculation (Fs) for presence of systematic errors

Sd² Ratio:

$$\mathrm{SFs} := \frac{\mathrm{BSd}^{2}}{\mathrm{ASd}^{2}}$$ (38-6)

SFs = 13.337


Ho: If Fs is less than or equal to Ft, then there is NO DIFFERENCE in systematic error.
Ha: If Fs is greater than Ft, then there is a DIFFERENCE in systematic error.

F-statistic table value (Ft)

df1 := nY − 1        df1 = 3
qF(0.95, df1, df1) = 9.277 (38-7)

Student’s t-test for the difference in the biases between two methods

Mean Difference: mean(d) = 0.084

$$s := \sqrt{\frac{1}{df1} \sum \left(d2 - \mathrm{mean}(d)\right)^{2}}$$ (38-8)

s = 0.053

$$sm := \frac{s}{\sqrt{nY}}$$ (38-9)

sm = 0.026

Calculate t-test statistic:

$$Te := \frac{\mathrm{mean}(d)}{sm}$$ (38-10)

Te = 3.201

Enter alpha value as α2: α2 := 0.95

Calculate t-table value:

$$\alpha_1 := \frac{\alpha_2 + 1}{2}$$ (38-11)

α1 = 0.975

t := qt(α1, df1)        t-table value: t = 3.182 (38-12)

Ho: If Te is less than or equal to the t-table value, then there is NO SYSTEMATIC DIFFERENCE between method results.
Ha: If Te is greater than t, then there is a SYSTEMATIC DIFFERENCE (BIAS) between method results.
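The chain of calculations in Table 38-2 and equations 38-1 through 38-10 can be retraced with a short Python sketch. This is an illustration, not the Comp_Meth worksheet itself: the helper name `youden_sd` and the variable names are hypothetical, the four replicates per sample are taken from Table 38-1, and the table values 9.277 and 3.182 are copied from equations 38-7 and 38-12 instead of being computed from the F and t distributions.

```python
from math import sqrt
from statistics import mean

# Replicates from Table 38-1 (the Mean row is omitted)
AX = [3.366, 3.380, 3.360, 3.380]
AY = [3.741, 3.740, 3.740, 3.760]
BX = [3.421, 3.407, 3.377, 3.400]
BY = [3.764, 3.860, 3.742, 3.833]

def youden_sd(values):
    """sqrt(sum((v - mean)^2) / (2 (n - 1))), the form of equations 38-1 to 38-4."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (2 * (len(values) - 1)))

n = len(AY)
a_diff = [x - y for x, y in zip(AX, AY)]  # sign does not affect the spread
b_diff = [x - y for x, y in zip(BX, BY)]
a_sum = [x + y for x, y in zip(AX, AY)]
b_sum = [x + y for x, y in zip(BX, BY)]

a_sr, b_sr = youden_sd(a_diff), youden_sd(b_diff)  # precision, eqs 38-1, 38-2
a_sd, b_sd = youden_sd(a_sum), youden_sd(b_sum)    # std. deviation, eqs 38-3, 38-4
precision_f = b_sr ** 2 / a_sr ** 2                # equation 38-5
systematic_f = b_sd ** 2 / a_sd ** 2               # equation 38-6

d2 = [b - a for a, b in zip(a_sum, b_sum)]         # BTxy - ATxy
mean_d = mean(d2)
s = sqrt(sum((d - mean_d) ** 2 for d in d2) / (n - 1))  # equation 38-8
s_m = s / sqrt(n)                                       # equation 38-9
t_e = mean_d / s_m                                      # equation 38-10

F_TABLE, T_TABLE = 9.277, 3.182  # copied from equations 38-7 and 38-12
print(f"ASr={a_sr:.6f} BSr={b_sr:.6f} ASd={a_sd:.6f} BSd={b_sd:.6f}")
print(f"PFs={precision_f:.3f} SFs={systematic_f:.3f} Te={t_e:.3f}")
print(precision_f > F_TABLE, systematic_f > F_TABLE, t_e > T_TABLE)
```

Run on the Table 38-1 data, the sketch should land on the same figures as the worksheet (PFs near 31.1, SFs near 13.3, Te near 3.20), so all three null hypotheses are rejected. Note how narrowly Te exceeds the t-table value.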

SUMMARY

This set of articles presents the computational details and actual values for each of the statistical methods shown for collaborative tests. These methods include the use of precision and estimated accuracy comparisons, ANOVA tests, Student’s t-testing, the Rank Test for Method Comparison, and the Efficient Comparison of Methods tests. From using these statistical tests the following conclusions can be derived:

1. Both analytical methods are quite precise and accurate, therefore the production samples are below target value concentration.
2. The precision for METHOD B is significantly larger than METHOD A, indicating METHOD A is more precise than METHOD B.
3. There is no correlation of analytical error with concentration over the range tested (i.e., 3.40–3.80% analyte).
4. Analytical results comparing METHOD B and METHOD A will show significant variation due to the high precision of both analytical methods.
5. There is no operator/laboratory bias between labs for METHOD B.
6. There is no operator/laboratory bias between labs for METHOD A.
7. There is a significant bias between METHOD B and METHOD A; METHOD B yields higher results.
8. Both METHOD B and METHOD A results trend lower than actual values, but by small quantities (approximately −0.04% at the target value of 3.60%).
9. The laboratory ranking test did not show any laboratory or method outside of confidence limits, therefore neither method nor laboratory is consistently high or low in reported results.
10. METHOD B precision is a factor of 5.3 times greater than that of METHOD A.
11. The systematic error contribution is larger for METHOD B than METHOD A.
12. METHOD B is biased to +0.084 as compared to METHOD A.

ACKNOWLEDGEMENT

The real analytical data used for Chapters 34–38 was graciously provided by Dan Devine of Kimberly-Clark Analytical Science & Technology.

REFERENCES

1. MathCad; MathSoft, Inc.: 101 Main Street, Cambridge, MA 02142 (1997).
2. Mark, H. and Workman, J., Spectroscopy 10(1), 17–20 (1995).
3. Mark, H. and Workman, J., Spectroscopy 4(7), 53–54 (1989).

39 Collaborative Laboratory Studies: Part 6 – MathCad Worksheet Text

The MathCad worksheets used for this Chemometrics in Spectroscopy collaborative study series are given below in hard copy format. Unless otherwise noted, the worksheets have been written by the authors. The text files for the MathCad v7.0 Worksheets used for the statistical tests in this report are attached as Collabor_GM, Collabor_TV, ANOVA_s4, ANOVA_s2, CompareT, and Comp_Meth. References [1–11] are excellent sources of information on the details of these statistical methods.

Collabor_GM

Collaborative Test Worksheet
-------------------------

RAW DATA ENTRY:

X01 := (3.51, 3.46, 3.47, 3.50, 3.49)
X02 := (3.51, 3.50, 3.50, 3.47, 3.45)
X03 := (3.46, 3.44, 3.46, 3.52, 3.46)
X04 := (3.46, 3.44, 3.45)

X05 := (3.48, 3.45, 3.46, 3.46, 3.48)
X06 := (3.50, 3.66, 3.47, 3.45, 3.45)
X07 := (3.45, 3.45, 3.46, 3.46, 3.46)
X08 := (3.46, 3.47, 3.45)

X09 := (3.37, 3.36, 3.35, 3.35, 3.35)
X10 := (3.37, 3.33, 3.39, 3.43, 3.38)
X11 := (3.32, 3.33, 3.33, 3.32, 3.32)
X12 := (3.34, 3.32, 3.34)

Mean values for Data:

n01 := rows(X01)    mean(X01) = 3.485
n02 := rows(X02)    mean(X02) = 3.485
n03 := rows(X03)    mean(X03) = 3.468
n04 := rows(X04)    mean(X04) = 3.45

n05 := rows(X05)    mean(X05) = 3.467
n06 := rows(X06)    mean(X06) = 3.506
n07 := rows(X07)    mean(X07) = 3.452
n08 := rows(X08)    mean(X08) = 3.46

n09 := rows(X09)    mean(X09) = 3.356
n10 := rows(X10)    mean(X10) = 3.379
n11 := rows(X11)    mean(X11) = 3.324
n12 := rows(X12)    mean(X12) = 3.3303


--------------------------------------------------------

GRAND MEANS FOR EACH ROW (USE IF NO “TRUE VALUE” IS AVAILABLE):

$$GM1 := \frac{\mathrm{mean}(X01) + \mathrm{mean}(X02) + \mathrm{mean}(X03) + \mathrm{mean}(X04)}{4}$$

$$GM2 := \frac{\mathrm{mean}(X05) + \mathrm{mean}(X06) + \mathrm{mean}(X07) + \mathrm{mean}(X08)}{4}$$

$$GM3 := \frac{\mathrm{mean}(X09) + \mathrm{mean}(X10) + \mathrm{mean}(X11) + \mathrm{mean}(X12)}{4}$$

GRAND MEANS FOR EACH ROW:

GM1 = 3.472    GM2 = 3.47115    GM3 = 3.347433

COMPUTATIONS FOR PRECISION AND ACCURACY:

Precision (the same expression is evaluated for each data set X01 through X12, shown here for X01):

$$SDp(X01) := \sqrt{\frac{1}{n01 - 1} \sum \left(X01 - \mathrm{mean}(X01)\right)^{2}}$$

SDp(X01) = 0.02           SDp(X02) = 0.025
SDp(X03) = 8.888 · 10⁻³   SDp(X04) = 8.888 · 10⁻³
SDp(X05) = 0.013          SDp(X06) = 0.088
SDp(X07) = 6.557 · 10⁻³   SDp(X08) = 0.01
SDp(X09) = 7.918 · 10⁻³   SDp(X10) = 0.037
SDp(X11) = 6.812 · 10⁻³   SDp(X12) = 0.012

Accuracy (deviations are taken from the grand mean of the row: GM1 for X01 through X04, GM2 for X05 through X08, GM3 for X09 through X12; shown here for X01):

$$SDa(X01) := \sqrt{\frac{1}{n01 - 1} \sum \left(X01 - GM1\right)^{2}}$$

SDa(X01) = 0.025    SDa(X02) = 0.029
SDa(X03) = 0.029    SDa(X04) = 0.029
SDa(X05) = 0.014    SDa(X06) = 0.096
SDa(X07) = 0.031    SDa(X08) = 0.017
SDa(X09) = 0.012    SDa(X10) = 0.051
SDa(X11) = 0.037    SDa(X12) = 0.024

Pooled Standard Deviations (As Precision):

Row 1:

$$SpR1 := \sqrt{\frac{\sum (X01 - \mathrm{mean}(X01))^{2} + \sum (X02 - \mathrm{mean}(X02))^{2} + \sum (X03 - \mathrm{mean}(X03))^{2} + \sum (X04 - \mathrm{mean}(X04))^{2}}{n01 + n02 + n03 + n04 - 4}}$$

SpR1 = 0.0231474

Row 2:

$$SpR2 := \sqrt{\frac{\sum (X05 - \mathrm{mean}(X05))^{2} + \sum (X06 - \mathrm{mean}(X06))^{2} + \sum (X07 - \mathrm{mean}(X07))^{2} + \sum (X08 - \mathrm{mean}(X08))^{2}}{n05 + n06 + n07 + n08 - 4}}$$

SpR2 = 0.0478817

Row 3:

$$SpR3 := \sqrt{\frac{\sum (X09 - \mathrm{mean}(X09))^{2} + \sum (X10 - \mathrm{mean}(X10))^{2} + \sum (X11 - \mathrm{mean}(X11))^{2} + \sum (X12 - \mathrm{mean}(X12))^{2}}{n09 + n10 + n11 + n12 - 4}}$$

SpR3 = 0.021

Pooled Standard Deviations (As Accuracy):

Row 1:

$$SpR1 := \sqrt{\frac{\sum (X01 - GM1)^{2} + \sum (X02 - GM1)^{2} + \sum (X03 - GM1)^{2} + \sum (X04 - GM1)^{2}}{n01 + n02 + n03 + n04 - 4}}$$

SpR1 = 0.0277715

Row 2:

$$SpR2 := \sqrt{\frac{\sum (X05 - GM2)^{2} + \sum (X06 - GM2)^{2} + \sum (X07 - GM2)^{2} + \sum (X08 - GM2)^{2}}{n05 + n06 + n07 + n08 - 4}}$$

SpR2 = 0.0537719

Row 3:

$$SpR3 := \sqrt{\frac{\sum (X09 - GM3)^{2} + \sum (X10 - GM3)^{2} + \sum (X11 - GM3)^{2} + \sum (X12 - GM3)^{2}}{n09 + n10 + n11 + n12 - 4}}$$

SpR3 = 0.033
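The precision, accuracy, and pooled standard deviation expressions of this worksheet can be sketched in Python as follows. This is an illustrative reading of the formulas, not a transcription of the MathCad file: the function names `sd_about` and `pooled_sd` are hypothetical, and only Row 1 is evaluated. The replicates are used as printed (rounded to two decimals), so the results agree with the worksheet values only approximately.

```python
from math import sqrt
from statistics import mean

def sd_about(values, center):
    """sqrt(sum((x - center)^2) / (n - 1)).

    SDp when center is the set's own mean, SDa when center is the grand
    mean of the row (or a known true value)."""
    return sqrt(sum((x - center) ** 2 for x in values) / (len(values) - 1))

def pooled_sd(groups, centers):
    """Pool squared deviations over all groups; df = total n minus number of groups."""
    ss = sum((x - c) ** 2 for g, c in zip(groups, centers) for x in g)
    df = sum(len(g) for g in groups) - len(groups)
    return sqrt(ss / df)

# Row 1 of the Collabor_GM worksheet, as printed (two decimals)
row1 = [
    [3.51, 3.46, 3.47, 3.50, 3.49],  # X01
    [3.51, 3.50, 3.50, 3.47, 3.45],  # X02
    [3.46, 3.44, 3.46, 3.52, 3.46],  # X03
    [3.46, 3.44, 3.45],              # X04
]
set_means = [mean(g) for g in row1]
gm1 = mean(set_means)  # grand mean of the four set means

sdp_x01 = sd_about(row1[0], set_means[0])  # precision of X01
sda_x01 = sd_about(row1[0], gm1)           # accuracy of X01 about GM1
sp_precision = pooled_sd(row1, set_means)
sp_accuracy = pooled_sd(row1, [gm1] * len(row1))
print(f"GM1={gm1:.4f} SDp(X01)={sdp_x01:.3f} SDa(X01)={sda_x01:.3f}")
print(f"pooled precision={sp_precision:.4f} pooled accuracy={sp_accuracy:.4f}")
```

Passing a true value instead of the grand mean as `center` gives the accuracy form used when spiked recovery samples are available.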


Measuring Precision without Duplicates (Youden/Steiner):
------------------------------------------------

RAW DATA ENTRY

(Enter single Determinations for Sample X from different laboratories or operators):

Sample X    X :=
LAB #1      3.51
LAB #2      3.51
LAB #3      3.46
LAB #4      3.46

nX := rows(X)    mean(X) = 3.484

(Enter single Determinations for Sample Y from different laboratories or operators):

Sample Y    Y :=
LAB #1      3.48
LAB #2      3.50
LAB #3      3.45
LAB #4      3.46

nY := rows(Y)    mean(Y) = 3.47

[Figure: two-sample chart of Y plotted against X, with mean(X) and mean(Y) marked.]

Two-sample Chart Illustrating systematic errors


CALCULATIONS:

Dxy := (X − Y)    Txy := (X + Y)
mean(Dxy) = 0.014    mean(Txy) = 6.955

Precision (Sr):

$$Sr := \sqrt{\frac{1}{2\,(nY - 1)} \sum \left(Dxy - \mathrm{mean}(Dxy)\right)^{2}}$$

Sr = 8.276473 · 10⁻³

Measuring the Standard Deviation of the Data (Youden/Steiner):
-----------------------------------------------------

Standard Deviation (Sd):

$$Sd := \sqrt{\frac{1}{2\,(nY - 1)} \sum \left(Txy - \mathrm{mean}(Txy)\right)^{2}}$$

Sd = 0.033653

Statistical Test for presence of systematic errors (Youden/Steiner):
------------------------------------------------------

F-statistic Calculation (Fs):

$$Fs := \frac{Sd^{2}}{Sr^{2}}$$

Fs = 16.533

F-statistic Table Value (Ft):

df1 := nY − 1    df1 = 3
qF(0.95, df1, df1) = 9.277

Test Criteria:
If Fs is less than or equal to Ft, then there is NO SYSTEMATIC ERROR
If Fs is greater than Ft, then there is SYSTEMATIC ERROR (BIAS)

Standard Deviation estimate for the distribution of systematic errors (Sb2):

$$Sb2 := \frac{Sd^{2} - Sr^{2}}{2}$$

Sb2 = 5.32 · 10⁻⁴
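For a single method measured once per laboratory on samples X and Y, the systematic error test and the Sb2 estimate can be sketched as below. The function name `youden_two_sample` is hypothetical and the table value 9.277 is copied from the worksheet. Because the X and Y entries are used as printed (two decimals), the statistics differ somewhat from the worksheet figures, although the test decision is the same.

```python
from math import sqrt
from statistics import mean

def youden_two_sample(x, y, f_table):
    """Return Sr, Sd, Fs, Sb2 and the test decision for paired single determinations."""
    n = len(y)
    dxy = [a - b for a, b in zip(x, y)]  # differences: random error only
    txy = [a + b for a, b in zip(x, y)]  # sums: random plus systematic error
    sr2 = sum((d - mean(dxy)) ** 2 for d in dxy) / (2 * (n - 1))
    sd2 = sum((t - mean(txy)) ** 2 for t in txy) / (2 * (n - 1))
    fs = sd2 / sr2
    sb2 = (sd2 - sr2) / 2                # estimate for the systematic error distribution
    return sqrt(sr2), sqrt(sd2), fs, sb2, fs > f_table

X = [3.51, 3.51, 3.46, 3.46]  # Sample X, LAB #1 to LAB #4
Y = [3.48, 3.50, 3.45, 3.46]  # Sample Y, LAB #1 to LAB #4

sr, sd, fs, sb2, has_systematic_error = youden_two_sample(X, Y, f_table=9.277)
print(f"Sr={sr:.6f} Sd={sd:.6f} Fs={fs:.3f} Sb2={sb2:.3e}")
print("systematic error:", has_systematic_error)
```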


Collabor_TV

Collaborative Test Worksheet
-------------------------

RAW DATA ENTRY:

X01 := (3.42, 3.38, 3.40, 3.38, 3.38)
X02 := (3.41, 3.40, 3.42, 3.35, 3.38)
X03 := (3.37, 3.36, 3.36, 3.36, 3.37)
X04 := (3.38, 3.38, 3.38, 3.38, 3.38)

X05 := (3.56, 3.57, 3.56, 3.58, 3.59)
X06 := (3.54, 3.55, 3.57, 3.53, 3.54)
X07 := (3.54, 3.54, 3.54, 3.54, 3.54)
X08 := (3.56, 3.58, 3.59, 3.58, 3.56)

X09 := (3.76, 3.74, 3.77, 3.77, 3.77)
X10 := (3.86, 3.83, 3.93, 3.87, 3.81)
X11 := (3.74, 3.74, 3.74, 3.74, 3.74)
X12 := (3.74, 3.76, 3.73, 3.77, 3.75)

Mean Values for Data Rows:

n01 := rows(X01)    mean(X01) = 3.391
n02 := rows(X02)    mean(X02) = 3.391
n03 := rows(X03)    mean(X03) = 3.364
n04 := rows(X04)    mean(X04) = 3.38

n05 := rows(X05)    mean(X05) = 3.571
n06 := rows(X06)    mean(X06) = 3.548
n07 := rows(X07)    mean(X07) = 3.541
n08 := rows(X08)    mean(X08) = 3.574

n09 := rows(X09)    mean(X09) = 3.763
n10 := rows(X10)    mean(X10) = 3.861
n11 := rows(X11)    mean(X11) = 3.741
n12 := rows(X12)    mean(X12) = 3.75

----------------------------------------------------------

ENTER TRUE VALUES FOR EACH ROW (SPIKED RECOVERY SAMPLES):

TR1 := 3.40    TR2 := 3.61    TR3 := 3.80

COMPUTATIONS FOR PRECISION AND ACCURACY:

Precision (the same expression is evaluated for each data set X01 through X12, shown here for X01):

$$SDp(X01) := \sqrt{\frac{1}{n01 - 1} \sum \left(X01 - \mathrm{mean}(X01)\right)^{2}}$$

SDp(X01) = 0.019          SDp(X02) = 0.025
SDp(X03) = 0              SDp(X04) = 0
SDp(X05) = 0.01           SDp(X06) = 0.015
SDp(X07) = 2.588 · 10⁻³   SDp(X08) = 0.013
SDp(X09) = 0.012          SDp(X10) = 0.047
SDp(X11) = 1.924 · 10⁻³   SDp(X12) = 0.016

Accuracy (deviations are taken from the true value of the row: TR1 for X01 through X04, TR2 for X05 through X08, TR3 for X09 through X12; shown here for X01):

$$SDa(X01) := \sqrt{\frac{1}{n01 - 1} \sum \left(X01 - TR1\right)^{2}}$$

SDa(X01) = 0.022    SDa(X02) = 0.027
SDa(X03) = 0.041    SDa(X04) = 0.022
SDa(X05) = 0.044    SDa(X06) = 0.071
SDa(X07) = 0.077    SDa(X08) = 0.042
SDa(X09) = 0.043    SDa(X10) = 0.083
SDa(X11) = 0.066    SDa(X12) = 0.058


Pooled Standard Deviations (As Precision):

Row 1:

$$SpR1 := \sqrt{\frac{\sum (X01 - \mathrm{mean}(X01))^{2} + \sum (X02 - \mathrm{mean}(X02))^{2} + \sum (X03 - \mathrm{mean}(X03))^{2} + \sum (X04 - \mathrm{mean}(X04))^{2}}{n01 + n02 + n03 + n04 - 4}}$$

SpR1 = 0.0159961

Row 2:

$$SpR2 := \sqrt{\frac{\sum (X05 - \mathrm{mean}(X05))^{2} + \sum (X06 - \mathrm{mean}(X06))^{2} + \sum (X07 - \mathrm{mean}(X07))^{2} + \sum (X08 - \mathrm{mean}(X08))^{2}}{n05 + n06 + n07 + n08 - 4}}$$

SpR2 = 0.0114967

Row 3:

$$SpR3 := \sqrt{\frac{\sum (X09 - \mathrm{mean}(X09))^{2} + \sum (X10 - \mathrm{mean}(X10))^{2} + \sum (X11 - \mathrm{mean}(X11))^{2} + \sum (X12 - \mathrm{mean}(X12))^{2}}{n09 + n10 + n11 + n12 - 4}}$$

SpR3 = 0.025

Pooled Standard Deviations (As Accuracy):

Row 1:

$$SpR1 := \sqrt{\frac{\sum (X01 - TR1)^{2} + \sum (X02 - TR1)^{2} + \sum (X03 - TR1)^{2} + \sum (X04 - TR1)^{2}}{n01 + n02 + n03 + n04 - 4}}$$

SpR1 = 0.0289623

Row 2:

$$SpR2 := \sqrt{\frac{\sum (X05 - TR2)^{2} + \sum (X06 - TR2)^{2} + \sum (X07 - TR2)^{2} + \sum (X08 - TR2)^{2}}{n05 + n06 + n07 + n08 - 4}}$$

SpR2 = 0.0608954

Row 3:

$$SpR3 := \sqrt{\frac{\sum (X09 - TR3)^{2} + \sum (X10 - TR3)^{2} + \sum (X11 - TR3)^{2} + \sum (X12 - TR3)^{2}}{n09 + n10 + n11 + n12 - 4}}$$

SpR3 = 0.064

Measuring Precision without Duplicates (Youden/Steiner):

-----------------------------------------------

RAW DATA ENTRY

(Enter single Determinations for Sample X from different laboratories or

operators):

Sample X LAB LAB LAB LAB

#1 #2 #3 #4

X:=

3.42 3.41 3.37 3.38

nX� = rows�X� mean�X� = 3�394 (Enter single Determinations for Sample Y from different laboratories or operators): Sample Y LAB LAB LAB LAB

#1 #2 #3 #4

Y:=

nY� = rows�Y� mean�Y� = 3�551

CALCULATIONS: Dxy�= �X − Y� Txy�= �X + Y� mean�Dxy� = −0�157 mean�Txy� = 6�944

3.56 3.54 3.54 3.56

208

Chemometrics in Spectroscopy

3.56

mean(Y) 3.55

Y

3.54

3.53

3.36 3.38

3.4

3.42

3.44

mean(X), X

Two-sample Chart illustrating systematic errors

Precision (Sr):

$$Sr := \sqrt{\frac{1}{2 \cdot (nY - 1)} \cdot \sum (Dxy - \mathrm{mean}(Dxy))^2}$$

Sr = 0.015805

Measuring the Standard Deviation of the Data (Youden/Steiner):

-----------------------------------------------------

Standard Deviation (Sd):

$$Sd := \sqrt{\frac{1}{2 \cdot (nY - 1)} \cdot \sum (Txy - \mathrm{mean}(Txy))^2}$$

Sd = 0.023765

Statistical Test for presence of systematic errors (Youden/Steiner):

------------------------------------------------------

F-statistic Calculation (Fs):

$$Fs := \frac{Sd^2}{Sr^2}$$

Fs = 2.261

F-statistic Table Value (Ft):

df1 := nY − 1    df1 = 3    qF(0.95, df1, df1) = 9.277

If Fs is less than or equal to Ft, then there is NO SYSTEMATIC ERROR.
If Fs is greater than Ft, then there is SYSTEMATIC ERROR (BIAS).

Standard Deviation estimate for the distribution of systematic errors (Sb2):

$$Sb2 := \frac{Sd^2 - Sr^2}{2}$$

Sb2 = 1.575 · 10⁻⁴
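The worksheet above can be mirrored in a few lines of Python. This is only a sketch under stated assumptions: the data are the values as displayed in the worksheet (rounded to two decimals), so the results differ slightly from the printed Sr, Sd and Fs, which were apparently computed from unrounded entries. The helper name `youden_sd` is ours, and the table value 9.277 is copied from the worksheet rather than computed.

```python
import math

# Single determinations as displayed in the worksheet (rounded to 2 decimals).
x = [3.42, 3.41, 3.37, 3.38]   # Sample X, labs 1-4
y = [3.56, 3.54, 3.54, 3.56]   # Sample Y, labs 1-4
F_TABLE = 9.277                # qF(0.95, 3, 3), value quoted in the worksheet


def mean(values):
    return sum(values) / len(values)


def youden_sd(values):
    """sqrt(sum((v - mean)^2) / (2 * (n - 1))): the form used for both Sr and Sd."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (2 * (len(values) - 1)))


dxy = [a - b for a, b in zip(x, y)]   # differences X - Y
txy = [a + b for a, b in zip(x, y)]   # totals X + Y

sr = youden_sd(dxy)                   # precision estimate
sd = youden_sd(txy)                   # standard deviation of the data
fs = sd ** 2 / sr ** 2                # F statistic
systematic_error = fs > F_TABLE       # decision rule stated in the worksheet
sb2 = (sd ** 2 - sr ** 2) / 2         # systematic-error variance estimate

print(f"Sr = {sr:.6f}  Sd = {sd:.6f}  Fs = {fs:.3f}  bias detected: {systematic_error}")
```

With these rounded inputs Fs stays well below the table value, which agrees with the worksheet's conclusion of no detectable systematic error.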


ANOVA_s4

ANOVA (Analysis of Variance) Test

-------------------------------------------------------

This Worksheet demonstrates using Mathcad’s F distribution function and programming operators to conduct an analysis of variance (ANOVA) test.

Enter sample data used in test:

An element of D represents the data collected with a particular factor.

Data Entry:

D0       D1       D2       D3
3.421    3.407    3.366    3.380
3.377    3.400    3.360    3.380
3.399    3.417    3.361    3.380
3.379    3.353    3.362    3.380
3.379    3.380    3.370    3.380

Enter level of significance α:    α := 0.05

Program for conducting ANOVA test:

ANOVA(D, α) :=
    n_total ← 0
    SX ← 0
    SX2 ← 0
    T ← 0
    for i ∈ 0 .. last(D)
        SD_i ← Σ D_i
        nD_i ← length(D_i)
        SX ← SX + SD_i
        SX2 ← SX2 + D_i · D_i
        T ← T + (SD_i)² / nD_i
        n_total ← n_total + nD_i
    SS_factor ← T − SX² / n_total
    SS_error ← SX2 − T
    SS_total ← SX2 − SX² / n_total
    df_factor ← length(D) − 1
    df_error ← n_total − length(D)
    df_total ← n_total − 1
    Analysis_0 ← [ SS_factor   df_factor   SS_factor / df_factor
                   SS_error    df_error    SS_error / df_error
                   SS_total    df_total    0 ]
    Analysis_1 ← (Analysis_0)_{0,2} / (Analysis_0)_{1,2}
    Analysis_2 ← qF(1 − α, df_factor, df_error)
    Analysis_3 ← Analysis_1 < Analysis_2
    Analysis

Calculate Mean Values:

mean(D0) = 3.391    mean(D1) = 3.3914    mean(D2) = 3.3638    mean(D3) = 3.38

Conducting an analysis of variance:

For a given set of grouped data D and level of significance α:

ANOVA(D, α) = [ {3,3}   3.281   3.239   0 ]ᵀ

The ANOVA table:

                                   SS             df    MS
ANOVA(D, α)_0 =   Between Groups   2.519 · 10⁻³   3     8.396 · 10⁻⁴
                  Within Groups    4.094 · 10⁻³   16    2.559 · 10⁻⁴
                  Total            6.613 · 10⁻³   19    0

The Calculated F statistic: ANOVA(D, α)_1 = 3.281485

The critical F Statistic: ANOVA(D, α)_2 = 3.238872

The hypothesis test conclusion at the specified level of significance: ANOVA(D, α)_3 = 0

0 = reject hypothesis: there is a significant difference
1 = accept hypothesis: there is not a significant difference
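For readers without Mathcad, the same sums-of-squares bookkeeping can be sketched in plain Python. The grouping of the twenty values into D0 to D3 is our reading of the data-entry table (it reproduces the printed group means), the function name `one_way_anova` is ours, and the critical value is copied from the worksheet output rather than computed from an F distribution.

```python
# Groups as read from the worksheet's data-entry table.
groups = [
    [3.421, 3.377, 3.399, 3.379, 3.379],   # D0
    [3.407, 3.400, 3.417, 3.353, 3.380],   # D1
    [3.366, 3.360, 3.361, 3.362, 3.370],   # D2
    [3.380, 3.380, 3.380, 3.380, 3.380],   # D3
]
F_CRITICAL = 3.238872   # qF(0.95, 3, 16) as printed in the worksheet


def one_way_anova(groups):
    """Return (SS_factor, SS_error, df_factor, df_error, F) for grouped data."""
    n_total = sum(len(g) for g in groups)
    sx = sum(sum(g) for g in groups)                 # grand sum
    sx2 = sum(v * v for g in groups for v in g)      # sum of squares of all values
    t = sum(sum(g) ** 2 / len(g) for g in groups)    # sum of (group sum)^2 / n_i
    ss_factor = t - sx ** 2 / n_total                # between groups
    ss_error = sx2 - t                               # within groups
    df_factor = len(groups) - 1
    df_error = n_total - len(groups)
    f_stat = (ss_factor / df_factor) / (ss_error / df_error)
    return ss_factor, ss_error, df_factor, df_error, f_stat


ss_factor, ss_error, df_factor, df_error, f_stat = one_way_anova(groups)
accept = f_stat < F_CRITICAL   # 1 = accept, 0 = reject, as in the worksheet
print(f"F = {f_stat:.4f}, critical F = {F_CRITICAL}, accept hypothesis: {accept}")
```

The computed F comes out marginally above the critical value, so the hypothesis of equal group means is rejected, matching the worksheet's result of 0.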


ANOVA_s2

ANOVA (Analysis of Variance) Test

-----------------------------

This Worksheet demonstrates using Mathcad’s F distribution function and programming operators to conduct an analysis of variance (ANOVA) test.

Enter sample data used in test:

An element of D represents the data collected with a particular factor.

Data Entry:

D0       D1
3.421    3.366
3.377    3.360
3.399    3.361
3.379    3.362
3.379    3.370

Enter level of significance α:    α := 0.05

Program for conducting ANOVA test:

ANOVA(D, α) :=
    n_total ← 0
    SX ← 0
    SX2 ← 0
    T ← 0
    for i ∈ 0 .. last(D)
        SD_i ← Σ D_i
        nD_i ← length(D_i)
        SX ← SX + SD_i
        SX2 ← SX2 + D_i · D_i
        T ← T + (SD_i)² / nD_i
        n_total ← n_total + nD_i
    SS_factor ← T − SX² / n_total
    SS_error ← SX2 − T
    SS_total ← SX2 − SX² / n_total
    df_factor ← length(D) − 1
    df_error ← n_total − length(D)
    df_total ← n_total − 1
    Analysis_0 ← [ SS_factor   df_factor   SS_factor / df_factor
                   SS_error    df_error    SS_error / df_error
                   SS_total    df_total    0 ]
    Analysis_1 ← (Analysis_0)_{0,2} / (Analysis_0)_{1,2}
    Analysis_2 ← qF(1 − α, df_factor, df_error)
    Analysis_3 ← Analysis_1 < Analysis_2
    Analysis

Calculate Mean Values:

mean(D0) = 3.391    mean(D1) = 3.3638

Conducting an analysis of variance:

For a given set of grouped data D and level of significance α:

ANOVA(D, α) = [ {3,3}   9.755   5.318   0 ]ᵀ

The ANOVA table:

                                   SS             df    MS
ANOVA(D, α)_0 =   Between Groups   1.85 · 10⁻³    1     1.85 · 10⁻³
                  Within Groups    1.517 · 10⁻³   8     1.896 · 10⁻⁴
                  Total            3.366 · 10⁻³   9     0

The Calculated F statistic: ANOVA(D, α)_1 = 9.755274

The critical F Statistic: ANOVA(D, α)_2 = 5.317655

The hypothesis test conclusion at the specified level of significance: ANOVA(D, α)_3 = 0

0 = reject hypothesis: there is a significant difference
1 = accept hypothesis: there is not a significant difference


CompareT

Comparison Test for a Set of Measurements Vs. True Value

-------------------------------------------------

DATA ENTRY:

X1 := 5.10  5.20  5.30  5.10  5.00

n := rows(X1)

Mean of X1: mean(X1) = 5.14

Enter True Value (μ): μ := 5.2

Precision (or standard deviation):

$$sd(X1) := \left( \frac{1}{n - 1} \cdot \sum (X1 - \mathrm{mean}(X1))^2 \right)^{1/2}$$

sd(X1) = 0.114

Compute degrees of freedom as (n − 1): df := n − 1

Enter alpha value as α2: α2 := .95

Calculate t-table value:

$$\alpha_1 := \frac{\alpha_2 + 1}{2} \qquad t := qt(\alpha_1, df)$$

t-value: t = 2.776

t experimental (Te):

$$Te := \frac{|\mathrm{mean}(X1) - \mu|}{sd(X1)} \cdot \sqrt{n}$$

Te = 1.177

If Te ≤ t-value, then there is NO SIGNIFICANT DIFFERENCE.
If Te > t-value, then there IS A SIGNIFICANT DIFFERENCE between the set of measured values and the TRUE VALUE (i.e., they are different).
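The comparison against a true value is short enough to check by hand or in a few lines of Python. The sketch below uses the worksheet's data and true value; the tabulated t of 2.776 is copied from the worksheet (it corresponds to a two-sided 95% test with 4 degrees of freedom) rather than computed, and the variable names are ours.

```python
import math

x1 = [5.10, 5.20, 5.30, 5.10, 5.00]   # measurements from the worksheet
mu = 5.2                              # true value from the worksheet
T_TABLE = 2.776                       # qt(0.975, 4), as printed in the worksheet

n = len(x1)
mean_x1 = sum(x1) / n
# Sample standard deviation with n - 1 degrees of freedom.
sd_x1 = math.sqrt(sum((v - mean_x1) ** 2 for v in x1) / (n - 1))
# Experimental t: distance of the mean from the true value in units of sd / sqrt(n).
te = abs(mean_x1 - mu) / sd_x1 * math.sqrt(n)
significant = te > T_TABLE

print(f"mean = {mean_x1:.3f}, sd = {sd_x1:.3f}, Te = {te:.3f}, significant: {significant}")
```

Because Te is smaller than the table value, the measurements are not shown to differ from the true value at this confidence level.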


Comp_Meth

Computations for the Comparison of Two Methods (Youden/Steiner):

---------------------------------------------------------

RAW DATA ENTRY FOR METHOD A

METHOD A:

(Enter single determinations for Sample X from different laboratories using Method A):

Sample X    LAB #1    LAB #2    LAB #3    LAB #4
AX :=       3.37      3.38      3.36      3.38

nX := rows(AX)    mean(AX) = 3.372

(Enter single determinations for Sample Y from different laboratories using Method A):

Sample Y    LAB #1    LAB #2    LAB #3    LAB #4
AY :=       3.74      3.74      3.74      3.76

nY := rows(AY)    mean(AY) = 3.746

RAW DATA ENTRY FOR METHOD B

METHOD B:

(Enter single determinations for Sample X from different laboratories using Method B):

Sample X    LAB #1    LAB #2    LAB #3    LAB #4
BX :=       3.42      3.41      3.38      3.40

nX := rows(BX)    mean(BX) = 3.401

(Enter single determinations for Sample Y from different laboratories using Method B):

Sample Y    LAB #1    LAB #2    LAB #3    LAB #4
BY :=       3.76      3.86      3.74      3.83

nY := rows(BY)    mean(BY) = 3.8

[Figure: two panels. METHOD A: AY plotted against AX, with mean(AY) and mean(AX) marked. METHOD B: BY plotted against BX, with mean(BY) and mean(BX) marked.]

Two-sample Charts illustrating systematic errors for Methods A vs. B

CALCULATIONS:

METHOD A:

ADxy := |AX − AY|    mean(ADxy) = 0.374
ATxy := (AX + AY)    mean(ATxy) = 7.117

METHOD B:

BDxy := |BX − BY|    mean(BDxy) = 0.399
BTxy := (BX + BY)    mean(BTxy) = 7.201

d := ATxy − BTxy    Σd = 0.335

Mean Difference: mean(d) = 0.084

d2 := BTxy − ATxy

Measuring the Precision and Standard Deviation of the Methods (Youden/Steiner):

----------------------------------------------------------

Precision (Sr):

$$ASr := \sqrt{\frac{1}{2 \cdot (nY - 1)} \cdot \sum (ADxy - \mathrm{mean}(ADxy))^2} \qquad BSr := \sqrt{\frac{1}{2 \cdot (nY - 1)} \cdot \sum (BDxy - \mathrm{mean}(BDxy))^2}$$

ASr = 6.692658 · 10⁻³    BSr = 0.037334

Standard Deviation (Sd):

$$ASd := \sqrt{\frac{1}{2 \cdot (nY - 1)} \cdot \sum (ATxy - \mathrm{mean}(ATxy))^2} \qquad BSd := \sqrt{\frac{1}{2 \cdot (nY - 1)} \cdot \sum (BTxy - \mathrm{mean}(BTxy))^2}$$

ASd = 0.013056    BSd = 0.045387

Statistical Test for presence of systematic errors (Youden/Steiner):

------------------------------------------------------

F-statistic Calculation (Fs) for Precision Ratio:

Sr² Ratio:

$$PFs := \frac{BSr^2}{ASr^2}$$

PFs = 31.118

Ho: If Fs is less than or equal to Ft, then there is NO DIFFERENCE in Precision estimation.

Ha: If Fs is greater than Ft, then there is a DIFFERENCE in Precision estimation.

F-statistic Calculation (Fs) for Presence of Systematic Errors:

Sd² Ratio:

$$SFs := \frac{BSd^2}{ASd^2}$$

SFs = 12.085

Ho: If Fs is less than or equal to Ft, then there is NO DIFFERENCE in systematic error for methods.

Ha: If Fs is greater than Ft, then there is a DIFFERENCE in systematic error for methods.

F-statistic Table Value (Ft):

df1 := nY − 1    df1 = 3    qF(0.95, df1, df1) = 9.277
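The two variance-ratio tests can be reproduced directly from the printed Sr and Sd values. The following is a small arithmetic sketch only: the inputs are the rounded figures shown in the worksheet, the table value is copied from the worksheet, and the variable names are ours.

```python
# Printed worksheet values for Methods A and B.
a_sr, b_sr = 6.692658e-3, 0.037334   # precision estimates (Sr)
a_sd, b_sd = 0.013056, 0.045387      # standard deviations (Sd)
F_TABLE = 9.277                      # qF(0.95, 3, 3), as printed in the worksheet

pfs = b_sr ** 2 / a_sr ** 2          # precision ratio
sfs = b_sd ** 2 / a_sd ** 2          # systematic-error ratio

precision_differs = pfs > F_TABLE
systematic_error_differs = sfs > F_TABLE

print(f"PFs = {pfs:.3f} (differs: {precision_differs})")
print(f"SFs = {sfs:.3f} (differs: {systematic_error_differs})")
```

Both ratios exceed the table value, so under the stated decision rules Ho is rejected in both tests.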

Student’s t-Test for the Difference in the biases between Two Methods:

mean(d) = −0.084

Mean Difference: mean(d) = 0.084

$$s := \sqrt{\frac{1}{df1} \cdot \sum (d2 - \mathrm{mean}(d))^2}$$

s = 0.053

$$sm := \frac{s}{\sqrt{nY}}$$

sm = 0.026

t-test Statistic:

$$Te := \frac{\mathrm{mean}(d)}{sm}$$

Te = 3.189

Enter alpha value as α2: α2 := .95

Calculate t-table value:

$$\alpha_1 := \frac{\alpha_2 + 1}{2} \qquad \alpha_1 = 0.975 \qquad t := qt(\alpha_1, df1)$$

t-Table Value: t = 3.182

Ho: If Te is less than or equal to t, then there is NO SYSTEMATIC DIFFERENCE between method results.

Ha: If Te is greater than t, then there is a SYSTEMATIC DIFFERENCE (BIAS) between method results.

REFERENCES

1. Hinshaw, J.V., LC-GC 17(7), 616–625 (1999).
2. Mark, H. and Workman, J., Spectroscopy 2(2), 60–64 (1987).
3. Workman, J. and Mark, H., Spectroscopy 2(6), 58–60 (1987).
4. MathCad; MathSoft, Inc.: 101 Main Street, Cambridge, MA 02142; Vol. v. 7.0 (1997).
5. Mark, H. and Workman, J., Spectroscopy 10(1), 17–20 (1995).
6. Mark, H. and Workman, J., Spectroscopy 4(7), 53–54 (1989).
7. Youden, W.J. and Steiner, E.H., Statistical Manual of the AOAC, 1st ed. (Association of Official Analytical Chemists, Washington, DC, 1975).
8. Mark, H. and Workman, J., Statistics in Spectroscopy, 1st ed. (Academic Press, New York, 1991).
9. Draper, N. and Smith, H., Applied Regression Analysis (John Wiley & Sons, New York, 1981).
10. Zar, J.H., Biostatistical Analysis (Prentice Hall, Englewood Cliffs, NJ, 1974).
11. Owen, D.B., Handbook of Statistical Tables (Addison-Wesley Publishing Co., Inc., Reading, MA, 1962).

40

Is Noise Brought by the Stork? Analysis of Noise: Part 1

Well no, actually. If the truth be told, we all know that noise is brought (on) by quantum mechanics. Now, if we could some day find a really good quantum mechanic, one who could actually fix, once and for all, all those broken quanta around us, then maybe all the noise would go away; but that is probably too much to ask for and not likely to happen. It is about as likely as our getting away with making more of these sorts of bad jokes; those are more in the domain of other spectroscopy writers.

On to more serious matters: where does the noise come from, and how does noise affect our data, that is, the spectra we measure? Chemists are interested in the effects that various phenomena have on the accuracy of chemical analyses. General books about instrumental analysis discuss some of the sources of error, and even provide elementary derivations relating some of the instrumental phenomena to their effect on the error of the chemical analysis. Elementary texts [1, 2] derive a formula for the “optimum” absorbance a sample should have. More recent work has also been directed to ascertaining the “optimum” transmittance (or reflectance) value a sample should have for best quantitative accuracy, particularly for the situation when multivariate methods of analysis are in use [3, 4]. One standard treatment of the problem derives the error in concentration of an analyte caused by error of the spectral value, presents the often-seen curve showing that the relative error in concentration, ΔC/C, goes through a minimum, and computes that the minimum occurs at a transmittance of 0.368, corresponding to an absorbance of 0.4343. More advanced texts [5] relate the measurements and the measurement process to the noise of the spectrum given the nature of different noise sources, “noise” being the term generally (although rather loosely, to be sure) used to describe error of an instrumental reading, while “error” is used more generally. At the end of the day, though, they really mean the same thing: the random variations superimposed on the desired information.

Close examination reveals that these expositions are wanting. Sometimes a simplifying assumption is made that results in an incorrect description [2]. In other cases the argument is taken into the statistical domain prematurely, leaving no room to accommodate different situations [5]. It is clear, however, that one formula cannot fit all cases. There are a large number of ways in which instruments react to various sources of variation of the signal; we summarize some of them here:

1) Many common infrared and near-infrared detectors are subject to phenomena that are mainly thermal in origin, and therefore the detector noise is independent of the signal level.

2) Some detectors for the visible and UV spectral regions can detect individual photons. These detectors are shot-noise limited. X-ray and gamma-ray spectroscopy also detect individual photons and therefore are also limited by this source of variation. Since shot noise follows Poisson statistics, the detector noise in these cases increases with the square root of the signal.

3) Sometimes the detector noise is not the limiting noise source. One prime noise source can be generically called “scintillation noise”: variation in the amount of energy impinging on the detector. This often has mechanical causes: vibration of the source, or vignetting at an optical stop in the optical system, changing the geometry of the radiation on the detector. Astronomical measurements, of course, are subject to this noise source from atmospheric fluctuations, and represent the classic example of this type of variation. From whatever source, however, scintillation noise is directly proportional to the energy of the optical signal.

4) Other cases of non-detector noise occur when the noise is introduced after the detector. These are usually a result of limitations of the instrument and in principle could be reduced by re-engineering the instrument. Examples include power-line pickup, and mechanical vibrations affecting a sensitive part (generically called “microphonics”). The magnitude of these would also tend to be independent of the signal level.

5) One noise source tends to affect spectrometers of older design, namely those that use the optical-null principle. In optical-null spectrometers, various electrical (random noise and power-line pickup) and mechanical (vibration) noise sources can be introduced after the transmittance via the optical null is determined (P.R. Griffiths, 1998, personal communication), and in those cases the error of the transmittance will be constant. In fact, because of the historical origins, this is the case that is usually treated in the extant literature. However, this is not a simple one-to-one relationship either, since it depends on how the instrument designer chose to deal with the problem.
Many of those types of instruments had variable slits, and the slits could be opened or closed during a scan according to some preset program (hardwired, to be sure: these were not computer-controlled instruments). One possibility, of course, was to leave the slit at a constant opening that was preset before the scan was run. A second possibility was to program the slit for a constant bandwidth across the spectrum. A third possibility was to program the slit for constant reference energy. Here again, it is clear that the noise characteristics of the instrument will depend on which of these situations the construction of the instrument imposed, which gives us at least three subcases here.

6) Variations in the temperature of a blackbody used as the source in a spectrometer. The energy density of blackbody radiation is given by the well-known formula:

$$\frac{dE}{d\nu} = \frac{8\pi h \nu^3}{c^3} \, \frac{V}{e^{h\nu/kt} - 1} \tag{40-1a}$$

for radiation in the frequency range from $\nu$ to $\nu + d\nu$, where $t$ is the temperature, $V$ is the volume of the enclosure containing the radiation, and $h$, $k$ and $c$ have their usual meanings. Collecting the constants (to simplify the expression), we obtain

$$\frac{dE}{d\nu} = \frac{K \nu^3}{e^{h\nu/kt} - 1} \tag{40-1b}$$

Taking the derivative of this with respect to temperature, we obtain

$$\frac{d}{dt}\left(\frac{dE}{d\nu}\right) = K \nu^3 \, \frac{-1}{\left(e^{h\nu/kt} - 1\right)^2} \left(\frac{-h\nu}{kt^2}\right) e^{h\nu/kt} \tag{40-2}$$

Back-substituting equation 40-1b into equation 40-2, we obtain

$$\frac{d}{dt}\left(\frac{dE}{d\nu}\right) = \frac{dE}{d\nu} \, \frac{h\nu \, e^{h\nu/kt}}{kt^2 \left(e^{h\nu/kt} - 1\right)} \tag{40-3}$$

and we see that the relative energy change (as a fraction of the energy, per unit change in temperature) in the frequency interval between $\nu$ and $\nu + d\nu$ is given by the expression:

$$\frac{h\nu \, e^{h\nu/kt}}{kt^2 \left(e^{h\nu/kt} - 1\right)} \tag{40-4}$$

7) Variation of pathlength will create a source of variation in the data such that the change in absorbance is proportional to the absorbance. This can happen even in transmission spectroscopy if the walls of the sample cell for some reason should not be rigidly fixed in place, or possibly the cell might expand through temperature changes. Of course, in that case the sample itself is also likely to be affected directly; expansion of a liquid sample would have an effect equivalent to a reduction in pathlength. It can also happen, and is perhaps more common, in the case of diffuse reflectance. In that measurement technique, absent a rigorous theory to describe this physical phenomenon, the concept of a variable pathlength is used as a first approximation to the nature of the change in the measurements.

8) There are other sources of noise whose behavior cannot be described analytically. They are often principally due to the sample. A premier example is the variability of the measured reflectance of powdered solids. Since we do not have a rigorous ab initio theory of diffuse reflectance, we cannot create analytic expressions that describe the variation of the reflectance. Situations where the sample is unavoidably inhomogeneous will also fall into this category. In all such cases the nature of the noise will be unique to each situation and would have to be dealt with on a case-by-case basis.

9) Another source of variability, which can have still different characteristics, is comprised of the interaction of any of the above factors with a nonlinearity anywhere in the system. These nonlinearities could consist of nonlinearity in the detector, in the spectrometer’s electronics, optical effects such as changes in the field of view, and so on. Many of these nonlinearities are likely to be idiosyncratic to the cause, and would have to be characterized individually and also analyzed on a case-by-case basis.

10) Another, specialized, case would be nondispersive analyzers. For these instruments the whole concept of determining the signal between $\nu$ and $\nu + d\nu$ is inapplicable, since the measured signal represents the integrated optical intensity of the incident radiation over a broad range of wavelengths, likely including wavelength regions where the optical radiation is weak as well as where it is strong. Furthermore, this will be sample-dependent, and almost certainly would have to be dealt with on a sample-by-sample basis.
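As a numerical sanity check on item 6, the relative temperature sensitivity in equation 40-4 can be compared with a finite-difference derivative of equation 40-1b. The sketch below is illustrative only: the frequency and temperature are arbitrary values chosen by us, the constant K is set to 1 because it cancels in the ratio, and the function names are ours.

```python
import math

H = 6.62607015e-34    # Planck constant, J s
K_B = 1.380649e-23    # Boltzmann constant, J/K


def spectral_energy(nu, t, big_k=1.0):
    """Equation 40-1b: dE/dnu = K nu^3 / (exp(h nu / k t) - 1)."""
    return big_k * nu ** 3 / math.expm1(H * nu / (K_B * t))


def relative_change_per_kelvin(nu, t):
    """Equation 40-4: h nu e^x / (k t^2 (e^x - 1)), with x = h nu / (k t)."""
    x = H * nu / (K_B * t)
    return H * nu * math.exp(x) / (K_B * t ** 2 * math.expm1(x))


# Illustrative operating point (an assumption, not taken from the text).
nu, t, dt = 1.0e14, 1500.0, 0.01

# Central finite difference of 40-1b, expressed as a fraction of the energy.
numeric = (spectral_energy(nu, t + dt) - spectral_energy(nu, t - dt)) / (2 * dt)
numeric /= spectral_energy(nu, t)
analytic = relative_change_per_kelvin(nu, t)

print(f"finite difference: {numeric:.6e} per K, equation 40-4: {analytic:.6e} per K")
```

At this operating point the source energy changes by roughly 0.2% per kelvin, which shows how directly source-temperature drift maps into signal variation.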


Thus, given the variety of ways that the noise output of a detector is related to the optical signal into the detector, the argument that a single formula cannot account for them all becomes even more forceful. This being so, it is clear that each case needs to be treated separately in order to obtain a correct description of the effect on the noise of the spectrum. For single-beam spectra the noise can be described directly. For ratioed spectra, it is of interest to ascertain the effect of the various noise sources on the ratioed spectrum (i.e., the transmittance or reflectance spectrum, as the case may be) and on the absorbance spectrum, and also to determine, as was done previously [1, 2, 5], the optimum value for the sample to have that will give the minimum error of the calculated value. We will be doing this exercise during the course of the next few chapters. We will consider each of these types of noise one at a time. We will start from first principles, derive the appropriate expressions and deal with them in a completely rigorous manner. During the course of this we will compare our results with the ones in the literature and see where the standard derivations (NOT deviations!) depart from our presentation. We will begin in the next chapter with an analysis of the effect of one of the most common cases: constant detector noise, typical of mid-infrared and near-infrared instruments.

REFERENCES

1. Strobel, H.A., Chemical Instrumentation – A Systematic Approach to Instrumental Analysis (Addison-Wesley Publishing Co., Reading, MA, 1960).
2. Ewing, G., Instrumental Methods of Chemical Analysis, 4th ed. (McGraw-Hill, New York, 1975).
3. Honigs, D.E., Hieftje, G.M. and Hirschfeld, T., Applied Spectroscopy 39(2), 253–256 (1985).
4. Hirschfeld, T., Honigs, D. and Hieftje, G., Applied Spectroscopy 39(3), 430–433 (1985).
5. Ingle, J.D. and Crouch, S.R., Spectrochemical Analysis (Prentice-Hall, Upper Saddle River, NJ, 1988).

41 Analysis of Noise: Part 2

Note to the Reader: Chapters 41 through 53 are derived from a series of papers written about the subject of noise. They are sequential in nature, and the rationale and descriptions follow a series of equations, figures and tables that are best followed using a serial numbering system running sequentially throughout the chapter series. Thus the equations, figures, and tables for these chapters will contain the chapter number and then the sequential equation, figure, or table number. For example, Chapter 42 begins with Equation 41-19, and this equation would be designated as (42-19), following this format.

Chapter 40 is based on reference [1]. In that chapter we brought up the question of how various types of noise are related to the noise characteristics of the spectra one observes. In this chapter and in the chapters that follow (41 through 53 in all), we will derive the expressions for the various situations that arise; these situations have been described in greater detail within Chapter 40.

We begin with a fairly simple case: that of constant detector noise. This chapter will also serve to lay out the general conditions that apply to these derivations, such as nomenclature. We will treat this first case in excruciating detail, so that the methods we will use are clear; then, for the cases we will deal with in the future, we will be able to give an abbreviated form of the derivations, and anyone interested in following through themselves will be able to see how to do it. Also, some of the results are so unexpected that, without our giving every step, they may not be believed.

Since the measurements of reflectance and transmittance are defined by essentially the same equation, we will couch our discussion in terms of a transmittance measurement. The important difference lies, as we discussed previously, in the nature of the error superimposed on the measurement. Therefore, we begin by noting that transmittance ($T$) is defined by equation 41-1:

$$T = \frac{E_s - E_{0s}}{E_r - E_{0r}} \tag{41-1}$$

where $E_s$ and $E_r$ represent the signals due to the sample and reference readings, respectively, and $E_{0s}$ and $E_{0r}$ are the “dark” or “blank” readings associated with $E_s$ and $E_r$. ($E_r - E_{0r}$), of course, must be non-zero. The measured value of $T$, which includes the error $\Delta T$, is

$$T + \Delta T = \frac{(E_s + \Delta E'_s) - (E_{0s} + \Delta E'_{0s})}{(E_r + \Delta E'_r) - (E_{0r} + \Delta E'_{0r})} \tag{41-2}$$

where the $\Delta$ terms represent the fluctuation in the reading due to the instantaneous random effect of noise. An important point to note here is that $E_s$, $E_r$, and $T$, for any given set of readings at a given wavelength, are constants. All variations in the readings, due to noise, are associated with $\Delta E_s$, $\Delta E_r$, and $\Delta T$. Rearranging equation 41-2 we have

$$T + \Delta T = \frac{(E_s - E_{0s}) + (\Delta E'_s - \Delta E'_{0s})}{(E_r - E_{0r}) + (\Delta E'_r - \Delta E'_{0r})} \tag{41-3}$$

The difference between two random variables is itself a random variable; therefore we replace the terms ($\Delta E'_s - \Delta E'_{0s}$) and ($\Delta E'_r - \Delta E'_{0r}$) in equation 41-3 with the equivalent, simpler terms $\Delta E_s$ and $\Delta E_r$, respectively:

$$T + \Delta T = \frac{E_s - E_{0s} + \Delta E_s}{E_r - E_{0r} + \Delta E_r} \tag{41-4}$$

The presence of a non-zero dark reading, $E_0$, will, of course, cause an error in the value of $T$ computed. However, this is a systematic error and therefore is of no interest to us here; we are interested only in the behavior of random variables. Therefore we set $E_{0s}$ and $E_{0r}$ equal to zero and note that, if $T$ as described in equation 41-1 represents the “true” value of the transmittance, then the value we obtain for a given reading, including the instantaneous random effect of noise, is

$$T + \Delta T = \frac{E_s + \Delta E_s}{E_r + \Delta E_r} \tag{41-5}$$

and we also find that, upon setting $E_{0s}$ and $E_{0r}$ equal to zero in equation 41-1, equation 41-1 becomes

$$T = \frac{E_s}{E_r} \tag{41-6}$$

In equation 41-5, $\Delta E_s$ and $\Delta E_r$ represent the instantaneous, random values of the change in the sample and reference readings due to the noise. Since, as we noted above, $T$, $E_s$, and $E_r$ are constant for any given reading, any change in the measured value due to noise is contained in the terms $\Delta E_r$ and $\Delta E_s$. In statistical jargon this would be called “a point estimate of $T$ from a single reading”, and $\Delta T$ is the corresponding instantaneous change in the computed value of the transmittance. Again, $E_r$ must be non-zero. We note here that $\Delta E_s$ and $\Delta E_r$ need not be equal; that will not affect the derivation. For the case we are considering in this chapter, however, we are assuming constant detector noise; therefore when we pass to the statistical domain, we will consider $\Delta E_s$ to be equal to $\Delta E_r$. That, of course, refers only to the expected values; since the noise is random, the instantaneous values will virtually never be the same. Upon subtracting equation 41-6 from equation 41-5 we obtain the following:

$$T + \Delta T - T = \frac{E_s + \Delta E_s}{E_r + \Delta E_r} - \frac{E_s}{E_r} \tag{41-7}$$

$$\Delta T = \frac{E_r (E_s + \Delta E_s) - E_s (E_r + \Delta E_r)}{E_r (E_r + \Delta E_r)} \tag{41-8}$$

$$\Delta T = \frac{E_r E_s + E_r \Delta E_s - E_s E_r - E_s \Delta E_r}{E_r (E_r + \Delta E_r)} \tag{41-9}$$

$$\Delta T = \frac{E_r \Delta E_s - E_s \Delta E_r}{E_r (E_r + \Delta E_r)} \tag{41-10}$$
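Equation 41-10 is exact algebra, which is easy to confirm numerically. In the sketch below the energies and noise excursions are arbitrary illustrative numbers chosen by us; the third quantity drops the noise term from the denominator, which is the small-noise simplification used later in the chapter.

```python
e_s, e_r = 0.4, 1.0            # illustrative sample and reference energies
d_es, d_er = 0.003, -0.002     # one illustrative pair of noise excursions

# Direct evaluation: equation 41-5 minus equation 41-6.
direct = (e_s + d_es) / (e_r + d_er) - e_s / e_r

# Equation 41-10.
eq_41_10 = (e_r * d_es - e_s * d_er) / (e_r * (e_r + d_er))

# Same expression with the noise term dropped from the denominator.
small_noise = (e_r * d_es - e_s * d_er) / (e_r * e_r)

print(f"direct = {direct:.9f}, eq. 41-10 = {eq_41_10:.9f}, small-noise form = {small_noise:.9f}")
```

The first two numbers agree to rounding error, while the small-noise form differs only by a fraction of order ΔEr/Er.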


Equation 41-10 might look familiar. If you check an elementary calculus book, you will find that it is about the second-to-last step in the derivation of the derivative of a ratio (about all you need to do is go to the limit as $\Delta E_s$ and $\Delta E_r \to 0$). However, for our purposes we can stop here and consider equation 41-10. We find that the total change in $T$, that is $\Delta T$, is the result of two contributions:

$$\Delta T = \frac{E_r \Delta E_s}{E_r (E_r + \Delta E_r)} - \frac{E_s \Delta E_r}{E_r (E_r + \Delta E_r)} \tag{41-11}$$

We note that, since by assumption $E_r$ is non-zero, and $\Delta E_s$ is non-zero and independent of $E_r$, the first term of equation 41-11 is non-zero. The value of the second term of equation 41-11, however, will depend on the value of $E_s$, that is, on the transmittance of the sample.

In order to determine the standard deviation of $T$ we need to consider what would happen if we take multiple sample and reference readings; then we can characterize the variability of $T$. Since $E_r$ and $E_s$ are fixed quantities, when we take multiple readings we arrive at different values of $T + \Delta T$ due to the differences in the values of $\Delta E_r$ and $\Delta E_s$ on each reading, causing a change in $\Delta T$. Therefore we need to compute the standard deviation of $\Delta T$, which we do from the expression for $\Delta T$ in equation 41-11:

$$SD(\Delta T) = SD\left( \frac{E_r \Delta E_s}{E_r (E_r + \Delta E_r)} - \frac{E_s \Delta E_r}{E_r (E_r + \Delta E_r)} \right) \tag{41-12}$$

Or equivalently, we calculate the variance of $\Delta T$, which is the square of the standard deviation:

$$Var(\Delta T) = Var\left( \frac{E_r \Delta E_s}{E_r (E_r + \Delta E_r)} - \frac{E_s \Delta E_r}{E_r (E_r + \Delta E_r)} \right) \tag{41-13}$$

The proof that the variance of the sum of two uncorrelated terms is equal to the sum of the variances of the individual terms is a standard derivation in Statistics, but since most chemists are not familiar with it we present it in the Appendix. Having proven that theorem, and noting that $\Delta E_s$ and $\Delta E_r$ are independent random variables and therefore uncorrelated, we can apply that theorem to show that the variance of $\Delta T$ is:

$$Var(\Delta T) = Var\left( \frac{E_r \Delta E_s}{E_r (E_r + \Delta E_r)} \right) + Var\left( \frac{-E_s \Delta E_r}{E_r (E_r + \Delta E_r)} \right) \tag{41-14}$$

Since $\Delta E_r$ is small compared to $E_r$, the $\Delta E_r$ in the denominator terms will have little effect on the variance of $T$, and in the limit approaches zero. In a case where this is not true, the derivation must be suitably modified to include this term. This is relatively straightforward: substitute the parenthesized terms into the equation for variance (e.g., as we do in the Appendix), hook up about a 100-hp motor or so and “turn the crank”, as we will do in due course. It is mostly algebra, although a lot of it! In our current development, however, we assume $\Delta E_r$ is small and therefore negligible compared to $E_r$, so we replace ($E_r + \Delta E_r$) with $E_r$:

$$Var(\Delta T) = Var\left( \frac{E_r \Delta E_s}{E_r^2} \right) + Var\left( \frac{-E_s \Delta E_r}{E_r^2} \right) \tag{41-15}$$

$$Var(\Delta T) = Var\left( \frac{\Delta E_s}{E_r} \right) + Var\left( \frac{-T \Delta E_r}{E_r} \right) \tag{41-16}$$

We have shown previously that if $a$ represents a constant, then $Var(aX) = a^2 Var(X)$ ([2], or see [3] Chapter 11, p. 94). Hence equation 41-16 becomes

$$Var(\Delta T) = \left( \frac{1}{E_r} \right)^2 Var(\Delta E_s) + \left( \frac{-T}{E_r} \right)^2 Var(\Delta E_r) \tag{41-17}$$

Since we have assumed constant detector noise for this chapter, $Var(\Delta E_s) = Var(\Delta E_r) = Var(\Delta E)$:

$$Var(\Delta T) = \frac{1 + T^2}{E_r^2} \, Var(\Delta E) \tag{41-18}$$

Finally, reconverting variance back to SD by taking square roots on both sides of equation 41-18:

$$SD(\Delta T) = \frac{\sqrt{1 + T^2}}{E_r} \, SD(\Delta E) \tag{41-19}$$

We remind our readers here that $\Delta E$, as we have been using it in this derivation, is the difference between $\Delta E'$ and $\Delta E'_0$ in equation 41-4, and its expected value in the statistical nomenclature is therefore $2^{1/2}$ times as large as that of $\Delta E'$ (because it is the difference between two random variables with equal variance), a difference that should be taken note of when comparing results with the original definition of S/N in equation 41-2.

We next note, and this is in accordance with expectations, that the noise of the transmission spectrum, $SD(\Delta T)$, is dependent on the noise-to-signal ratio of the readings, the inverse of the S/N ratio commonly used and presented as a spectrometer specification, at least as long as the noise is small compared to the reference energy reading so that the approximation made in equation 41-15 remains valid. Recall that $E_r$ is the energy of the reference reading and $SD(\Delta E)$ is the noise of the readings from the detector; the ratio $SD(\Delta E)/E_r$ is the (inverse of the) true signal-to-noise ratio. The noise observed on a transmission spectrum, while related to S/N, is in itself not the true S/N ratio.

Next we note further, and this is probably contrary to most spectroscopists’ expectations, that the noise of the transmittance spectrum is not constant, but depends on the transmittance of the sample, being higher for highly transmitting samples than for dark samples. Since $T$ can vary from 0 (zero) to 1 (unity), the noise level can vary by a factor of the square root of two, from a relative value of unity (when $T = 0$) to 1.414... (when $T = 1$). This behavior is shown in Figure 41-1. The increase in noise with increasing signal might be considered counterintuitive, and therefore surprising, by some. Intuition tells us that the S/N ratio might be expected to improve with increased signal regardless of its source, or that the noise level of the transmittance spectrum should at least remain constant, for constant detector noise. This misapprehension has worked its way into the literature to modern times: “In most infrared measurements situations, the detector constitutes the limiting noise source. Because the resulting fluctuations have the same effect as a fixed uncertainty in the signal readout, they appear as a constant error in the transmittance”. [4]
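The dependence of the transmittance noise on $T$ in equation 41-19 can be checked with a small simulation. The sketch below is an illustration under our own assumptions, not part of the original derivation: the detector noise is taken to be Gaussian with the same standard deviation on the sample and reference readings, the noise level and sample size are arbitrary, and the function names are ours.

```python
import math
import random


def simulated_sd_of_t(t_true, e_r=1.0, sigma=0.001, n=100_000, seed=0):
    """Standard deviation of (Es + dEs) / (Er + dEr) over n simulated readings."""
    rng = random.Random(seed)
    e_s = t_true * e_r
    readings = [
        (e_s + rng.gauss(0.0, sigma)) / (e_r + rng.gauss(0.0, sigma))
        for _ in range(n)
    ]
    m = math.fsum(readings) / n
    return math.sqrt(math.fsum((r - m) ** 2 for r in readings) / n)


def predicted_sd_of_t(t_true, e_r=1.0, sigma=0.001):
    """Equation 41-19: SD(dT) = sqrt(1 + T^2) * SD(dE) / Er."""
    return math.sqrt(1.0 + t_true ** 2) * sigma / e_r


for t_true in (0.0, 0.5, 1.0):
    sim = simulated_sd_of_t(t_true)
    pred = predicted_sd_of_t(t_true)
    print(f"T = {t_true:.1f}: simulated {sim:.3e}, equation 41-19 {pred:.3e}")
```

With the noise small relative to the reference energy, the simulated values should track the prediction closely, including the rise by a factor of the square root of two between $T = 0$ and $T = 1$.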

Figure 41-1 Noise level of a transmittance spectrum as a function of the sample transmittance. [Plot: relative noise (vertical axis, 0 to 1.6) versus sample transmittance (horizontal axis, 0 to 1).]
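The curve in Figure 41-1 can be regenerated directly from the $\sqrt{1 + T^2}$ dependence of equation 41-19. The following is a minimal sketch in Python; the function name is ours, chosen for illustration, and the noise is expressed relative to its value at $T = 0$.

```python
import math

def relative_transmittance_noise(T):
    """Noise of a transmittance reading relative to its value at T = 0.

    Uses the sqrt(1 + T^2) factor of equation 41-19; the constant factor
    SD(dE)/Er is divided out.
    """
    return math.sqrt(1.0 + T * T)

# Tabulate the curve at a few transmittance values between 0 and 1.
for T in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"T = {T:4.2f}   relative noise = {relative_transmittance_noise(T):.4f}")
```

The two ends of the table reproduce the limits quoted in the text: unity at $T = 0$ and the square root of two at $T = 1$.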

Intuition tells us that if the transmittance is zero, then it should have no effect on the readings. In fact this is true, but misleading. The transmittance being zero, or the sample energy being zero, does not mean that the variability of the reading is zero. The explanation of the actual behavior comes from a careful perusal of the intermediate equations developed in the course of arriving at equation 41-19, specifically equation 41-14. From the first term in that equation we see that the irreducible minimum noise is contributed by the reference signal level ($E_r$) multiplied by the variation of the sample signal ($\Delta E_s$), independently of the value of the sample signal. Increasing the sample signal then adds further noise to the total through the second term of equation 41-14, in which the sample signal is multiplied by the reference noise.

Conventional developments of the subject contain flaws that are usually hidden and subtle. In Ewing's book, for example [5], the development includes the step (see page 43, the section between equations 3-6 and 3-7) of noting that, since the reference energy is essentially set equal to unity, $\log(E_r)$ (or $P_0$, the equivalent in Ewing's terminology) is set equal to zero. However, this is done before the separation of $P_0$ from $\Delta P_0$, creating the implicit, but erroneous, result that $\Delta P_0$ is zero as well. In our nomenclature, this causes the second term of equation 41-14 to vanish, and as a consequence the erroneous result obtained is that $\Delta T$ is independent of $T$. This, of course, appears to confirm intuition and, since it is based on mathematics, appears to be beyond question. Other treatments [6] simply do not question the origin of the noise in $T$, assume a priori that it is constant, and work from there.
The more sophisticated treatment of Ingle and Crouch [7] comes very close but also misses the mark; for an unexplained reason they insert the condition: “... it is assumed there is no uncertainty in measuring $E_{rt}$ and $E_{0t}$ ...”. Now in fact this could happen (or at least there could be no variation in $\Delta E_r$), for example if one reference spectrum was used in conjunction with multiple sample spectra using an FTIR spectrometer. However, that would not be a true indication of the total error of the measurement, since the effect of the noise in the reference reading would have been removed from the calculated SD, whereas the true total error of the reading would in fact include that source of error, even though part of it were constant. It is to their credit that these authors explicitly state their assumption that they ignore the variability of $E_r$ rather than hiding it. Furthermore, they allude to the fact that something is going on when they state “... the approximation is good to within a factor of $2^{1/2}$.” Nevertheless they failed to follow through and derive the exact solution to the problem.

The bottom line to all this is that, in one way or another, previous treatments of this subject have invariably failed to consider the effect of the noise of the reference reading, and have therefore arrived at an erroneous conclusion.

Whew! I think that is enough for one chapter. I need a rest. And so does the typesetter! We will continue the derivation in our next chapter.
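The effect of ignoring the reference noise can be illustrated with a small simulation. The sketch below is ours, not part of the original derivation: it assumes Normally distributed detector noise of equal standard deviation on the sample and reference readings, a reference energy of unity, and a noise level small enough (0.01) for the first-order result of equation 41-19 to apply. The function and variable names are hypothetical.

```python
import math
import random

def sample_sd(values):
    """Sample standard deviation with the (n - 1) divisor."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def simulated_sd_of_T(T, noise_sd, noisy_reference, n=100_000, seed=1):
    """SD of computed transmittance Es/Er when detector noise is added.

    Assumption: Er = 1, and the same noise SD applies to both readings.
    With noisy_reference=False the reference reading is held fixed, which
    mimics the conventional treatments discussed in the text.
    """
    rng = random.Random(seed)
    Er = 1.0
    Es = T * Er
    readings = []
    for _ in range(n):
        es = Es + rng.gauss(0.0, noise_sd)
        er = Er + rng.gauss(0.0, noise_sd) if noisy_reference else Er
        readings.append(es / er)
    return sample_sd(readings)

T, noise_sd = 0.5, 0.01
sd_full = simulated_sd_of_T(T, noise_sd, noisy_reference=True)
sd_fixed = simulated_sd_of_T(T, noise_sd, noisy_reference=False)
print("both readings noisy :", sd_full)
print("reference held fixed:", sd_fixed)
print("equation 41-19      :", noise_sd * math.sqrt(1.0 + T * T))
```

With the reference held fixed, the simulated noise stays close to the detector noise itself; with both readings noisy it should come out close to the value given by equation 41-19.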

APPENDIX

Proof that the variance of a sum equals the sum of the variances

Let A and B be random variables. Then the variance of (A + B) is by definition:

$$\mathrm{Var}(A+B) = \frac{\sum \left[ (A+B) - \overline{(A+B)} \right]^2}{n-1} \tag{41-A1}$$

Since $\overline{(A+B)} = \bar{A} + \bar{B}$, we can separate the numerator terms and then expand the numerator:

$$\mathrm{Var}(A+B) = \frac{\sum \left( \begin{aligned} & A^2 + AB - A\bar{A} - A\bar{B} + AB + B^2 - \bar{A}B - B\bar{B} \\ & - A\bar{A} - \bar{A}B + \bar{A}^2 + \bar{A}\bar{B} - A\bar{B} - B\bar{B} + \bar{A}\bar{B} + \bar{B}^2 \end{aligned} \right)}{n-1} \tag{41-A2}$$

We can now collect terms as follows:

$$\mathrm{Var}(A+B) = \frac{\sum \left( A^2 - 2A\bar{A} + \bar{A}^2 \right)}{n-1} + \frac{\sum \left( B^2 - 2B\bar{B} + \bar{B}^2 \right)}{n-1} + 2\,\frac{\sum (A - \bar{A})(B - \bar{B})}{n-1} \tag{41-A3}$$

Equation 41-A3 can be checked by expanding the last term, collecting terms and verifying that all the terms of equation 41-A2 are regenerated. The third term in equation 41-A3 is a quantity called the covariance between A and B. The covariance is a quantity related to the correlation coefficient. Since the differences from the mean are randomly positive and negative, the product of the two differences from their respective means is also randomly positive and negative, and tends to cancel when summed. Therefore, for independent random variables the covariance is zero, since the correlation coefficient is zero for uncorrelated variables. In fact, the mathematical definition of “uncorrelated” is that this sum-of-cross-products term is zero. Therefore, since A and B are random, uncorrelated variables:

$$\mathrm{Var}(A+B) = \frac{\sum (A - \bar{A})^2}{n-1} + \frac{\sum (B - \bar{B})^2}{n-1} \tag{41-A4}$$

The two terms of equation 41-A4 are, by definition, the variances of A and B:

$$\mathrm{Var}(A+B) = \mathrm{Var}(A) + \mathrm{Var}(B) \tag{41-A5}$$

QED
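The decomposition of equation 41-A3 can be checked numerically. The sketch below is an illustration under our own assumptions: two independently generated sets of Normally distributed values (the means, standard deviations and sample size are arbitrary), with helper names chosen for this example. For any finite sample the three-term identity holds exactly, while the covariance term is only approximately zero.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    """Sample variance with the (n - 1) divisor, as in equation 41-A1."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def covariance(xs, ys):
    """Sample covariance: the cross-product term of equation 41-A3 without the factor 2."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

rng = random.Random(0)
n = 50_000
a = [rng.gauss(5.0, 2.0) for _ in range(n)]    # arbitrary mean and SD
b = [rng.gauss(-1.0, 3.0) for _ in range(n)]   # generated independently of a
total = [x + y for x, y in zip(a, b)]

print("Var(A + B)               :", variance(total))
print("Var(A) + Var(B)          :", variance(a) + variance(b))
print("2 Cov(A, B)              :", 2.0 * covariance(a, b))
print("Var(A)+Var(B)+2Cov(A, B) :", variance(a) + variance(b) + 2.0 * covariance(a, b))
```

The covariance of a finite sample is small but not exactly zero, which is why the first two printed values agree only approximately while the first and last agree to rounding error.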

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 15(10), 24–25 (2000).
2. Mark, H. and Workman, J., Spectroscopy 3(8), 13–15 (1988).
3. Mark, H. and Workman, J., Statistics in Spectroscopy, 1st ed. (Academic Press, New York, 1991).
4. Honigs, D.E., Hieftje, G.M. and Hirschfeld, T., Applied Spectroscopy 39(2), 253–256 (1985).
5. Ewing, G., Instrumental Methods of Chemical Analysis, 4th ed. (McGraw-Hill, New York, 1975).
6. Strobel, H.A., Chemical Instrumentation – A Systematic Approach to Instrumental Analysis (Addison-Wesley Publishing Co., Reading, MA, 1960).
7. Ingle, J.D. and Crouch, S.R., Spectrochemical Analysis (Prentice-Hall, Upper Saddle River, NJ, 1988).


42

Analysis of Noise: Part 3

We have been discussing the question of how noise in a spectrometer affects the observed noise in the spectra we measure. This question was introduced in reference [1], and various known phenomena were presented that contribute (or, at least, can contribute) to the noise level of the observed spectra. Since this is a continuation of the previous chapters, we will continue the numbering of equations, figures, and so on as though it were all one chapter.

In Chapter 41, based on reference [2], we derived the following expression for the noise of a transmission measurement, for the case of constant detector noise, as is commonly found in IR and NIR spectrometers:

$$\mathrm{SD}(\Delta T) = \frac{\mathrm{SD}(\Delta E)}{E_r}\sqrt{1 + T^2} \tag{42-19, also shown as 41-19}$$

To continue the derivation, the next step is to determine the variation of the absorbance readings, starting with the definition of absorbance. The extension we present here is, of course, based on Beer's law, which is valid for clear solutions. For other types of measurements the derivation should be based on a suitable function of $T$ that applies to the situation; for diffuse reflectance, for example, the Kubelka-Munk function should be used. Starting from the definition of absorbance:

$$A = -\log(T) \tag{42-20a}$$

$$A = -0.4343\,\ln(T) \tag{42-20b}$$

We take the derivative,

$$dA = -0.4343\,\frac{dT}{T} \tag{42-21}$$

and substitute the expressions for $T$ (Equation 41-6) and $dT$, replacing the differentials by finite differences so that we can use the expression for $\Delta T$ found previously (Equation 41-11):

$$\Delta A = \frac{-0.4343\left[\dfrac{E_r\,\Delta E_s}{E_r(E_r + \Delta E_r)} - \dfrac{E_s\,\Delta E_r}{E_r(E_r + \Delta E_r)}\right]}{\dfrac{E_s}{E_r}} \tag{42-22}$$

$$\Delta A = \frac{-0.4343\,E_r}{E_s}\left[\frac{E_r\,\Delta E_s}{E_r(E_r + \Delta E_r)} - \frac{E_s\,\Delta E_r}{E_r(E_r + \Delta E_r)}\right] \tag{42-23}$$

$$\Delta A = \frac{-0.4343\,E_r\left(E_r\,\Delta E_s - E_s\,\Delta E_r\right)}{E_r\,(E_r + \Delta E_r)\,E_s} \tag{42-24}$$

Again allowing ourselves to neglect $\Delta E_r$ in comparison with $E_r$:

$$\Delta A = \frac{-0.4343\left(E_r\,\Delta E_s - E_s\,\Delta E_r\right)}{E_s\,E_r} \tag{42-25}$$

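At this stage we have two branches of a derivation “tree” to pursue: one is to determine the standard deviation of $\Delta A$; the other is to continue the derivation toward the final result corresponding to the “standard” treatments of the topic, but using our rigorously derived equations.

We start with the computation of the standard deviation of $\Delta A$, which is straightforward. We cut the derivation slightly short, however, in that we apply the same sequence of steps that we previously applied to the case of $\Delta T$ [2], but present only the result of each step, not all the intermediate equations. These steps are: separating the fraction in equation 42-25 into two terms; taking the variance of both sides of the equation; noting that $\mathrm{Var}(\Delta E_s) = \mathrm{Var}(\Delta E_r) = \mathrm{Var}(\Delta E)$; applying the two theorems that tell us

1) $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$
2) $\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)$

simplifying the expressions when possible; and then taking square roots again. So we start by multiplying through and separating the fractions in equation 42-25:

$$\Delta A = \frac{-0.4343\,\Delta E_s}{E_s} + \frac{0.4343\,\Delta E_r}{E_r} \tag{42-26}$$

taking the variance of both sides of the equation:

$$\mathrm{Var}(\Delta A) = \mathrm{Var}\!\left(\frac{-0.4343\,\Delta E_s}{E_s} + \frac{0.4343\,\Delta E_r}{E_r}\right) \tag{42-27}$$

apply the theorem $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$:

$$\mathrm{Var}(\Delta A) = \mathrm{Var}\!\left(\frac{-0.4343\,\Delta E_s}{E_s}\right) + \mathrm{Var}\!\left(\frac{0.4343\,\Delta E_r}{E_r}\right) \tag{42-28}$$

and then the theorem $\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)$:

$$\mathrm{Var}(\Delta A) = \left(\frac{-0.4343}{E_s}\right)^2 \mathrm{Var}(\Delta E_s) + \left(\frac{0.4343}{E_r}\right)^2 \mathrm{Var}(\Delta E_r) \tag{42-29}$$

Let $\mathrm{Var}(\Delta E_s) = \mathrm{Var}(\Delta E_r) = \mathrm{Var}(\Delta E)$:

$$\mathrm{Var}(\Delta A) = \left(\frac{-0.4343}{E_s}\right)^2 \mathrm{Var}(\Delta E) + \left(\frac{0.4343}{E_r}\right)^2 \mathrm{Var}(\Delta E) \tag{42-30}$$

$$\mathrm{Var}(\Delta A) = \left[\left(\frac{-1}{E_s}\right)^2 + \left(\frac{1}{E_r}\right)^2\right] 0.4343^2\,\mathrm{Var}(\Delta E) \tag{42-31}$$

and finally:

$$\mathrm{SD}(\Delta A) = 0.4343\,\mathrm{SD}(\Delta E)\sqrt{\frac{1}{E_s^2} + \frac{1}{E_r^2}} \tag{42-32}$$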

We may compare this with the $\mathrm{SD}(\Delta A)$ that would be obtained if $\Delta E_r$ were set to zero in equation 42-25 (as per the conventional derivation):

$$\mathrm{SD}(\Delta A) = \frac{0.4343\,\mathrm{SD}(\Delta E)}{E_s} \tag{42-33}$$

Since $E_s$ can go from zero to $E_r$, it is interesting and instructive to plot these two functions, in order to compare the effect of eliminating the terms involving $\Delta E_r$ from the expressions. We do this in Figure 42-2.

To continue on the second branch of our derivation “tree” as described above, we next derive expressions for the relative precision, $\Delta A/A$, starting with the use of equations 42-20b and 42-25:

$$\frac{\Delta A}{A} = \frac{\dfrac{-0.4343\left(E_r\,\Delta E_s - E_s\,\Delta E_r\right)}{E_s\,E_r}}{-0.4343\,\ln(T)} \tag{42-34}$$

$$\frac{\Delta A}{A} = \frac{E_r\,\Delta E_s - E_s\,\Delta E_r}{E_s\,E_r\,\ln(T)} \tag{42-35}$$

$$\frac{\Delta A}{A} = \frac{1}{\ln(T)}\left(\frac{\Delta E_s}{E_s} - \frac{\Delta E_r}{E_r}\right) \tag{42-36}$$

Figure 42-2 Absorbance noise as a function of transmittance, for the exact solution (upper curve: equation 42-32) and the approximate solution (lower curve: equation 42-33). The noise-to-signal ratio, i.e., $\Delta E/E_r$, was set to 0.01. (see Color Plate 3) [Plot titled “Exact versus approximate solution”: absorbance noise (vertical axis, 0 to 0.6) versus %T (horizontal axis, 0 to 1).]
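The two curves of Figure 42-2 can be tabulated from equations 42-32 and 42-33. The sketch below is a minimal illustration; the function names are ours, the reference energy is assumed to be unity, and the noise-to-signal ratio of 0.01 is the value quoted in the figure caption.

```python
import math

def sd_absorbance_exact(T, noise_to_signal, Er=1.0):
    """Equation 42-32: absorbance noise including the reference noise."""
    Es = T * Er
    sd_E = noise_to_signal * Er
    return 0.4343 * sd_E * math.sqrt(1.0 / Es**2 + 1.0 / Er**2)

def sd_absorbance_approx(T, noise_to_signal, Er=1.0):
    """Equation 42-33: absorbance noise with the reference noise neglected."""
    Es = T * Er
    sd_E = noise_to_signal * Er
    return 0.4343 * sd_E / Es

# Tabulate both expressions; T = 0 is excluded because both diverge there.
for T in (0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0):
    exact = sd_absorbance_exact(T, 0.01)
    approx = sd_absorbance_approx(T, 0.01)
    print(f"T = {T:4.2f}   exact = {exact:.5f}   approx = {approx:.5f}   ratio = {exact / approx:.4f}")
```

The ratio of the two expressions is $\sqrt{1 + T^2}$, so the curves nearly coincide at low transmittance and differ by a factor of the square root of two at $T = 1$.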

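Again going through the steps needed to convert to the statistical domain (as we did before), we first take the variance of both sides of equation 42-36 to obtain

$$\mathrm{Var}\!\left(\frac{\Delta A}{A}\right) = \mathrm{Var}\!\left[\frac{1}{\ln(T)}\left(\frac{\Delta E_s}{E_s} - \frac{\Delta E_r}{E_r}\right)\right] \tag{42-36a}$$

Then apply the theorem $\mathrm{Var}(A + B) = \mathrm{Var}(A) + \mathrm{Var}(B)$:

$$\mathrm{Var}\!\left(\frac{\Delta A}{A}\right) = \mathrm{Var}\!\left(\frac{1}{\ln(T)}\,\frac{\Delta E_s}{E_s}\right) + \mathrm{Var}\!\left(\frac{1}{\ln(T)}\,\frac{-\Delta E_r}{E_r}\right) \tag{42-36b}$$

And then the theorem $\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)$:

$$\mathrm{Var}\!\left(\frac{\Delta A}{A}\right) = \left(\frac{1}{E_s \ln(T)}\right)^2 \mathrm{Var}(\Delta E_s) + \left(\frac{-1}{E_r \ln(T)}\right)^2 \mathrm{Var}(\Delta E_r) \tag{42-36c}$$

$$\mathrm{Var}\!\left(\frac{\Delta A}{A}\right) = \frac{1}{E_s^2 \ln(T)^2}\,\mathrm{Var}(\Delta E_s) + \frac{1}{E_r^2 \ln(T)^2}\,\mathrm{Var}(\Delta E_r) \tag{42-36d}$$

$$\mathrm{Var}\!\left(\frac{\Delta A}{A}\right) = \frac{1}{\ln(T)^2}\left[\frac{1}{E_s^2}\,\mathrm{Var}(\Delta E_s) + \frac{1}{E_r^2}\,\mathrm{Var}(\Delta E_r)\right] \tag{42-36e}$$

Then setting $\mathrm{Var}(\Delta E_s) = \mathrm{Var}(\Delta E_r) = \mathrm{Var}(\Delta E)$:

$$\mathrm{Var}\!\left(\frac{\Delta A}{A}\right) = \frac{1}{\ln(T)^2}\left[\frac{1}{E_s^2}\,\mathrm{Var}(\Delta E) + \frac{1}{E_r^2}\,\mathrm{Var}(\Delta E)\right] \tag{42-36f}$$

$$\mathrm{Var}\!\left(\frac{\Delta A}{A}\right) = \frac{\mathrm{Var}(\Delta E)}{\ln(T)^2}\left(\frac{1}{E_s^2} + \frac{1}{E_r^2}\right) \tag{42-36g}$$

$$\mathrm{Var}\!\left(\frac{\Delta A}{A}\right) = \frac{\mathrm{Var}(\Delta E)}{\ln(T)^2}\left(\frac{E_s^2 + E_r^2}{E_s^2\,E_r^2}\right) \tag{42-36h}$$

And finally, taking square roots on both sides to convert to standard deviations, and substituting $E_s/E_r$ for $T$:

$$\mathrm{SD}\!\left(\frac{\Delta A}{A}\right) = \frac{-\mathrm{SD}(\Delta E)\sqrt{E_s^2 + E_r^2}}{E_s\,E_r\,\ln(E_s/E_r)} \tag{42-37}$$

We may compare this, for example, with the equation at an equivalent point in Ingle and Crouch's development [3] (taking that as a “typical” derivation):

$$\frac{\sigma_A}{A} = \frac{-s_t}{T\,E_r\,\ln(T)} \tag{Ingle and Crouch's equation 5-45}$$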

The relationship and differences between the two equations are obvious, except we may note that, while a standard deviation ($\sigma$) can never be negative, there is always the issue, when taking a square root, of determining the sign. Since $E_s/E_r$ is less than unity, the logarithm in the denominator is negative, and therefore we must take the sign of the square root in the numerator to be negative as well, in order to obtain a positive value for $\mathrm{SD}(\Delta A)$. Equation 42-37 then reduces to the Ingle and Crouch equation if $\Delta E_r$ goes to zero (as Ingle and Crouch assume) and we pass to the statistical domain. Again, it is interesting and instructive to compare the two expressions by plotting them as a function of $T$, which we do in Figure 42-3.

Figure 42-3 Comparison of the exact (upper curve: equation 42-37) and approximate (lower curve: Ingle and Crouch equation 5-45) expressions for the standard deviation of $\Delta A/A$ as a function of %T. Noise-to-signal is set at 0.01. [Two panels, “Exact versus Approx Solution for SD [Δ(A)/A]” and “Expansion of SD [Δ(A)/A]”: Δ(A)/A (vertical axis) versus %T (horizontal axis), with curves labeled Exact and Approx.]

From Figure 42-3 we also see the well-known effect on the relative precision of spectral analysis of, on the one hand, $T \to 0$ and, on the other, $\ln(T) \to 0$ as $T \to 1$. The minimum relative error occurs, in the standard treatment, at $T = 0.368$ [4]. Examining the data table from which Figure 42-3 was created (using EXCEL™) confirms what Figure 42-3 leads us to suspect: using the exact solution, there is a

shift from the previously accepted value; the optimum value of transmittance occurs at 33.0%T rather than the generally accepted value of 36.8%T.

We wish to develop an analytic expression for this situation. To do so, we will follow the same steps used in the standard development, but use the rigorously correct equation (i.e., equation 42-37) instead of the approximate equation previously used. The steps are the standard ones used for finding a minimum (or maximum) of a function: take the derivative of equation 42-37, then set that derivative equal to zero. Since the derivative of interest is the derivative with respect to $T$, in preparation for this we reorganize equation 42-37 as follows: we substitute equation 41-6 (reference [2]), reorganized to $E_s = TE_r$ (since $E_r$ is a constant), into equation 42-37; this enables us to eliminate $E_s$ from the equation:

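$$\mathrm{SD}\!\left(\frac{\Delta A}{A}\right) = \frac{\mathrm{SD}(\Delta E)\sqrt{(TE_r)^2 + E_r^2}}{TE_r\,E_r\,\ln(T)} \tag{42-38}$$

$$\mathrm{SD}\!\left(\frac{\Delta A}{A}\right) = \frac{\mathrm{SD}(\Delta E)\sqrt{E_r^2\,(T^2 + 1)}}{T\,E_r^2\,\ln(T)} \tag{42-39}$$

$$\mathrm{SD}\!\left(\frac{\Delta A}{A}\right) = \frac{\mathrm{SD}(\Delta E)\sqrt{T^2 + 1}}{T\,E_r\,\ln(T)} \tag{42-40}$$

We could work with equation 42-40, but it is instructive to slightly reorganize it:

$$\mathrm{SD}\!\left(\frac{\Delta A}{A}\right) = \frac{\mathrm{SD}(\Delta E)}{E_r}\,\frac{\sqrt{T^2 + 1}}{T\,\ln(T)} \tag{42-41}$$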
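Before any calculus is applied, the location of the optimum can be found by simply scanning equation 42-41 over a grid of transmittance values, which is in effect what examining a spreadsheet table does. The sketch below is our illustration, not the authors' spreadsheet: the constant factor $\mathrm{SD}(\Delta E)/E_r$ is divided out, magnitudes are used so that the relative precision is positive, and the grid spacing of 0.0001 is an arbitrary choice.

```python
import math

def relative_precision_exact(T):
    """Magnitude of equation 42-41, in units of SD(dE)/Er."""
    return math.sqrt(T * T + 1.0) / abs(T * math.log(T))

def relative_precision_conventional(T):
    """Same expression with the radical replaced by unity (conventional form)."""
    return 1.0 / abs(T * math.log(T))

# Scan 0 < T < 1; the end points are excluded because both expressions diverge.
grid = [i / 10000.0 for i in range(1, 10000)]
T_best_exact = min(grid, key=relative_precision_exact)
T_best_conventional = min(grid, key=relative_precision_conventional)

print("optimum T, exact expression       :", T_best_exact)
print("optimum T, conventional expression:", T_best_conventional)
```

The two minima should fall near the 33.0%T and 36.8%T values quoted in the text, to within the resolution of the grid.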

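$\mathrm{SD}(\Delta E)/E_r$ is, as before, the noise-to-signal ratio of the reference signal. We can also note that if the variation of the reference reading were neglected, then the term under the radical would simply be unity and the expression would again reduce to the conventional expression. We are now ready to take the derivative with respect to $T$:

$$\frac{d}{dT}\,\mathrm{SD}\!\left(\frac{\Delta A}{A}\right) = \frac{d}{dT}\left[\frac{\mathrm{SD}(\Delta E)}{E_r}\,\frac{\sqrt{T^2 + 1}}{T\,\ln(T)}\right] \tag{42-42}$$

We note again that $\mathrm{SD}(\Delta E)/E_r$ is a constant, so that it can be taken outside the derivative:

$$\frac{d}{dT}\,\mathrm{SD}\!\left(\frac{\Delta A}{A}\right) = \frac{\mathrm{SD}(\Delta E)}{E_r}\,\frac{d}{dT}\left[\frac{\sqrt{T^2 + 1}}{T\,\ln(T)}\right] \tag{42-43}$$

Applying the theorem for the derivative of a ratio:

$$\frac{d}{dT}\,\mathrm{SD}\!\left(\frac{\Delta A}{A}\right) = \frac{\mathrm{SD}(\Delta E)}{E_r}\left\{\frac{T\,\ln(T)\,\dfrac{d}{dT}\sqrt{T^2 + 1} - \sqrt{T^2 + 1}\,\dfrac{d}{dT}\bigl(T\,\ln(T)\bigr)}{\bigl(T\,\ln(T)\bigr)^2}\right\} \tag{42-44}$$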

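Since the derivative in the first term in the numerator of equation 42-44 is of the form $U^n$, where $n$ has the value of 1/2, we apply the theorem that the derivative of $U^n$ is $nU^{n-1}\,dU$ to that part. And since the derivative in the second term of equation 42-44 is of the form $U \times V$, where $U = T$ and $V = \ln(T)$, we apply the theorem that the derivative of the product $U \times V$ is $U\,dV + V\,dU$ to that part. Then:

$$\frac{d}{dT}\,\mathrm{SD}\!\left(\frac{\Delta A}{A}\right) = \frac{\mathrm{SD}(\Delta E)}{E_r}\left\{\frac{T\,\ln(T)\,\dfrac{1}{2\sqrt{T^2 + 1}}\,\dfrac{d}{dT}\left(T^2 + 1\right) - \sqrt{T^2 + 1}\left(T\,\dfrac{d}{dT}\ln(T) + \ln(T)\,\dfrac{d}{dT}T\right)}{\bigl(T\,\ln(T)\bigr)^2}\right\} \tag{42-45}$$

Now we can start simplifying (in several steps):

$$\frac{d}{dT}\,\mathrm{SD}\!\left(\frac{\Delta A}{A}\right) = \frac{\mathrm{SD}(\Delta E)}{E_r}\left\{\frac{T\,\ln(T)\,\dfrac{2T}{2\sqrt{T^2 + 1}} - \sqrt{T^2 + 1}\left(T\,\dfrac{1}{T} + \ln(T)\right)}{\bigl(T\,\ln(T)\bigr)^2}\right\} \tag{42-46}$$

$$\frac{d}{dT}\,\mathrm{SD}\!\left(\frac{\Delta A}{A}\right) = \frac{\mathrm{SD}(\Delta E)}{E_r}\left\{\frac{T^2\,\ln(T) - \left(\sqrt{T^2 + 1}\right)^2\bigl(1 + \ln(T)\bigr)}{\sqrt{T^2 + 1}\,\bigl(T\,\ln(T)\bigr)^2}\right\} \tag{42-47}$$

$$\frac{d}{dT}\,\mathrm{SD}\!\left(\frac{\Delta A}{A}\right) = \frac{\mathrm{SD}(\Delta E)}{E_r}\,\frac{T^2\,\ln(T) - \left(T^2 + 1\right)\bigl(1 + \ln(T)\bigr)}{\sqrt{T^2 + 1}\,\bigl(T\,\ln(T)\bigr)^2} \tag{42-48}$$

For comparison, we note that the corresponding equation from the conventional formulation is

$$\frac{d}{dT}\,\mathrm{SD}\!\left(\frac{\Delta A}{A}\right) = -\frac{\mathrm{SD}(\Delta E)}{E_r}\,\frac{1 + \ln(T)}{\bigl(T\,\ln(T)\bigr)^2} \tag{42-49}$$

Now we set the derivative in equation 42-48 equal to zero and obtain

$$T^2\,\ln(T) - \left(T^2 + 1\right)\bigl(1 + \ln(T)\bigr) = 0 \tag{42-50}$$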

This is a transcendental equation, which is not easily solved by ordinary methods. Nowadays, however, computers make the solution of such equations by successive approximations easy. In this case, again using EXCEL™, we find that the value of $T$ that makes the left-hand side of equation 42-50 become zero, and which therefore is the transmittance corresponding to minimum relative error, is 0.32994, rather than the previously accepted value of 0.368.

By now you probably think we are done. Not by a long shot! There is considerably more to learn about the effect of noise on a spectrum when the detector noise is constant, some of which is even more surprising than what we have seen until now. More to come in the next chapters. Stay tuned!
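The successive-approximation solution of equation 42-50 mentioned above can be sketched with a simple bisection search. This is our illustration rather than the authors' spreadsheet procedure; the bracketing interval of 0.1 to 0.9 and the tolerance are arbitrary choices, and the function names are hypothetical.

```python
import math

def condition_42_50(T):
    """Left-hand side of equation 42-50."""
    return T * T * math.log(T) - (T * T + 1.0) * (1.0 + math.log(T))

def bisect(f, lo, hi, tol=1e-10):
    """Find a root of f in [lo, hi] by repeated halving of the interval.

    Requires f(lo) and f(hi) to have opposite signs.
    """
    f_lo = f(lo)
    if f_lo * f(hi) > 0:
        raise ValueError("root is not bracketed by the given interval")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi = mid              # sign change lies in the lower half
        else:
            lo, f_lo = mid, f_mid  # sign change lies in the upper half
    return 0.5 * (lo + hi)

T_optimum = bisect(condition_42_50, 0.1, 0.9)
print("T at minimum relative error:", round(T_optimum, 5))
```

Expanding the product shows that equation 42-50 is equivalent to $\ln(T) = -(1 + T^2)$, whose left-hand side minus right-hand side increases steadily with $T$, so there is only one root between 0 and 1 and the choice of bracket is not critical.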


REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 15(10), 24–25 (2000).
2. Mark, H. and Workman, J., Spectroscopy 15(11), 20–23 (2000).
3. Ingle, J.D. and Crouch, S.R., Spectrochemical Analysis (Prentice-Hall, Upper Saddle River, NJ, 1988).
4. Strobel, H.A., Chemical Instrumentation – A Systematic Approach to Instrumental Analysis (Addison-Wesley Publishing Co., Reading, MA, 1960).

43

Analysis of Noise: Part 4

This chapter is the continuation of a set [1–3] dealing with the rigorous derivation of the expressions relating instrument noise to its effect on the spectra we observe. Our first chapter in this set was an overview; since then we have been analyzing the effect of noise on spectra when the noise is constant detector noise, that is, noise that is independent of the strength of the optical signal. Inasmuch as we are dealing with a continuous set of equations, we again continue our discussion by continuing our equation numbering, figure numbering, use of symbols, and so on as though there were no break.

We left off in Chapter 42, based on the original publication [3], with determining the sample transmittance corresponding to the best relative precision of a spectral measurement; we then noted that there is more to learn about noise effects on quantitative spectroscopic analysis. “What more is there?” you might ask. Well, in the previous chapters, we learned that the transmittance level affects the noise. In this chapter we will learn that the noise can also affect the transmittance. To see how, let us go to equation 41-14 (reference [2]), which we reproduce here, and note the discussion that followed it (which we won't reproduce here: the reader may go back to the original and reread it):

$$\mathrm{Var}(\Delta T) = \mathrm{Var}\!\left(\frac{E_r\,\Delta E_s}{E_r\,(E_r + \Delta E_r)}\right) + \mathrm{Var}\!\left(\frac{-E_s\,\Delta E_r}{E_r\,(E_r + \Delta E_r)}\right) \tag{41-14}$$

Basically, the development of the mathematical derivations from Chapter 41 (Equations 41-15 onward) was based on the assumption that in Equation 41-14, $\Delta E_r$ was small compared to $E_r$, so that it could be ignored; this was done for several reasons, one being that it allowed considerable simplification of the equations, which was pedagogically useful. More significant and fundamental is that it represents a limiting case of the situation. But suppose the noise is not small enough to be ignored, that is, it is not small compared to $E_r$? Then we cannot ignore it, or its effect on the equations. As we might expect, it also complicates the analysis of the situation enormously. We mentioned at that time that we would discuss that situation in due course, and the time has now come to do so.

Normally, mathematical derivations are done the other way round: the full equations are developed first and then the special cases described and their effect on the equations worked through. But we chose to do it “backwards”, so to speak, because we felt it is more pedagogically effective that way; and it allows our readers to follow along with us in the simpler situations, before becoming immersed in the full complexities of the equations.


Besides, that is the way we like to do things.

As we will see, there are significant consequences of non-negligible noise. To start our discussion we will go back even farther than equation 41-14 and start with equation 41-5 (reference [2]), which we again reproduce here:

$$T + \Delta T = \frac{E_s + \Delta E_s}{E_r + \Delta E_r} \tag{Equation 41-5 from Chapter 41}$$

This can be separated into two terms:

$$T + \Delta T = \frac{E_s}{E_r + \Delta E_r} + \frac{\Delta E_s}{E_r + \Delta E_r} \tag{43-51}$$

so that now we can equate corresponding terms on the two sides of the equation:

$$T_M = \frac{E_s}{E_r + \Delta E_r} \tag{43-52a}$$

$$\Delta T = \frac{\Delta E_s}{E_r + \Delta E_r} \tag{43-52b}$$

where $T_M$ represents the measured transmittance value of a reading subjected to noise. So now we see that equation 43-52a represents the computed transmittance of the reading, and equation 43-52b represents the deviation of that transmittance due to noise. We will address the possibility of a contribution of equation 43-52b to the computed value of $T_M$ a bit later in this chapter. We will also occasionally use the term $f(E_r)$ to refer to the expression on the right-hand side of equation 43-52a.

Upon averaging several values of equation 43-52a, the fact that the noise is in the denominator causes the average value of the effect of the noise not to approach zero, and therefore averaging several values of $T$ will result in a computed value different from the actual value of $E_s/E_r$. This is because division is a non-linear arithmetic operation. To illustrate this effect, we will use a numerical example, and consider two readings of $T$ with values of $\Delta E_r = 0.2$ and $-0.2$ times $E_r$ (remember that we are dealing specifically with the case where the noise is not negligibly small compared to the signal); this will make the “noise” symmetrical around $E_r$. Then, the general formula for the average value of $T$ computed will be

$$\bar{T} = \frac{1}{n}\sum_{i=1}^{n}\frac{E_s}{E_r + \Delta E_i} \tag{43-53}$$

and for two readings as we described this becomes:

$$\bar{T} = \frac{1}{2}\left(\frac{E_s}{E_r + \alpha E_r} + \frac{E_s}{E_r - \alpha E_r}\right) \tag{43-54}$$

where $\alpha$ represents the fractional change of the measurement. For our example, the specific value of $\alpha = 0.2$:

$$\bar{T} = \frac{1}{2}\left(\frac{E_s}{E_r + 0.2 \times E_r} + \frac{E_s}{E_r - 0.2 \times E_r}\right) \tag{43-55}$$

$$\bar{T} = \frac{E_s}{2E_r}\left(\frac{1}{1 + 0.2} + \frac{1}{1 - 0.2}\right) \tag{43-56}$$

$$\bar{T} = \frac{E_s}{2E_r}\left(0.833333333 + 1.25\right) \tag{43-57}$$

$$\bar{T} = 1.0416666\,\frac{E_s}{E_r} \tag{43-58}$$
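The arithmetic of equations 43-54 through 43-58 is easy to check, and the same few lines show what happens when the two reference readings are averaged before the division instead of after it. The sketch below is ours; the function names and the value of $E_s$ are chosen only for illustration.

```python
def mean_of_ratios(Es, Er, alpha):
    """Equation 43-54: compute T for each reading, then average the T values."""
    return 0.5 * (Es / (Er + alpha * Er) + Es / (Er - alpha * Er))

def ratio_of_means(Es, Er, alpha):
    """Average (coadd) the two reference readings first, then compute T once."""
    mean_reference = 0.5 * ((Er + alpha * Er) + (Er - alpha * Er))
    return Es / mean_reference

Es, Er = 0.5, 1.0   # illustrative values; the true transmittance is 0.5
for alpha in (0.05, 0.1, 0.2, 0.5):
    biased = mean_of_ratios(Es, Er, alpha) / (Es / Er)
    unbiased = ratio_of_means(Es, Er, alpha) / (Es / Er)
    print(f"alpha = {alpha:4.2f}   averaged T / true T = {biased:.7f}   coadded first = {unbiased:.7f}")
```

For this symmetric two-reading case the bias factor is $1/(1 - \alpha^2)$, which gives 1.0416667 at $\alpha = 0.2$ and grows without limit as $\alpha$ approaches unity, while coadding the reference readings first removes the bias entirely.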

Thus we see that, even though the noise values of the reference readings are equally distributed around their mean value of $E_r$, their effect on the computed value of transmittance is not symmetrically distributed, due to the nonlinearity of the division process; the result is a change of the computed value from the (in this case, known) true value. Now, smaller variations will show smaller effects and larger variations will show larger effects (i.e., change the computed value of $T$ less or more than the amount shown). The relative effect, for this somewhat artificial case, is shown in Figure 43-4.

As the noise becomes larger and larger compared to the value of the reference signal, the second term in equation 43-54 becomes more and more dominant. Therefore we cannot allow the “noise” to equal the signal value, since in that case the denominator of the second term would become zero and the “average” value of $T$ would be infinite. This concern will occur again as we continue discussing this situation. One obvious consequence of this is that if data is to be coadded, the coaddition of sample and reference signals should be done individually, before the computation of $T$, rather than computing $T$ for each reading and then averaging the several values of $T$ together.

An interesting side note here: in the real world there is nothing to prevent the noise from becoming greater than the signal (except for the alertness of the spectroscopist doing the work!), thus it is entirely possible for the measured value of the reference reading to become arbitrarily small and the computed value of $T$ to become arbitrarily large. Presumably, any spectroscopist will recognize that such data have no meaning. However, we here find ourselves in the quite unusual situation of allowing the mathematics to limit the extent of our analysis, rather than “real world” considerations, just the reverse of what usually happens.
It is even possible for an individual (negative) noise pulse to exceed $E_r$ in magnitude, so that a negative reading of the reference energy will be obtained. This happens in the real world and therefore must be taken into account in the mathematical description. This is a good place to also note that since the transmittance of a physical sample must be between zero and unity, $E_s$ must be no greater than $E_r$, and therefore when $E_r$ is small an individual reading of $E_s$ can also be negative. Therefore it is entirely possible for an individual computed value of $T$ to be negative.

Now, while we are concerned in these chapters with a thorough analysis of these effects, in practice this is usually not too serious a problem, for several reasons. The first reason is that if the data is noisy and needs to be coadded to reduce the noise level, the coadding is normally done in accordance with our recommendation above: before the computation of $T$. Therefore the error of the values of $E_s$ and $E_r$ is reduced before the computation of $T$ is performed, thus keeping it out of the regime where the nonlinear effect becomes important.

The second reason is that under normal measurement conditions, the only place where such a high N/S ratio is liable to occur will be at the ends of the spectral range, where $E_r$ is becoming very small. Here, however, the effect will be masked by other effects contained in the data, such as the effect of small changes in source intensity, external interference or, in the case of FTIR, interferometer misalignment, or any of several other effects that change the actual values of reference and sample energy at the limits of the spectral range. On the other hand, if the measurement situation is such that the reference energy is small and cannot be increased (e.g., outdoor open-air monitoring, or insufficient time available for coaddition of data), so that the noise level is an appreciable fraction of the reference signal, then this phenomenon can become important.

Figure 43-4 The relative change in computed value of $T$ from equation 43-53 for various values of $\alpha$. [Two panels, “Relative computed transmittance” and “Expansion of plot”: relative increase (vertical axis) versus $\alpha$ (horizontal axis, 0 to 1).]

Now it is time to examine the effect of a more realistic type of noise than we have been considering so far. In a real situation, of course, where many readings may be

averaged together, some will contain small errors and some will contain large errors, each one making its nonlinear contribution to the value of $T$. Obviously, only one average value of $T$ will be computed from the data. The net effect on the value of $T$ computed from many readings will thus depend not only on the standard deviation of $\Delta E_r$ compared to $E_r$, but also on how many readings there are at each value of $\Delta E_r$, that is, on the distribution of the values of $\Delta E_r$. Statisticians call this average of many values of a quantity the “expected value” of that quantity. For many reasons that we have discussed previously [4], the Normal distribution is the one that inevitably occurs in nature when there is no overriding factor to change it; therefore it is the one we consider.

How do we determine the effect of using the Normal distribution? Basically, what we want to do is find the average value for many readings, when we know how often each reading occurs; after all, that is the meaning of a distribution. This would be the expected value. If we had discrete readings, we would let $W_i$ represent the weight of the $i$th value, that is, how often that value occurs, and $X_i$ represent the value, then use the formula for a weighted average:

$$\bar{X}_W = \frac{\sum_i W_i\,F(X_i)}{\sum_i W_i} \tag{43-59}$$

The Normal distribution, however, is a continuous distribution, as is the distribution of values of $E_s/(E_r + \Delta E_r)$; therefore we have to change the summations to integrations:

$$\bar{X}_W = \frac{\int W(x)\,f(x)\,dx}{\int W(x)\,dx} \tag{43-60}$$

and, since in this case $W(x)$ represents the Normal distribution weighting:

$$\frac{1}{\sigma(2\pi)^{1/2}}\,e^{-\frac{1}{2}\left(\frac{\Delta E_r - \overline{\Delta E_r}}{\sigma}\right)^2}$$

which specifies the relative weights of the different values, we replace $W(x)$ with the expression for the Normal distribution, and $f(x)$ is the function $E_s/(E_r + \Delta E_r)$, so that equation 43-60 becomes

$$\bar{T}_{WN} = \frac{\dfrac{1}{\sigma(2\pi)^{1/2}}\displaystyle\int_{-\infty}^{\infty}\frac{E_s}{E_r + \Delta E_r}\,e^{-\frac{1}{2}\left(\frac{\Delta E_r - \overline{\Delta E_r}}{\sigma}\right)^2}\,d\Delta E_r}{\dfrac{1}{\sigma(2\pi)^{1/2}}\displaystyle\int_{-\infty}^{\infty}e^{-\frac{1}{2}\left(\frac{\Delta E_r - \overline{\Delta E_r}}{\sigma}\right)^2}\,d\Delta E_r} \tag{43-61}$$

where $\sigma$ is the standard deviation of the variations of the energy readings and $\bar{T}_{WN}$ represents the mean computed transmittance for Normally distributed detector noise. Since the normalization factor in front of the integral representing the Normal distribution in the denominator is intended to make the final value of the integrated Normal distribution be unity, the denominator of equation 43-61 is therefore unity, hence:

$$\bar{T}_{WN} = \frac{1}{\sigma(2\pi)^{1/2}}\int_{-\infty}^{\infty}\frac{E_s}{E_r + \Delta E_r}\,e^{-\frac{1}{2}\left(\frac{\Delta E_r - \overline{\Delta E_r}}{\sigma}\right)^2}\,d\Delta E_r \tag{43-62}$$
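The discrete weighted average of equation 43-59 can be made concrete with a small example. The sketch below is ours: the five noise values and their weights (1, 4, 6, 4, 1, a rough bell shape) are hypothetical, chosen only to mimic a symmetric distribution in which small errors are more common than large ones.

```python
def weighted_mean_transmittance(Es, Er, deviations, weights):
    """Equation 43-59 applied to f = Es / (Er + dEr).

    deviations: the discrete values of the reference noise dEr
    weights:    how often each value occurs
    """
    numerator = sum(w * Es / (Er + d) for d, w in zip(deviations, weights))
    return numerator / sum(weights)

Es, Er = 0.5, 1.0                            # illustrative energies; true T = 0.5
deviations = [-0.2, -0.1, 0.0, 0.1, 0.2]     # hypothetical noise values
weights = [1, 4, 6, 4, 1]                    # hypothetical, symmetric weights

T_mean = weighted_mean_transmittance(Es, Er, deviations, weights)
print("weighted mean T / true T =", T_mean / (Es / Er))
```

Even though the weights are symmetric about zero noise, the weighted mean comes out larger than the true transmittance, because the readings with a reduced denominator raise the ratio by more than the readings with an enlarged denominator lower it.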

A plot presenting the two parts of the integrand, and their product, is shown in Figure 43-5.

Figure 43-5 The Normal curve, the function $f(E_r)$ [$= E_r/(E_r + \Delta E_r)$] from equation 43-62, and their product. (see Color Plate 4) [Two panels, “Integration terms” and “Expansion of integral functions”: curves labeled $f(E_r)$, Normal distribution and Product, plotted against $\Delta E_r$ (horizontal axis).]

We made an attempt to perform the integration analytically, which failed. While that approach may still be possible, it does not seem likely, for a couple of reasons. One is the general difficulty of integrating the Normal distribution (sometimes called the Error Function, for obvious reasons). The other is that the Normal distribution is infinite in extent, and therefore, regardless of the value of $E_r$ or of the standard deviation being represented by the particular Normal distribution in use, there will inevitably be a point at which the term $E_s/(E_r + \Delta E_r)$ in equation 43-62 attains a value of infinity (when $\Delta E_r = -E_r$). While this in itself does not automatically preclude performing the integration, or prevent the integral from having a finite value, it points to a problem area, one which indicates that if the integral can be evaluated at all, it will require special methods, as the evaluation of the error function itself does.

Now in fact, all this is also in accord with reality: an attempt to use data in which the reference energy becomes so small that the noise brings even a single reading down to zero will cause the computed value corresponding to that reading to become infinite; then, averaging that with any finite number of other finite values will still result in an

infinite value for the computed value of $T$. This is, of course, catastrophic to our attempt to deal with this situation analytically.

Another point to note: if we look at equation 43-62 critically, we note that the variables are not completely separable. While we can remove $E_s$ from inside the integration, $E_r$ is not so easily removed. How, then, can we determine the effect on the computed value of $T$? One way is to multiply the right-hand side of equation 43-62 by unity, in the form of $E_r/E_r$; this leads to

$$\bar{T}_{WN} = \frac{E_s}{E_r}\,\frac{1}{\sigma(2\pi)^{1/2}}\int_{-\infty}^{\infty}\frac{E_r}{E_r + \Delta E_r}\,e^{-\frac{1}{2}\left(\frac{\Delta E_r - \overline{\Delta E_r}}{\sigma}\right)^2}\,d\Delta E_r \tag{43-63}$$

which now puts the expression into the form of the ratio of the measured values of $E_s$ and $E_r$, with a multiplier. It also, perhaps, makes what is going on somewhat clearer: in the limit of small values of $\Delta E_r$ the base expression reduces to $E_r/E_r$, which is unity; the integral then reduces to the ordinary Normal distribution, which, as we noted, also evaluates to unity, so that in the limit of small levels of noise, $T$ becomes $E_s/E_r$, as it should.

However, we still have that pesky $E_r$ inside the integral. As we might expect, the effect of the noise, $\Delta E_r$, is really going to be affected by its relationship to $E_r$, the signal strength. The overall noise value is contained in the exponent of the Normal distribution weighting factor, but its presence in the first part of the integral indicates that it has more than just that effect. Thus, if we try to determine the effect of changing the signal-to-noise ratio, at constant noise level, by changing $E_r$, we must realize that $E_r$ then becomes a parameter affecting the value of the integral. Therefore, in order to represent the effect of varying the signal-to-noise ratio in this regime, we will require a family of curves rather than just a single one.

Since we have seen that the integral cannot be evaluated analytically, there are several alternatives to analytic integration of equation 43-63: we can perform the integration numerically, we can investigate the behavior of equation 43-63 using a Monte-Carlo simulation, or we can expand equation 43-63 into a power series. In all cases we need to take at least a brief look at what happens when $\Delta E_r$ is close to the asymptote at $-E_r$; basically, the integrand goes off to +infinity when approaching from above (as we saw), and to −infinity when approaching from below. If we do not try to compute values when we are too close to $-E_r$, therefore, with any of these approaches there will be a tendency toward cancellation of the positive and negative terms, leaving a finite result.
In the case of a power series expansion, the closer we come to unity, the more terms we would need to include in the series. We now report on the evaluation of the integral in equation 43-63, which was done numerically by computer. The numerical computations were carried out using MATLAB. Here we examine the conditions and the results obtained for this exercise.

Before attempting to evaluate the integral, we first tested for convergence, that is, that the integral is finite, and also that when evaluating it we are using a sufficiently fine interval of integration to provide accurate results. To do this, we evaluated the integral for a small region around the point $E_r = 0$, using different values of the integration interval. The integration range was −0.01 to +0.01. Integration intervals ranged from $10^{-2}$ to $10^{-7}$. The standard deviation of the Normal distribution was set to unity (note that we will eventually investigate the behavior of equation 43-63 for various values of the standard deviation, so that at this point setting it equal to unity is convenient for


Table 43-1 Values of the integral between −0.01 and 0.01, for various values of the integration interval

Integration interval    Value of integral
$10^{-2}$               0.012130208846544832
$10^{-3}$               0.012130457519586295
$10^{-4}$               0.012130476382397228
$10^{-5}$               0.012130478208633820
$10^{-6}$               0.012130478390650151
$10^{-7}$               0.012130478408845785

pedagogical purposes, and also for a quick “ballpark” evaluation of equation 43-63 for other values of this parameter), and the mean of the Normal distribution was also set equal to unity. Since the section of the Normal distribution that is 1 standard deviation away from the mean is the region that has the maximum slope, these conditions gave the maximum weight to the region around the infinity of $f(\Delta E_r)$; thus if the integral did not diverge here it would not diverge at any other point of the Normal distribution. The results are in Table 43-1.

Since the value of the integration interval also determines how close to the point of infinity any contribution may be, presumably, if the integral were to diverge, what we would see around the point of infinity would be contributions to the integral increasing faster and faster as the computation included points closer to the infinity. Under those circumstances, we would observe an increasing value of the integral as we used finer and finer intervals of integration. What we see in Table 43-1, on the other hand, is that as we use smaller intervals of integration, more digits of the integral remain stable; thus we conclude that the integral does indeed seem to be converging on a finite value. We also observe that using an integration interval of $10^{-4}$ provides precision on the order of one part in $10^{7}$, which is more than sufficient for our purposes.

For the evaluation itself, the range of integration was set to be wide enough (10 standard deviations) that, at the number of iterations we used, there is no further appreciable contribution to the integral from values beyond that range; the value of the Normal distribution at 10 standard deviations is approximately $2\times10^{-22}$. The integral is computed for various values of $E_r$, each set of such integrals forming one curve that we will plot. The family of curves is generated by using various values of sigma ($\sigma$, the standard deviation of the readings due to detector noise).
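The convergence test summarized in Table 43-1 can be sketched in a few lines. This is not the authors' MATLAB code; it is a minimal Python reconstruction under several assumptions that are ours: the integrand is taken as the Normal weighting factor divided by the measured reading, the weighting factor is left unnormalized (no $1/\sigma(2\pi)^{1/2}$ factor), the sum runs over grid points at whole multiples of the interval, and the point at zero is simply skipped. The names `weight` and `integral_near_pole` are hypothetical.

```python
import math

def weight(x, mean=1.0, sigma=1.0):
    """Unnormalized Normal weighting factor exp(-((x - mean)/sigma)^2 / 2)."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)

def integral_near_pole(step, half_width=0.01):
    """Riemann sum of weight(x)/x over [-half_width, +half_width].

    Grid points are x = +/-step, +/-2*step, ..., +/-half_width; the point
    x = 0, where 1/x is infinite, is left out (assumption, see lead-in).
    """
    n = round(half_width / step)
    total = 0.0
    for k in range(1, n + 1):
        x = k * step
        # Add the points at +x and -x together: their large contributions
        # have opposite signs and largely cancel.
        total += (weight(x) / x + weight(-x) / (-x)) * step
    return total

for exponent in range(2, 6):
    step = 10.0 ** (-exponent)
    print(f"{step:.0e}  {integral_near_pole(step):.15f}")
```

Under these assumptions the printed values settle down as the interval shrinks, which is the behavior the table is meant to demonstrate; finer intervals than those in the loop take proportionally longer in pure Python.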
For our demonstration, we compute the curve of multiplication factor versus $E_r$ for values of sigma of 0.1, 0.2, …, 1.0. The point at $\Delta E_r = -E_r$, with the infinite value, was deleted from the set before adding the terms of the integral. Since we are using the Normal distribution, we take this opportunity to point out some of the other characteristics of the error, in particular the fact that the errors have a mean value of zero.

The multiplication factor according to the integral of equation 43-63 was computed, and the family of curves is presented in Figure 43-6. Interestingly, while the values of individual computations of the multiplication factor for a finite number of discrete points can reach large values, as we saw above, we find that the expected value of the multiplication factor reaches a maximum value at a modest level, and then approaches zero as $E_r$ approaches zero. The explanation is that at large values of the reference signal strength, $E_r$, where the noise becomes small compared to the signal, the multiplication factor approaches unity, so that the computed value of $T_W$ approaches $E_s/E_r$, as we


would expect. As the reference signal strength decreases so that it becomes comparable to the noise level, occasional individual data points will be measured in the regime where the nonlinearity of the division process becomes important; this nonlinearity then causes the computed value of $T$ to be higher than the value computed under strong-signal (i.e., low-noise) conditions. When $E_r$ approaches zero, however, the Normal curve then allows occasional negative values to be included in the integral, and more and more often as the reference signal strength decreases further. In reality, noise can indeed cause an apparent negative value of $E_r$, which would result in a negative value for the computed quantity $T$, even though it is a mathematical artifact and cannot correspond to an actual negative value for the physical property, $T$. In the limit of the reference signal strength approaching zero, there will be equal contributions of negative and positive excursions from zero, so that the average value will be zero. Since the sample signal strength must be less than the reference signal strength, the same thing is happening to $E_s$, the sample signal, so that in fact the computation would assume the undefined form of 0/0. Examining Figure 43-6, however, shows that the limiting value of $T$ as $E_r$ approaches zero is also zero.

The family of curves obtained, and presented in Figure 43-6, shows that, not surprisingly, the controlling parameter of the family of curves is the standard deviation of the noise; the maximum value of the multiplication factor occurs at a given multiple of the standard deviation of the energy readings. Successive approximations show that the maximum multiplier of approximately 1.28 occurs when $E_r$ is approximately 2.11 times sigma, the standard deviation of $\Delta E_r$.

Some miscellaneous questions arise, which we address here. First of all, since the value of a reading can become infinite, why is the integral finite and well-behaved?
The answer is that while a single reading can indeed become large beyond all bounds as $\Delta E_r$ approaches $-E_r$, the probability of obtaining a value closer and closer to exactly $-E_r$ becomes smaller and smaller, and the probability of being exactly $-E_r$ is exactly zero; therefore in reality an infinity will not occur. Hence the integral, representing the average of what will actually occur, remains finite. There are other factors, also. One factor is that, as we consider two values of $\Delta E_r$ at equal magnitude and opposite directions from $-E_r$, we realize that as the two values get closer to $-E_r$ there is less room for the nonlinearities to act; therefore the magnitudes of the two values of $f(\Delta E_r)$ become more and more nearly the same, and since they have opposite sign they cancel each other more and more exactly.

[Figure 43-6: plot titled “Multiplication factor for T as a function of Er”. Vertical axis: multiplication factor (0.2 to 1.4); horizontal axis: Er (0 to 4.84); curves labeled from σ = 0.1 to σ = 1.0.]

Figure 43-6 Family of curves of multiplication factor as a function of $E_r$, for different values of the parameter sigma (the noise standard deviation), for Normally distributed error. Values of sigma range from 0.1 to 1.0 for the ten curves shown. (see Color Plate 5)

Secondly, why do the curves pass through a maximum and then go to zero as $E_r$ approaches −1? If we look at Figure 43-5, and particularly at the expanded plot, we see that the asymmetry of the Normal curve with respect to the function $f(\Delta E_r)$ causes the cross-product of the two curves (which, after all, is what is being integrated) to exhibit a fairly large area between the peak of the Normal curve and where the curve of $f(\Delta E_r)$ really “takes off” that has no counterpart in the region where $f(\Delta E_r)$ is negative. This creates a net positive contribution to the integral. As $E_r$ approaches −1, the Normal curve “slides under” $f(\Delta E_r)$, and there is an increasing contribution from the negative portion of $f(\Delta E_r)$, until symmetry assures us that when $E_r = -1$ there is always a negative contribution of $f(\Delta E_r)$ to cancel each positive contribution, so that $T_W = 0$ at that point.

Thirdly, when we separated equation 43-51 into two terms, we only worked with the first term. The second term, which we presented in equation 43-52B, was neglected. Is it possible that the nonlinear effects observed for equation 43-52A will also operate on equation 43-52B? The answer is yes, it will, but with a caveat: $\Delta E_s$ is a random variable, just as $\Delta E_r$ is. Furthermore, it is uncorrelated with $\Delta E_r$.
Therefore, in order to evaluate the integral representing the variation of both $\Delta E_s$ and $\Delta E_r$, it would be necessary to perform a double integration over both variables. Now, for each value of $\Delta E_s$, the nonlinearity caused by the presence of $\Delta E_r$ in the denominator would apply. However, $\Delta E_s$ is symmetrically distributed around zero, therefore for every positive value of $\Delta E_s$ there is an equal but negative value that is subject to exactly the same nonlinear effect. The net result is that these pairs always form equal and opposite contributions to the integral, which therefore cancel, leaving no effect due to $\Delta E_s$.

We have analyzed the effect that noise has on the computed transmittance, just as we previously analyzed the effect that the sample transmittance has on the computed noise value. We can experimentally measure the variation in noise level due to the sample transmittance. On the other hand, we will not be able to realize the effect of noise on the computed transmittance, for reasons we will discover in our next chapter, which will deal with the noise of the transmittance when the energy is low, or the noise is high, so that again we cannot make the “low noise” approximation we made previously.
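The multiplication-factor curve discussed in this chapter can be checked with a short numerical sketch. It is not the authors' MATLAB routine. It assumes zero-mean Normal noise, works in units of sigma (so one curve in the variable $E_r/\sigma$ stands for the whole family), and handles the pole by adding together the contributions at equal distances on either side of it, which is the cancellation argument given above turned into code. The function names and the grid are our own choices.

```python
import math

def normal_pdf(z):
    """Standard Normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def multiplication_factor(er_over_sigma, step=0.01, tail=8.0):
    """Expected value of Er/(Er + dEr) for Normally distributed dEr.

    With a = Er/sigma and z = dEr/sigma, substitute u = a + z (the distance
    from the pole at z = -a) and add the contributions at +u and -u. The
    result is the well-behaved integrand [pdf(u - a) - pdf(u + a)] / u, u > 0.
    """
    a = er_over_sigma
    n = int((a + tail) / step)
    total = 0.0
    for k in range(n):
        u = (k + 0.5) * step          # midpoint rule never lands on u = 0
        total += (normal_pdf(u - a) - normal_pdf(u + a)) / u
    return a * total * step

grid = [i * 0.01 for i in range(0, 501)]        # Er/sigma from 0 to 5
curve = [multiplication_factor(a) for a in grid]
peak_value = max(curve)
peak_location = grid[curve.index(peak_value)]
print(f"maximum multiplier {peak_value:.3f} at Er = {peak_location:.2f} sigma")
```

The curve starts at zero, rises to a single maximum and then falls back toward unity; the maximum should land close to the values quoted in the text (a multiplier of roughly 1.28 when $E_r$ is a little over two standard deviations).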

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 15(10), 24–25 (2000).
2. Mark, H. and Workman, J., Spectroscopy 15(11), 20–23 (2000).
3. Mark, H. and Workman, J., Spectroscopy 15(12), 14–17 (2000).
4. Mark, H. and Workman, J., Spectroscopy 3(1), 44–48 (1988).

44

Analysis of Noise: Part 5

This chapter is the continuation of Chapters 40–43 from a set of articles [1–4] dealing with the rigorous derivation of the expressions relating instrument (and other) noise to its effects on the spectra we observe. Chapter 40 in this set was an overview; since then we have been analyzing the effect of noise on spectra, when the noise is constant detector noise, that is, noise that is independent of the strength of the optical signal. Inasmuch as we are dealing with a continuous set of chapters (40 through 53) on the same subject, we continue our discussion by serially numbering our equations, figures, use of symbols, and so on, as though there were no break across these chapters.

It seems we said something wrong. When we first began this series of chapters (starting at 40) dealing with the effects of various kinds of noise on spectra [1, 2], we said that there does not seem to have been any recent attention paid to the question of noise in spectra. It turns out that that is not quite true. Edward Voigtman pointed out that, in fact, he had performed and published computer simulation studies of just this subject [5, 6]. His studies were based on computer simulations of the behavior of various analytical instruments in various situations, using a simulation engine described in an Analytical Chemistry Report [7]. In addition to the simulations of spectrometers, he also published simulations of polarimeters [8, 9] with results that are interesting, if not of direct application to our current study. The diagrams he published [5] clearly show the difference in the optimum absorbance values (i.e., minimum relative absorbance error) between these simulations and the conventional theory in use previously. Unfortunately the noise levels of the simulations were too high to precisely determine the actual minimum. When Dr.
Voigtman contacted us to inform us of these papers, we discussed the results he obtained, and he revealed that due to the limitations of the computer hardware available at the time the simulations were performed, he could not use more than a few hundred repeats of the Monte-Carlo experiments, resulting in the high noise levels observed. Having seen our early Chapters 40 and 41 dealing with this topic from the papers first published [1, 2], he reprogrammed his simulation engine to perform new simulations and compared the results with the exact solution we derived (see equation 41-19 [2]). With new hardware allowing use of much more extensive Monte-Carlo calculations, he found excellent agreement (E. Voigtman, 2001, personal communication). We are grateful to Dr Voigtman for pointing out the previous literature that we had missed, as well as sharing the results of his new simulations with us.

Now let us recap where we came from in our discussion, in this mini-series-within-a-book, and where we are going. In Chapter 41, referenced in [2], we demonstrated that, because previous treatments of this topic failed to take into account the effect of the noise of the reference reading, they did not come up with the rigorously correct formula to describe the effect of transmittance on the computed value of the noise. The rigorously exact solution to this situation shows that the noise level of a transmittance


spectrum increases with the transmittance of the sample, rather than being independent of the sample characteristics, as previously thought. We then continued the development of those equations in Chapter 42 [3] to show the effect of the random noise on absorbance spectra, and on the relative precision, $SD(\Delta A)/A$, in both cases comparing the result of the rigorous treatment of the topic to the previous mathematical analysis, and showing that in both cases the results from the rigorous treatment differ slightly but noticeably from the previous results. Finally we developed and solved the equations for the minimum in the curve of $SD(\Delta A)/A$, this being the generally accepted criterion for determining the best value of transmittance (or absorbance) that a sample should have, to obtain the most accurate results from this form of spectroscopic chemical analysis. Our conclusion here was that the optimum value of transmittance under these conditions, that is, constant detector noise, is approximately 33 %T rather than the previously accepted 36.8 %T.

We next noted in Chapter 43 [4] that all the results obtained up until that point were relevant only to the condition where the detector noise was small compared to the reference signal, and therefore the S/N ratio was high. We then noted that if that condition did not hold for any particular set of measurements, then other phenomena also come into action. We then pointed out that under low-noise conditions the signal can affect the noise level, but under conditions where the signal is weak or the noise excessive, the noise can affect the computed transmittance, as well.
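The two optimum transmittance values quoted in this recap can be illustrated with a small grid search. This is a sketch under stated assumptions, not the derivation of Chapter 42: it assumes the small-noise relation $\Delta A \approx -0.4343\,\Delta T/T$ with $A = -0.4343 \ln T$, takes the older theory to mean a constant $SD(\Delta T)$, and takes the rigorous high-signal result to mean $SD(\Delta T)$ proportional to $\sqrt{1+T^2}$, as stated for equation 41-19 later in this chapter. Constant factors are dropped because they do not move the minimum; all function names are ours.

```python
import math

def relative_error_old(t):
    """Relative absorbance error (arbitrary units) if SD(dT) is constant."""
    return 1.0 / (t * abs(math.log(t)))

def relative_error_rigorous(t):
    """Same quantity with SD(dT) proportional to sqrt(1 + T^2)."""
    return math.sqrt(1.0 + t * t) / (t * abs(math.log(t)))

def argmin_on_grid(func, lo=0.01, hi=0.99, points=9801):
    """Transmittance on an evenly spaced grid that minimizes func."""
    grid = [lo + i * (hi - lo) / (points - 1) for i in range(points)]
    return min(grid, key=func)

t_old = argmin_on_grid(relative_error_old)
t_new = argmin_on_grid(relative_error_rigorous)
print(f"constant-noise theory: {100 * t_old:.1f} %T")
print(f"rigorous treatment:    {100 * t_new:.1f} %T")
```

A grid search is used rather than solving for the stationary point analytically, simply because it makes no further assumptions about the shape of the curve.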
The expressions we obtained showed that as the reference signal gets weaker and weaker (or the noise gets larger and larger), the system first reaches a point where the expected value of $T$ is larger than $E_s/E_r$, and as the reference signal continues to decrease, the multiplying factor first goes through a maximum and then decreases, so that the expected value of $T$ approaches zero as the reference signal energy, $E_r$, approaches zero.

We are now ready in this chapter to consider the behavior of the noise under conditions where it is not small compared to the signal. We start with the definition of transmittance, as we pointed out previously, and we rewrite the equation here:

$$
T = \frac{E_s}{E_r} \tag{44-6}
$$

To put equation 44-6 into a usable form under the conditions we wish to consider, we could start from any of several points of view: the statistical approach of Hald (see [10], pp. 115–118), for example, which starts from fundamental probabilistic considerations and also derives confidence intervals (albeit for various special cases only); the mathematical approach (e.g., [11], pp. 550–554); or the Propagation of Uncertainties approach of Ingle and Crouch ([12], p. 548). Inasmuch as any of these starting points will arrive at the same result when done properly, the choice of how to attack an equation such as equation 44-6 is a matter of familiarity, simplicity and, to some extent, taste.

At this point, however, we again need to take cognizance of comments we received after the material of this chapter was published as a column. One of our respondents noted that the analysis performed could be done in a different way, a way which might be superior to the way we did it. Normally, if we agree with someone who takes issue with our work we would simply publish a correction, or, when rewriting the material for this book, use the corrected form (as we have done in various places). In this case, however, that seems inappropriate, for several reasons. First, we are not convinced that


our original approach is “wrong”, therefore we wish to retain it. Secondly, some of our readers may wish to refresh themselves about our original material. Thirdly, some of our readers may wish to compare the two approaches for themselves, to decide if the original one is “wrong” or simply “not as good”, or whether, in fact, the new analysis is better. Therefore we present, at this point, the original analysis of the situation, the same way it was presented in the original column except, perhaps, for some minor enhancements in the wording to improve the comprehensibility. Later on in this chapter, under the heading “Alternate Analysis”, we present the new analysis, as recommended.

Therefore, continuing as we originally did, we note that we, being chemists and spectroscopists, and writing for spectroscopists, will use the Propagation of Uncertainties approach of Ingle and Crouch:

$$
dF(C, D) = \frac{\partial f(C, D)}{\partial C}\,dC + \frac{\partial f(C, D)}{\partial D}\,dD \tag{44-64}
$$

Note that we use the letters C, D to represent the variables in equation 44-64 to avoid confusion with our usage of A to mean absorbance. Applying this to equation 44-6:

$$
\Delta T = \frac{\partial (E_s/E_r)}{\partial E_s}\,\Delta E_s + \frac{\partial (E_s/E_r)}{\partial E_r}\,\Delta E_r \tag{44-65}
$$

$$
\Delta T = \frac{\Delta E_s}{E_r} + \frac{-E_s\,\Delta E_r}{E_r^2} \tag{44-66}
$$

As usual, we take the variance of this:

$$
\mathrm{Var}(\Delta T) = \mathrm{Var}\left(\frac{\Delta E_s}{E_r} + \frac{-E_s\,\Delta E_r}{E_r^2}\right) \tag{44-67}
$$

and apply first the theorem that $\mathrm{Var}(A + B) = \mathrm{Var}(A) + \mathrm{Var}(B)$:

$$
\mathrm{Var}(\Delta T) = \mathrm{Var}\left(\frac{\Delta E_s}{E_r}\right) + \mathrm{Var}\left(\frac{-E_s\,\Delta E_r}{E_r^2}\right) \tag{44-68}
$$

and then the theorem that $\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)$:

$$
\mathrm{Var}(\Delta T) = \frac{1}{E_r^2}\,\mathrm{Var}(\Delta E_s) + \left(\frac{-E_s}{E_r^2}\right)^2 \mathrm{Var}(\Delta E_r) \tag{44-69}
$$

and continue as before by replacing the noise of both readings, $\Delta E_s$ and $\Delta E_r$, with the same generic noise, $\Delta E$:

$$
\mathrm{Var}(\Delta T) = \left(\frac{1}{E_r^2} + \frac{E_s^2}{E_r^4}\right)\mathrm{Var}(\Delta E) \tag{44-70}
$$

and finally take square roots to obtain:

$$
SD(\Delta T) = \sqrt{\frac{1}{E_r^2} + \frac{E_s^2}{E_r^4}}\;SD(\Delta E) \tag{44-71}
$$
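Equation 44-71 is easy to evaluate directly, and building it a second time from the individual partial derivatives is a quick check that the algebra in equations 44-66 through 44-70 hangs together. The sketch below is ours, not the authors'; the function names and the example readings are hypothetical.

```python
import math

def sd_transmittance(e_s, e_r, sd_e):
    """First-order SD of T = Es/Er according to equation 44-71."""
    return math.sqrt(1.0 / e_r**2 + e_s**2 / e_r**4) * sd_e

def sd_transmittance_from_partials(e_s, e_r, sd_e):
    """Same quantity built term by term from equations 44-66 to 44-69."""
    d_t_d_es = 1.0 / e_r            # partial derivative of Es/Er w.r.t. Es
    d_t_d_er = -e_s / e_r**2        # partial derivative of Es/Er w.r.t. Er
    variance = (d_t_d_es**2 + d_t_d_er**2) * sd_e**2
    return math.sqrt(variance)

# Hypothetical readings: reference 1.0, sample 0.5 (T = 0.5), noise SD 0.01.
print(sd_transmittance(0.5, 1.0, 0.01))
print(sd_transmittance_from_partials(0.5, 1.0, 0.01))
```

Factoring $1/E_r^2$ out of the radical shows that the same expression can be written as $\sqrt{1+T^2}\,SD(\Delta E)/E_r$, which is a convenient form when the transmittance rather than the sample energy is known.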


This is clearly a function of both $E_r$ and $E_s$; in the regime we are concerned with in this chapter, however, as $E_r$ approaches 0, the second term under the radical dominates the expression, although clearly the point at which its numerical value becomes large compared to $1/E_r^2$ will depend on the value of $E_s$ as well, or equivalently, the transmittance of the sample. Here, again, therefore, the behavior of the noise of the transmittance must be expressed as a family of curves. Figures 44-7 and 44-8 present the behavior of this family of curves.

Note that equation 44-71 can be reduced to equation 41-19 [2], which is appropriate when the signal-to-noise ratio is high and may be considered constant. Under these conditions $E_r$ is large, the second term under the radical is small, and the first term under the radical, which is independent of $E_s$, dominates; then the noise of the transmittance increases with $T$ as $\sqrt{1 + T^2}$ and inversely with the reference energy. Here, however, under low-signal/high-noise conditions, where the variation of $E_r$ cannot be ignored and therefore the S/N ratio varies, we must use the full expression of equation 44-71. Note further that when $E_r$ is small enough that, as we noted above, the second term under the radical dominates, then

$$
SD(\Delta T) = \sqrt{\frac{T^2}{E_r^2}}\;SD(\Delta E) = \frac{T}{E_r}\,SD(\Delta E) \tag{44-72}
$$

The noise of the transmittance thus becomes directly proportional to $T$ and inversely proportional to $E_r$. Under these conditions, the noise of the transmittance approaches infinite values as $E_r$ approaches zero, even as the expected value of the transmittance approaches zero, as we saw in Chapter 43 [4]. To summarize the effects at low signal-to-noise, for comparison with the high signal-to-noise case summarized above: here the noise of the transmittance increases directly with $T$ and still inversely with the reference energy.

We now wish to follow through, as we did before, on finding the “optimum” value for sample transmittance under these conditions.
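To do this, we start with equation 44-24 (reference [3]):

$$
\Delta A = \frac{-0.4343\,E_r\left(E_r\,\Delta E_s - E_s\,\Delta E_r\right)}{E_r\left(E_r + \Delta E_r\right)E_s} \tag{44-24}
$$

This is the point at which, in the previous development, we considered the effect of letting $\Delta E_r$ become negligible, but of course in this case we wish to investigate the small-signal/large-noise behavior. We now, therefore, go directly to dividing $\Delta A$ by $A$ (from equation 44-20b):

$$
\frac{\Delta A}{A} = \frac{\dfrac{-0.4343\,E_r\left(E_r\,\Delta E_s - E_s\,\Delta E_r\right)}{E_r\left(E_r + \Delta E_r\right)E_s}}{-0.4343\,\ln T} \tag{44-73}
$$

$$
\frac{\Delta A}{A} = \frac{E_r\left(E_r\,\Delta E_s - E_s\,\Delta E_r\right)}{\left(E_r + \Delta E_r\right)E_s\,E_r\,\ln T} \tag{44-74}
$$

$$
\frac{\Delta A}{A} = \frac{1}{\ln T}\left(\frac{E_r\,\Delta E_s}{E_s\left(E_r + \Delta E_r\right)} + \frac{-\Delta E_r}{E_r + \Delta E_r}\right) \tag{44-75}
$$

$$
\frac{\Delta A}{A} = \frac{1}{T\,\ln T}\,\frac{\Delta E_s}{E_r + \Delta E_r} + \frac{1}{\ln T}\,\frac{-\Delta E_r}{E_r + \Delta E_r} \tag{44-76}
$$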

To determine the variance of $\Delta A/A$ we perform our usual exercise of taking the variance of both sides of equation 44-76 and applying our two favorite theorems; the result is

$$
\mathrm{Var}\left(\frac{\Delta A}{A}\right) = \left(\frac{1}{T\,\ln T}\right)^2 \mathrm{Var}\left(\frac{\Delta E_s}{E_r + \Delta E_r}\right) + \left(\frac{1}{\ln T}\right)^2 \mathrm{Var}\left(\frac{-\Delta E_r}{E_r + \Delta E_r}\right) \tag{44-77}
$$

We cannot simplify this equation further; in particular, we cannot separate out the variances of $\Delta E_s$ and $\Delta E_r$ in order to replace them with the same generic value. To determine the variance of $\Delta A/A$, that is, the relative precision (in chemists' terms), we need to evaluate the variance of the two terms in equation 44-77. As we had observed previously, as the value of $\Delta E_r$ approaches $-E_r$, the value of the expressions attains infinite values. However, a difference here is that when computing the variance, these values are squared, and hence the computations are always done using positive values. This differs from our previous case, where the presence of both positive and negative values afforded the opportunity for cancellation of near-infinite contributions; we do not have that situation here. Therefore we are faced with the possibility that the variance will be infinite.

An empirical test of this possibility was performed by computing values of the variance of the two terms in equation 44-77. The Normal random number generator of MATLAB was used to create multiple values of Normally distributed random numbers for $\Delta E_r$ and $\Delta E_s$; these were plugged into the two expressions of equation 44-77 and the variance computed. Between 100 and $10^6$ values were used in each computation of the variance. When $E_r$ was more than five standard deviations away from the center of the Normal distribution representing $\Delta E_r$, the computed variance was fairly small and reasonably stable, and decreased as $E_r$ was moved further away from the center of $\Delta E_r$. This might be considered an empirical determination of the point of demarcation of the “small-signal” case.
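The empirical test just described can be imitated with Python's standard library in place of MATLAB's Normal random number generator. The sketch below is an illustration under our own assumptions (unit noise standard deviation, 10,000 readings per run, five seeds per signal level, and only the reference-noise term of equation 44-77); the helper name `variance_of_reference_term` is hypothetical.

```python
import random
import statistics

def variance_of_reference_term(e_r, sigma=1.0, n=10_000, seed=0):
    """Sample variance of -dEr/(Er + dEr) for n Normally distributed dEr."""
    rng = random.Random(seed)
    values = []
    for _ in range(n):
        d_er = rng.gauss(0.0, sigma)        # one simulated noise reading
        values.append(-d_er / (e_r + d_er))
    return statistics.variance(values)

# Repeat the experiment with different seeds at several signal levels
# (Er expressed in units of the noise standard deviation).
for e_r in (10.0, 5.0, 2.0, 1.0):
    runs = [variance_of_reference_term(e_r, seed=s) for s in range(5)]
    print(f"Er = {e_r:4.1f} sigma: variances from {min(runs):.3g} to {max(runs):.3g}")
```

Comparing the spread between seeds at each signal level is the point of the exercise: a narrow spread suggests a usable estimate, while a spread of several orders of magnitude suggests that a few readings near the pole are dominating the result. The sample-noise term shares the same denominator, so it can be examined the same way.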
When $E_r$ was moved below five standard deviations, the computed value of the variance became very unstable; computed values of the variance would differ by as much as four orders of magnitude. The closer $E_r$ came to the center of $\Delta E_r$, the more erratic the computed variance became. It was clear that bringing $E_r$ close to the center of $\Delta E_r$ afforded more opportunity for a given reading of the noise to become close to $-E_r$, thus giving a value approaching infinity that would be included in the calculation. Furthermore, for a given relationship between $E_r$ and $\Delta E_r$, the more readings that were included in the computation, the higher the values of variance that would be calculated. For example, with 100 readings, values of variance might fall between $10^1$ and $10^4$, while with 10,000 readings calculated variance values would fall in the range of approximately $10^3$ to $10^6$. This is attributed to the increased likelihood of more data points being close to $-E_r$, and also of at least a few points being closer to $-E_r$ than with fewer data.

Another test of whether the variance actually diverges and becomes infinite is the same as the test we applied in the previous chapter: to integrate the expressions in equation 44-77 in a small region around the point $\Delta E_r = -E_r$ using different intervals of integration and see if the values converge or diverge. Basically, except for a multiplying factor these are both the same expression, so evaluating the expression once suffices to settle the question for both of them. Furthermore, since we are integrating over values of


Table 44-2 Value of integral of $1/E_r^2$ over range −0.01 to +0.01

Integration interval    Value of integral
$10^{-2}$               2.0000000000000000e+002
$10^{-3}$               3.0995354623330845e+003
$10^{-4}$               3.2699678003698089e+004
$10^{-5}$               3.2878691333625099e+005
$10^{-6}$               3.2896681436917488e+006
$10^{-7}$               3.2898481337470137e+007

variance, the expression that needs to be integrated is $1/E_r^2$. The result of performing this test is presented in Table 44-2. In contrast to the previous test results, the values clearly grow larger without bound as the integration interval is reduced.

The conclusion from all this is that the variance, and therefore the standard deviation, attains infinite values when the reference energy is so low that its range of readings includes the value zero. However, in a probabilistic way it is still possible to perform computations in this regime and obtain at least some rough idea of how the various quantities involved will change as the reference energy approaches zero; after all, real data is obtained with a finite number of readings, each of which is finite, and will give some finite answer. What we can do for the rest of this current analysis is perform empirical computations to find out what the expectation for that behavior is; we will do that in the next chapter.
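The divergence shown in Table 44-2 is simple enough to reproduce. The sketch below assumes the tabulated values are plain Riemann sums of the inverse square over grid points at whole multiples of the interval, with the point at zero omitted; that reading is ours, as is the function name `sum_inverse_square`.

```python
def sum_inverse_square(step, half_width=0.01):
    """Riemann sum of 1/x^2 on [-half_width, +half_width], omitting x = 0."""
    n = round(half_width / step)
    total = 0.0
    for k in range(1, n + 1):
        x = k * step
        total += 2.0 * step / (x * x)   # the points +x and -x contribute equally
    return total

for exponent in range(2, 6):
    step = 10.0 ** (-exponent)
    print(f"{step:.0e}  {sum_inverse_square(step):.6e}")
```

Written out, the sum is $(2/h)\sum_{k=1}^{n} 1/k^2$ for interval $h$; since the series converges to $\pi^2/6$, the total grows essentially in inverse proportion to $h$, which is why each tenfold reduction of the interval multiplies the tabulated value by about ten.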

ALTERNATE ANALYSIS

Here we present the revised analysis of the effect on the expected noise level of noise that is not small compared to the signal level $E_r$. Before we proceed, however, there is a technical point we need to clear up. This is the numbering of the equations, figures, etc. The previous column/chapter ended with equation 43-63. Therefore it is appropriate to begin the analysis with equation 43-64, as we did above in this chapter, and in the original analysis published in the columns. For obvious reasons, however, we cannot simply repeat using the same equation/figure/etc. numbers that we did above. Neither can we simply continue from the last number used in the first analysis, above, because then we would have to renumber all equations, figures, etc., for the rest of this series of chapters. While laborious, that could be done, but it would raise another, insoluble, problem: it would put the numbering of the equations, etc., out of synchronization with the numbering of the original columns. Anybody reading the later chapters and wishing to compare them with the original columns would then find that task well-nigh impossible.

Fortunately, none of the equations developed in this chapter, nor the figures, used any suffix, as was occasionally done in previous chapters (we do refer to equation 42-20b above, but that equation is in a previous chapter and we will not repeat the use of equation 42-20 here. We will also copy equation 43-52b from the previous column, but the b suffix does not signify a new equation, since it is the equation used previously; also, a b suffix is not indicative of a copy of an equation number in this section). Therefore, we can distinguish the numbering of any equations or other numbered entities in this


section by appending the suffix “a” to the number, without causing confusion with other corresponding entities.

Now we are ready to proceed. We reached this point from the discussion just prior to equation 44-64, and there we noted that a reader of the original column felt that equation 44-64 was being incorrectly used. Equation 44-64, of course, is a fundamental equation of elementary calculus and is itself correct. The problem pointed out was that the use of the derivative terms in equation 44-64 implicitly states that we are using the small-noise model, which, especially when changing the differentials to finite differences in equation 44-65, results in incorrect equations.

In our previous column [4] we had created an expression for $T + \Delta T$ (as equation 44-51) and separated out an expression for $\Delta T$ (as equation 44-52b). We present these two equations here:

$$
T + \Delta T = \frac{E_s}{E_r + \Delta E_r} + \frac{\Delta E_s}{E_r + \Delta E_r} \tag{44-51}
$$

from which we concluded that:

$$
\Delta T = \frac{\Delta E_s}{E_r + \Delta E_r} \tag{44-52b}
$$

At this point we would like to compute the variance of $\Delta T$, but simply computing the variance of $\Delta E_s/(E_r + \Delta E_r)$ would also not be correct, since it would ignore the influence of the variability of the first term in equation 44-51 [4], and not take its contribution to the variance into proper account. Therefore the expression for $\Delta T$ in equation 44-52b is not correct, even though it is the result of the formal breakup of equation 44-51 [4]. We should be using a formula such as:

$$
\Delta T = \frac{E_s}{E_r + \Delta E_r} + \frac{\Delta E_s}{E_r + \Delta E_r} \tag{44-64a}
$$

in order to include the variability of the first term, also.

This, however, leads to another problem: subtracting equation 44-64a from equation 44-51 leaves us with the result that $T = 0$. Furthermore, the definition of $T$ gives us the result that $E_s$ is zero, and that therefore $\Delta T$ is in fact equal to the expression given by equation 44-52b anyway, despite our efforts to include the contribution to the variance of the first term in equation 44-51.

Our conclusion is that the original separation of equation 44-51 into two equations, while it served us well for computing $T_M$ and $T_A$, fails us here. This is because $\Delta E_s$ and $\Delta E_r$ are random variables and we cannot treat their influences separately; we have no expectation that they will either cancel or reinforce each other, wholly or partially, in any particular measurement. Therefore when we compute the variance of $\Delta T$ we wish to retain the contribution from both terms.

This also raises a further question: the analysis of equation 44-52a by itself served us well, as we noted; but was it proper, or should we have maintained all of equation 44-51, as we find we must do here? The answer is yes, it was correct, and the justification is given toward the end of the previous column [4]. The symmetry of the expression when


averaged over values of Es means that the average will be zero for each value of Er , and therefore the average of the entire second term will always be zero. Therefore, the best way to maintain the entire expression is to go back still a further step, and note that the ultimate source of equation 44-51 was equation 44-5 [2]: T + T =

Es + Es Er + Er

(44-5)

Therefore we solve equation 44-5 for T and, noting the definition of T , we find: T =

Es + Es Es − Er + Er Er

(44-65a)

Then we take the variance of both sides: � VarT = Var

Es + Es Es − Er + Er Er

� (44-66a)

Once again applying the rule that the variance of a sum is the sum of the variances, we obtain:

$$\mathrm{Var}(\Delta T) = \mathrm{Var}\left(\frac{E_s + \Delta E_s}{E_r + \Delta E_r}\right) + \mathrm{Var}\left(\frac{E_s}{E_r}\right) \tag{44-67a}$$

Since Es/Er is the true transmittance of the sample, the value of T for a given sample is constant, and therefore the variance of that term is zero, resulting in:

$$\mathrm{Var}(\Delta T) = \mathrm{Var}\left(\frac{E_s + \Delta E_s}{E_r + \Delta E_r}\right) \tag{44-68a}$$

The variables in equation 44-68a are again not separable. While we could formally split equation 44-68a into the sum of two variances:

$$\mathrm{Var}(\Delta T) = \mathrm{Var}\left(\frac{E_s}{E_r + \Delta E_r}\right) + \mathrm{Var}\left(\frac{\Delta E_s}{E_r + \Delta E_r}\right) \tag{44-69a}$$

that would not be correct because the two variances that we wish to add have a common term Er + ΔEr and therefore are not independent of each other, as application of the rule for adding variances requires [2]. Also, evaluation of a variance by integration requires the integral of the square of the varying term, which as we have seen previously [13] is always positive and therefore the integrals of both terms of equation 44-69a diverge.

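Thus we conclude that we must compute the variance of ΔT directly from equation 44-68a and the definition of variance:

$$\mathrm{Var}(\Delta T) = \frac{\sum_{i=1}^{n}\left[\frac{E_s + \Delta E_s}{E_r + \Delta E_r} - \overline{\left(\frac{E_s + \Delta E_s}{E_r + \Delta E_r}\right)}\right]^2}{n-1} \tag{44-70a}$$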


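We can learn something interesting by again noting, as we did previously [4], that ΔEs has a mean of zero; therefore equation 44-70a becomes:

$$\mathrm{Var}(\Delta T) = \frac{\sum_{i=1}^{n}\left[\frac{E_s + \Delta E_s}{E_r + \Delta E_r} - \overline{\left(\frac{E_s}{E_r + \Delta E_r}\right)}\right]^2}{n-1} \tag{44-71a}$$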

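and by splitting up the first term in the numerator of equation 44-71a into its two parts:

$$\mathrm{Var}(\Delta T) = \frac{\sum_{i=1}^{n}\left[\frac{E_s}{E_r + \Delta E_r} + \frac{\Delta E_s}{E_r + \Delta E_r} - \overline{\left(\frac{E_s}{E_r + \Delta E_r}\right)}\right]^2}{n-1} \tag{44-72a}$$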

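and rearranging the terms:

$$\mathrm{Var}(\Delta T) = \frac{\sum_{i=1}^{n}\left[\frac{E_s}{E_r + \Delta E_r} - \overline{\left(\frac{E_s}{E_r + \Delta E_r}\right)} + \frac{\Delta E_s}{E_r + \Delta E_r}\right]^2}{n-1} \tag{44-73a}$$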

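and again using the definition of variance:

$$\mathrm{Var}(\Delta T) = \mathrm{Var}\left(\frac{E_s}{E_r + \Delta E_r}\right) + \frac{\sum_{i=1}^{n}\left(\frac{\Delta E_s}{E_r + \Delta E_r}\right)^2}{n-1} \tag{44-74a}$$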

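and then the definition of the average value:

$$\mathrm{Var}(\Delta T) = \mathrm{Var}\left(\frac{E_s}{E_r + \Delta E_r}\right) + \frac{n}{n-1}\,\overline{\left(\frac{\Delta E_s}{E_r + \Delta E_r}\right)^2} \tag{44-75a}$$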

where we note that n/(n − 1) → 1 as n becomes indefinitely large. Of course, the noise level we want will be the square root of equation 44-75a. We have previously seen, in equation 44-77 [13], that the variance term in equation 44-75a diverges, and clearly, as ΔEr → −Er the second term in equation 44-75a also becomes infinitely large. However, as we discussed at the conclusion of the original analysis, using finite differences means that the probability of a given data point having ΔEr close enough to −Er to cause a problem is small, especially as Er increases. This allows for the possibility that a finite value for an integral can be computed. To recapitulate some of that here, it was a matter of noting two points. First, as Er gets further and further away from zero (in terms of SD), it becomes increasingly unlikely that any given value of ΔEr will be close enough to −Er to cause trouble. Second, in a real instrument there is, of necessity, some maximum limit on the value that 1/(Er + ΔEr) can attain, due to the inability to contain an actually infinite number. Therefore it is not unreasonable to impose a corresponding limit on our calculations, to correspond to that physical limit. We now consider how to compute the variance of ΔT, according to equation 44-68a. Ordinarily we would first discuss converting the summations of finite differences to


integrals, as we did previously, but we will forbear that, leaving it as an exercise for the reader. Instead we will go directly to consideration of the numerical evaluation of equation 44-68a, since a conversion to an integral would require a back-conversion to finite differences in order to perform the calculations. We wish to evaluate equation 44-68a for different values of Es and Er, when each is subject to random variation. Note that although Var(ΔEs) = Var(ΔEr), we cannot simply set the two terms equal to a common generic value of ΔE as we did previously, since that would imply that the instantaneous values of ΔEs and ΔEr were the same. Of course they are not, since we assume that they are independent noise contributions, although they have the same variance. Under these conditions it is simplest to work with equation 44-68a itself, rather than any of the other forms into which we found it convenient to convert it for the illustrations of the various points we presented and discussed. There are still a variety of ways we can approach the calculations. We could assume that Es or Er were constant and examine how the noise varies as the other was changed. We could also hold the transmittance constant and examine how the transmittance noise varies as both Es and Er are changed proportionately. What we will actually do here, however, is all of these. First we will assume that the ratio Es/Er, representing T, the true transmittance of the sample, is constant, and examine how the noise varies as the S/N ratio is changed by varying the value of Er, for a constant noise contribution to both Es and Er.
The noise level itself, of course, is the square root of the expression in equation 44-67a:

$$\mathrm{SD}(\Delta T) = \sqrt{\mathrm{Var}\left(\frac{E_s + \Delta E_s}{E_r + \Delta E_r}\right)} \tag{44-76a}$$

To do the computations, we again use the random number generator of MATLAB to produce Normally-distributed random numbers with unity variance to represent the noise; values of Er will then directly represent the S/N ratio of the data being evaluated. For the computations reported here, we calculate the variance from 100,000 synthetic values of the expression on the RHS of equation 44-76a, for each combination of conditions we investigate. A graph of the transmittance noise as a function of the reference S/N ratio is presented in Figure 44-7a-1, and an expanded portion of it is shown in Figure 44-7a-2. The “true” transmittance Es/Er was set to unity (i.e., 100%T). The inevitable existence of a limit on the value of TM, as described in the section following equation 44-75a, was examined in Figure 44-7a-1 by performing the computations for two values of that limit, setting the limit value (somewhat arbitrarily, to be sure) to 1,000 and 10,000, corresponding to the lower and upper curves, respectively. Note that there are effectively two regimes in Figure 44-7a-1, with the transition between regimes occurring when the value of the S/N ratio equals approximately 4. When the value of Er was greater than approximately four, i.e., the S/N ratio was greater than four, the curves are smooth and appear to be well-behaved. When Er was below an S/N of four, the graph entered a regime of behavior that shows an appreciable random component. The transition point between these two regimes would seem to represent an implicit definition of the “low noise” versus the “high noise” conditions of measurement. In the low-noise regime the transmittance noise decreases smoothly and continuously as

[Plot: transmittance noise (0 to 140) versus S/N (Er/ΔEr) (0 to 10).]

Figure 44-7a-1 Transmittance noise as a function of reference S/N ratio, for alternate analysis (equation 44-68a). The sample transmittance was set to unity. The limit for the value of (Es + ΔEs)/(Er + ΔEr) was set to 10,000 for the upper curve and to 1000 for the lower curve. (see Color Plate 6)

[Plot: transmittance noise (0 to 1.2) versus S/N (Er/ΔEr) (4 to 10).]

Figure 44-7a-2 Expansion of Figure 44-7a-1. (see Color Plate 7)

the S/N ratio increases. This was verified by other graphs (not shown) that extended the value of S/N ratio beyond what is shown here. The “high-noise” regime seen in Figure 44-7a-1 is the range of values of S/N ratio where the computed standard deviation is grossly affected by the closeness of the approach of individual values of ΔEr to −Er. This is, in fact, a probabilistic effect, since

[Plot: transmittance noise (0 to 140) versus S/N (Er/ΔEr) (0 to 10). Curves: Monte-Carlo (equation 44-76a); Theory (equation 44-19); Approx (equation 44-52b).]

Figure 44-8a Comparison of empirically determined transmittance noise value with those determined according to the low-noise approximations of equation 44-19 and equation 44-52b. (see Color Plate 8)

it depends not only on how closely the two numbers approach each other, but also on how often that occurs; a single or only a few “close approaches” will be lost in a large number of readings where that does not happen. As we will see below, there is indeed a regime where the theoretical “low-noise” approximation differs from the results we find here, without becoming randomized. Changing the number of values of (Es + ΔEs)/(Er + ΔEr) used for the computation of the variance made no difference in the nature of the graph. As is the case in Figure 44-7a-1, the transition between the low- and high-noise regimes continues to occur at a value between 4 and 5. Figure 44-8a shows the graph of transmittance noise computed empirically from equation 44-76a, compared to the transmittance noise computed from the theory of the low-noise approximation, as per equation 44-19 [2], and to the approach, under question, of using equation 44-52b. We see that there is a third regime, where the difference between the actual noise level and the low-noise approximation is noticeable, but the computed noise has not yet become subject to the extreme fluctuations engendered by the too-close approach of ΔEr to −Er. Since the empirically determined curve approaches the theoretical curve asymptotically as the S/N increases, where the separation becomes “noticeable” will depend on how hard you look, but there is certainly a region in which this occurs, in any case. This is the situation we alluded to above, representing the “middle ground” of the transmittance noise. Figure 44-9a-1 shows what happens to the noise level, for the same condition of constant “sample transmittance”, as a function of S/N, for different values of sample transmittance. As we see, in the “low noise” regime the noise has the behavior we have derived for it. However, the effect of the exaggeration of the random variations very quickly takes over, and in the “high noise” regime there is virtually no difference in the

[Plot: transmittance noise (0 to 140) versus S/N (Er/ΔEr) (0 to 10).]

Figure 44-9a-1 Transmittance noise as a function of reference S/N ratio, at various values of sample transmittance. Blue curve: T = 1. Green curve: T = 0.5. Red curve: T = 0.1. (see Color Plate 9)

[Plot: transmittance noise (0.2 to 1.2) versus S/N (Er/ΔEr) (approximately 4 to 5.2). Curves: T = 1; T = 0.5; T = 0.1.]

Figure 44-9a-2 Expansion of Figure 44-9a-1. (see Color Plate 10)

noise behavior at different values of transmittance, since that is now dominated by the divergence of the integrals involved. A verification of the effects seen in Figures 44-9a-1 and 44-9a-2, which is also an investigation that was part of our original plan, is presented in Figure 44-10a, which shows the transmittance noise as a function of the sample transmittance Es/Er. As we see, except for the occasional spike, when the S/N ratio is

[Plot: transmittance noise (0.2 to 1.2) versus transmittance (0 to 1). Curves labeled S/N = 4 and S/N = 4.5.]

Figure 44-10a Transmittance noise as a function of transmittance, for different values of reference energy S/N ratio (recall that, since the standard deviation of the noise equals unity, the set value of the reference energy equals the S/N ratio). (see Color Plate 11)

5 and even when it is only 4.5, the transmittance noise varies essentially as we saw in working out the exact solution for transmittance noise in the low-noise case. Naturally, the underlying transmittance noise value is higher when the reference S/N ratio is lower. When the S/N ratio decreases to 4, then “spikes” happen frequently enough that it becomes almost impossible to tell where the “underlying” transmittance noise level is, since the computed values are again dominated by the divergent integrals.
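The numerical experiment described above can be sketched in a few lines. The following is a minimal illustration, not the authors' MATLAB program: it uses Python's standard library, and the function name `transmittance_noise`, the fixed seed, and in particular the way the limit is applied (symmetric clipping of each computed ratio) are assumptions made here for illustration. It draws unit-variance Normal noise for both ΔEs and ΔEr, so that Er is numerically equal to the S/N ratio, and returns the standard deviation of (Es + ΔEs)/(Er + ΔEr) as in equation 44-76a.

```python
import random
import statistics


def transmittance_noise(e_r, t_true=1.0, n=100_000, limit=1000.0, seed=0):
    """Monte Carlo estimate of SD[(Es + dEs) / (Er + dEr)] (cf. equation 44-76a).

    Noise on both readings is Normal with unit variance, so e_r is the S/N ratio.
    Clipping each ratio to +/- limit is an assumed stand-in for the physical limit.
    """
    rng = random.Random(seed)
    e_s = t_true * e_r  # true sample energy, chosen so that Es / Er = t_true
    ratios = []
    for _ in range(n):
        d_es = rng.gauss(0.0, 1.0)  # noise on the sample reading
        d_er = rng.gauss(0.0, 1.0)  # independent noise on the reference reading
        denom = e_r + d_er
        # Guard against an exact zero denominator, then clip to the limit.
        ratio = limit if denom == 0.0 else (e_s + d_es) / denom
        ratios.append(max(-limit, min(limit, ratio)))
    return statistics.stdev(ratios)


if __name__ == "__main__":
    for snr in (2.0, 4.0, 5.0, 10.0, 20.0):
        print(snr, transmittance_noise(snr))
```

At large S/N the result should approach the first-order propagation-of-error value sqrt(1 + T²)/Er; at small S/N the result depends strongly on the chosen limit and on the particular random draws, which is the erratic behavior discussed in the text.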

ABSORBANCE NOISE IN THE “HIGH NOISE” REGIME

Just as equation 41-5, which led to equation 44-76a, was the starting point for investigating the behavior of transmittance noise in the high noise regime, so too is equation 42-24 the starting point for investigating the behavior of absorbance noise in the high noise regime. While we presented equation 42-24 above, in the original analysis, we did not follow through to investigate its behavior, since we went directly to the analysis of the behavior of Var(ΔA/A) instead. Therefore we present equation 44-24 again, and take this opportunity to investigate it:

$$\Delta A = \frac{-0.4343\,E_r}{E_s}\left(\frac{E_r\,\Delta E_s - E_s\,\Delta E_r}{E_r\,(E_r + \Delta E_r)}\right) \tag{44-24}$$

We therefore take the variance of ΔA:

$$\mathrm{Var}(\Delta A) = \mathrm{Var}\left[\frac{-0.4343\,E_r}{E_s}\left(\frac{E_r\,\Delta E_s - E_s\,\Delta E_r}{E_r\,(E_r + \Delta E_r)}\right)\right] \tag{44-77a}$$

Then we multiply through:

$$\mathrm{Var}(\Delta A) = \mathrm{Var}\left[\frac{-0.4343\left(E_r^2\,\Delta E_s - E_r E_s\,\Delta E_r\right)}{E_s E_r\,(E_r + \Delta E_r)}\right] \tag{44-78a}$$

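Using the definition of variance, we get:

$$\mathrm{Var}(\Delta A) = \frac{\sum_{i=1}^{n}\left[\frac{-0.4343\left(E_r^2\,\Delta E_s - E_r E_s\,\Delta E_r\right)}{E_s E_r\,(E_r + \Delta E_r)} - \overline{\left(\frac{-0.4343\left(E_r^2\,\Delta E_s - E_r E_s\,\Delta E_r\right)}{E_s E_r\,(E_r + \Delta E_r)}\right)}\right]^2}{n-1} \tag{44-79a}$$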

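Again, the mean values of ΔEs and ΔEr are both zero; therefore the mean term of equation 44-79a vanishes, leaving us with:

$$\mathrm{Var}(\Delta A) = \frac{\sum_{i=1}^{n}\left[\frac{-0.4343\left(E_r^2\,\Delta E_s - E_r E_s\,\Delta E_r\right)}{E_s E_r\,(E_r + \Delta E_r)}\right]^2}{n-1} \tag{44-80a}$$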

Again we see that the variance of the absorbance equals n/(n − 1) times the mean value of the summand of equation 44-80a, and also that we can ignore the premultiplier term n/(n − 1) for large values of n. We begin our investigation of the behavior of the absorbance noise by comparing it to the theoretical expectation from the low-noise condition according to equation 42-32 [3]. This comparison is shown in Figures 44-11a-1 and 44-11a-2. These figures show what we might expect: that as the S/N increases the computed value approaches the theoretical

[Plot: absorbance noise (0 to 8) versus S/N (Er/ΔEr) (0 to 50). Curves: Computed; Theory.]

Figure 44-11a-1 Comparison of computed absorbance noise to the theoretical value (according to equation 44-32), as a function of S/N ratio, for constant transmittance (set to unity). (see Color Plate 12)

[Plot: absorbance noise (0 to 0.35) versus S/N (Er/ΔEr) (5 to 45). Curves: Computed; Theory.]

Figure 44-11a-2 Expansion of Figure 44-11a-1. (see Color Plate 13)

value for the low-noise approximation, and also that there is an excessive bulge at very low values of S/N, apparently similar to the abnormally large values observed in the behavior of the transmittance at very low values of S/N. After performing this comparison, we will not pursue the analysis any further, since we will obtain the results we would expect to get from the analysis of the transmission behavior. There is, however, something unexpected about Figure 44-11a-1: the decrease in absorbance noise at the very lowest values of S/N, i.e., those lower than approximately Er = 1. This decrease is not a glitch or an artifact or a result of the random effects of divergence of the integral of the data such as we saw when performing a similar computation on the simulated transmission values. The effect is consistent and reproducible. In fact, it appears to be somewhat similar in character to the decrease in computed transmittance we observed at very low values of S/N for the low-noise case, e.g., that shown in Figure 43-6.
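As a hedged illustration of how the summand of equation 44-80a can be evaluated numerically, the sketch below computes the absorbance noise by Monte Carlo. It is not the authors' program: the function name `absorbance_noise`, the fixed seed, and the symmetric clipping of each ΔA value to a limit (borrowed from the treatment of the transmittance) are assumptions made here. The constant is log10(e) ≈ 0.4343, and the noise on both readings is taken to be Normal with unit variance.

```python
import math
import random

LOG10_E = math.log10(math.e)  # approximately 0.4343


def absorbance_noise(e_r, t_true=1.0, n=100_000, limit=1000.0, seed=0):
    """Square root of equation 44-80a, evaluated with simulated noise.

    t_true must be greater than zero. Clipping each delta-A value to +/- limit
    is an assumption of this sketch, not something stated in the text.
    """
    rng = random.Random(seed)
    e_s = t_true * e_r  # true sample energy
    total = 0.0
    for _ in range(n):
        d_es = rng.gauss(0.0, 1.0)
        d_er = rng.gauss(0.0, 1.0)
        denom = e_s * e_r * (e_r + d_er)
        if denom == 0.0:
            d_a = limit
        else:
            d_a = -LOG10_E * (e_r ** 2 * d_es - e_r * e_s * d_er) / denom
        d_a = max(-limit, min(limit, d_a))
        total += d_a * d_a  # the mean term is taken as zero, as in equation 44-80a
    return math.sqrt(total / (n - 1))


if __name__ == "__main__":
    for snr in (0.5, 1.0, 5.0, 20.0, 50.0):
        print(snr, absorbance_noise(snr))
```

For comparison at high S/N, first-order propagation of error gives SD(ΔA) ≈ 0.4343 · sqrt(1/Es² + 1/Er²); this is offered here only as a sanity check and is not claimed to be the form of the equation cited in the text.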

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 15(10), 24–25 (2000).
2. Mark, H. and Workman, J., Spectroscopy 15(11), 20–23 (2000).
3. Mark, H. and Workman, J., Spectroscopy 15(12), 14–17 (2000).
4. Mark, H. and Workman, J., Spectroscopy 16(2), 44–52 (2001).
5. Voigtman, E., Analytical Instrumentation 21(1&2), 43–62 (1993).
6. Voigtman, E., Analytical Chemistry 69(2), 226–234 (1997).
7. Voigtman, E., Analytical Chemistry 65, 1029A–1035A (1993).
8. Voigtman, E., Analytical Chemistry 64, 2590–2598 (1992).
9. Voigtman, E., Analyst 120(February), 325–330 (1995).
10. Hald, A., Statistical Theory with Engineering Applications (John Wiley & Sons, Inc., New York, 1952).
11. Korn, G.A. and Korn, T.M., Mathematical Handbook for Scientists and Engineers, 1st ed. (McGraw-Hill Book Company, New York, 1961).
12. Ingle, J.D. and Crouch, S.R., Spectrochemical Analysis (Prentice-Hall, Upper Saddle River, NJ, 1988).
13. Mark, H. and Workman, J., Spectroscopy 16(4), 34–37 (2001).


45 Analysis of Noise: Part 6

This chapter is the continuation of Chapters 40–44, referenced from their original papers [1–5], dealing with the rigorous derivation of the expressions relating instrument (and other) noise to its effects on the spectra we observe. Chapter 40 in this noise series was an overview; since then we have been analyzing the effect of noise on spectra, when the noise is constant detector noise, that is, noise that is independent of the strength of the optical signal. Inasmuch as we are dealing with a continuous set of chapters, we again continue our discussion by serially numbering our equations, figures, use of symbols, and so on as though there were no break. We left off in our previous Chapter 44 with having concluded that the noise level becomes infinite, both for individual noise pulses and for the variance of the noise, when the value of the reference signal actually crosses zero and becomes negative; we learned this from the following equation, which we reproduce from our previous chapter:

$$\mathrm{Var}\left(\frac{\Delta A}{A}\right) = \left(\frac{1}{T \ln T}\right)^2 \mathrm{Var}\left(\frac{\Delta E_s}{E_r + \Delta E_r}\right) + \left(\frac{1}{\ln T}\right)^2 \mathrm{Var}\left(\frac{-\Delta E_r}{E_r + \Delta E_r}\right) \tag{45-77}$$

and we showed that both variance terms become infinite at sufficiently small values of Er. However, that still leaves open the question of the behavior of the noise while the reference signal is not quite low enough for the noise to become infinite, but still small enough for the noise level to not be considered completely negligible. First of all, we must note that the two terms of equation 45-77 are not exactly the same. While we tested the behavior of the expressions using a random number generator that produces a Normal distribution of numbers with unity variance, the variance of the entire term is not necessarily unity, especially when, as in the second term of equation 45-77, the same random variable appears in both the numerator and the denominator. The first task, then, is to compare the behavior of those two terms. It was necessary to empirically determine the variances of the two terms in equation 45-77 for comparison. To do this, 10,000 random values for ΔEr, created by the MATLAB random number generator to be Normally distributed with variance = 1, were used for each of the two terms in equation 45-77, and the variance was then computed for various values of Er between 3 and 20. A different set of 10,000 random numbers was used for each different value of Er. Figure 45-9 presents the two curves obtained. It is clear that, while the variance of ΔEr/(Er − ΔEr) is larger than that of ΔEs/(Er − ΔEr) when Er is small, the two curves converge for values of Er above approximately 8 times the variance of the noise. From this it would seem, then, that when the reference signal is at least approximately 3 times its noise level as measured by its standard deviation, we are entering the “low-noise” regime that we discussed previously in Chapters 41 and 42, where the approximations made there apply [2, 3].
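The empirical comparison of the two variance terms can be sketched as follows. This is an illustrative Python version under stated assumptions, not the original MATLAB code: the function name `term_variances` and the fixed seed are hypothetical, the denominator is written as Er + ΔEr following equation 45-77 (for symmetric noise the sign of ΔEr does not change the variances), and a draw with an exactly zero denominator is simply skipped.

```python
import random
import statistics


def term_variances(e_r, n=10_000, seed=0):
    """Return (Var[dEs / (Er + dEr)], Var[-dEr / (Er + dEr)]) from n random draws.

    Both noise terms are Normal with unit variance and are drawn independently.
    """
    rng = random.Random(seed)
    es_term = []
    er_term = []
    for _ in range(n):
        d_es = rng.gauss(0.0, 1.0)
        d_er = rng.gauss(0.0, 1.0)
        denom = e_r + d_er
        if denom == 0.0:
            continue  # assumed handling of an exact zero; practically never occurs
        es_term.append(d_es / denom)
        er_term.append(-d_er / denom)
    return statistics.variance(es_term), statistics.variance(er_term)


if __name__ == "__main__":
    for e_r in range(3, 21):
        var_es, var_er = term_variances(float(e_r), seed=e_r)  # new draws per Er
        print(e_r, var_es, var_er)
```

At large Er both variances should approach 1/Er², which is the sense in which the two curves converge; at small Er the printed values depend heavily on the particular random draws.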

[Two-panel plot. Upper panel, “Variances of the two terms in equation 45-77”: variance (0 to 8) versus Er (3 to 20). Lower panel, “Expansion of plot of terms in equation 45-77”: variance (0.00 to 0.50) versus Er (3 to 20), with the two curves labeled by their respective terms.]

Figure 45-9 Values of the variance of ΔEr/(Er − ΔEr) and ΔEs/(Er − ΔEr) for various values of Er, with a Normal distribution of values for the errors.

Now, in this regime, where the two variances become equal, we can again equate ΔEs and ΔEr and replace them both with a generic term, ΔE; then the variance can be factored from equation 45-77:

$$\mathrm{Var}\left(\frac{\Delta A}{A}\right) = \left[\left(\frac{1}{T \ln T}\right)^2 + \left(\frac{1}{\ln T}\right)^2\right] \mathrm{Var}\left(\frac{-\Delta E}{E_r + \Delta E}\right) \tag{45-78}$$

so that now, when standard deviations are taken, it can be put into terms of the standard deviation of the expression involving the generic ΔE. However, that only addresses the limiting case. We are interested in the behavior of the standard deviation of ΔA/A in this whole intermediate regime, so that we can determine the optimum sample transmittance, just as we did before, for data measured


in the regime where signal is always much greater than the noise. This also assumes that we can assign a meaning to the word “optimum”, in a situation where the noise is comparable to or even greater than the signal. But that is a philosophical question, which we will not attempt to address here; we want to simply follow where the mathematics lead us. Since we can, however, compute the variances corresponding to the two terms in equation 45-77 for various values of Er, we can plot the family of curves of SD(ΔA/A), with Er as the parameter of the family. Since the two variances are, in the regime of interest, unequal and are multiplied by different functions of T, it is not unreasonable to expect that the minima of those curves corresponding to different members of the family will occur at different values of T. Figure 45-10 presents this family, for values of Er between 3 and 10, and for %T between 0.1 and 0.9. It is clear that there is indeed a family of curves. However, the variation on the ordinate is due mainly to the changes in signal-to-noise ratio as Er decreases. What is of more concern to us here is whether the value of %T at which the curve passes through a minimum changes, and if so how, as Er changes. To this end, the program that computed the curves in Figure 45-10 was modified: instead of simply computing the values of variance, it also computed the derivative (estimated as the first difference) of those curves, and then solved for the value at which the derivative was zero, for the various values of Er. The results are shown in Figure 45-11. It is obvious that for values of Er greater than five (standard deviations of the noise), the optimum transmittance remains at the level we noted previously, 33 %T. When the reference energy level falls below five standard deviations, however, the “optimum” transmittance starts to decrease.
The erratic nature of the variance at these low values of Er, however, makes it difficult to ascertain the exact amount of falloff with any degree of precision. It is nevertheless clear that, to the extent we can talk about an optimum transmittance level under these conditions, where the variance can become infinite and the actual transmittance value itself is affected, it decreases at such low values of Er. Nevertheless, a close look reveals that when

[Plot: SD(ΔA)/A (0.00 to 12.00) versus %T (0.1 to 0.86). Curves labeled from Er = 3 to Er = 10.]

Figure 45-10 Family of curves for SD(ΔA/A) for different values of Er. (see Color Plate 14)

[Two-panel plot. Upper panel, “Optimum transmittance using 5,000 values in variance computation”: optimum %T (0 to 0.40) versus Er (0.00 to 10.0). Lower panel, “Optimum transmittance using 100,000 values in variance computation”: optimum %T (0 to 0.40) versus Er (0.00 to 9.60).]

Figure 45-11 Optimum transmittance as a function of Er.

Er has dropped to five standard deviations, the optimum transmittance has dropped to 3.2, and then drops off quickly below that value. Surprisingly, the optimum value of transmittance appears to reach a minimum value, and then increase again as Er continues to decrease. It is not entirely clear whether this is simply appearance or actually reflects the correct description of the behavior of the noise in this regime, given the unstable nature of the variance values upon which it is based. In fact, originally these curves were computed only for values of Er equal to or greater than three due to the expectation that no reasonable results could be obtained at lower values of Er . However, when the unexpectedly smooth decrease in the optimum value of %T was observed down to that level, it seemed prudent to extend the calculations to still lower values, whereupon the results in Figure 45-11 were obtained. Verifying the nature of the curve for at least two sets of variances, calculated from different numbers of random values, was necessary in light of the larger values of

[Two-panel plot. Upper panel, “Variances using 5,000 and 100,000 values”: variance (0 to 20,000) versus Er (3.00 to 9.65). Lower panel, “Expansion of plot”: variance (0.00 to 0.20) versus Er (3.00 to 9.65). Curves: Er term, 100,000 values; Es term, 100,000 values; 5,000 values.]

Figure 45-12 Values of the variances in the two terms of equation 45-77, using different numbers of values. (see Color Plate 15)

variance for the two terms of equation 45-77 encountered when more values were included in the calculation, as described above. However, as Figure 45-12 shows, at even moderate values of Er, all the calculated values of the variance converge. From Figure 45-12, it appears that once the signal level has fallen low enough to include zero with non-negligible probability, the optimum transmittance varies randomly between zero and a well-defined upper limiting value. This upper limit varies in a well-defined manner, from 0.3 at large values of signal as we saw previously, through a minimum at roughly 2.5 standard deviations above zero. In fact, while it does not seem possible to observe this directly, comparing Figure 45-12 with the results we found for the maximum value for computed transmittance under high-noise conditions (see Figure 45-6 and the discussion of that) suggests that it would not be surprising if the minimum actually occurred when the signal was 2.11 standard deviations above zero.
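The search for the optimum transmittance can be illustrated with a short sketch based on the form of equation 45-77. This is a simplified stand-in for the procedure in the text, not the authors' program: it locates the minimum by a direct grid search rather than by a first-difference derivative, and the names `relative_noise` and `optimum_transmittance`, together with the `var_es` and `var_er` parameters (the variances of the two terms, supplied by the caller), are hypothetical.

```python
import math


def relative_noise(t, var_es=1.0, var_er=1.0):
    """SD(dA/A) from the form of equation 45-77, for a transmittance 0 < t < 1."""
    log_t = math.log(t)
    return math.sqrt(var_es / (t * log_t) ** 2 + var_er / log_t ** 2)


def optimum_transmittance(var_es=1.0, var_er=1.0, step=1e-4):
    """Grid search for the transmittance that minimizes the relative noise."""
    n_points = int(round(1.0 / step))
    grid = [i * step for i in range(1, n_points)]  # open interval (0, 1)
    return min(grid, key=lambda t: relative_noise(t, var_es, var_er))


if __name__ == "__main__":
    # Equal variances correspond to the low-noise limit of equation 45-78.
    print(optimum_transmittance())
    # Unequal variances (illustrative numbers only) shift the minimum.
    print(optimum_transmittance(var_es=1.0, var_er=1.5))
```

With equal variances the minimum falls near T = 0.33, consistent with the 33 %T figure quoted in the text. Making the reference term's variance larger than the sample term's moves the minimum toward lower transmittance, which is the direction of the falloff described above.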


The overall conclusion of all this work is that it is surely unfortunate that the effect of noise in the reference reading was not considered for lo these many a year, since that is where all the action seems to be. We continue in our next chapter by considering a special case of constant noise, with characteristics that give somewhat different results than the ones we have obtained here.

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 15(10), 24–25 (2000).
2. Mark, H. and Workman, J., Spectroscopy 15(11), 20–23 (2000).
3. Mark, H. and Workman, J., Spectroscopy 15(12), 14–17 (2000).
4. Mark, H. and Workman, J., Spectroscopy 16(2), 44–52 (2001).
5. Mark, H. and Workman, J., Spectroscopy 16(4), 34–37 (2001).

46 Analysis of Noise: Part 7

This chapter is the continuation of Chapters 40–45, first published as papers [1–6], dealing with the rigorous derivation of the expressions relating instrument (and other) noise to its effects on the spectra we observe. Our first chapter in this set was an overview; since then we have been analyzing the effect of noise on spectra, when the noise is constant detector noise, that is, noise that is independent of the strength of the optical signal. As we do in each chapter in this section of the book, we take this opportunity to note that we are dealing with a continuous set of chapters, and so we again continue our discussion by continuing our equation numbering, figure numbering, use of symbols, and so on as though there were no break. We left off in Chapter 45 with having found an expression for the optimum value of transmittance, in situations where the noise is large compared to the signal (or, alternatively, where the signal is small enough to be comparable to the noise), a regime we have investigated for the previous three chapters. Most of the derivations and mathematical analyses we have done so far have been very general, applying to any and all types of noise that might be superimposed on the spectral signal, as long as the noise level was constant and independent of the signal level. Stating it somewhat more rigorously, we assumed that regardless of the signal level, the noise contribution to each measured value represented a random sample taken from a fixed population of such values. In particular, for the most part we made no assumptions about the distribution of the values in the population of the noise readings.
In Chapters 43–45 [6], however, we found it necessary to introduce the assumption that the noise was Normally distributed, in order to be able to determine the expected value for the average transmittance and for the expected standard deviation of the noise level in the case where the signal level was small enough to be comparable to the noise. The Normal distribution is, of course, an important and a common distribution to solve for in this development, but there is another important case where a noise contribution also has a constant standard deviation (i.e., independent of the signal level) but does not have a Normal distribution. These days, this contribution is probably almost as common as the ones having the Normal distribution, although it is not as obvious. Also, it is arguably less important than the other contributions, one reason being that it usually (at least in well-designed instruments) will be swamped out by the other noise sources, and therefore rarely observed. Nevertheless, this contribution does exist and therefore is worthy of being treated in this compilation of the effects of noise, if only for the purpose of completeness. This source of noise is not usually called noise; in most technical contexts it is more commonly called “error” rather than noise, but that is just a label; since it is a random contribution to the measured signal, it qualifies as noise just as much as any other noise source. So what is this mystery phenomenon? It is the quantization noise introduced by the analog-to-digital (A/D) conversion process, and is engendered by the fact that for


any analog signal with a value between two adjacent levels that the A/D converter can assign, the difference between the actual value of the electrical voltage and the value represented by the assigned digital value is an error, or noise, and the distribution of this error is uniform. In the past, when instruments were not computer-controlled and all signal processing was done using analog circuits, digitization was not an important consideration. Nowadays, however, since almost all instruments use computerized data collection, this noise source is far more common, and therefore more important, than it used to be.

The situation is illustrated in Figure 46-13. The actual voltage is a continuous, linear physical phenomenon. The values represented by the output of the A/D converter, however, can only take discrete levels, as illustrated. The double-headed arrows represent the error introduced by digitizing the continuous physical voltage at various points. The error cannot be greater than half the difference between the voltages represented by adjacent levels of the converter; if the voltage moves more than half a step away from one level, the conversion will return the next level instead. Furthermore, if the sampling point is random with respect to the A/D conversion levels, as happens, for example, with any varying signal, then the actual voltage at the sampling point can be anywhere between two adjacent levels with equal probability, and therefore the error (or noise) introduced will be uniformly distributed between +1/2 and −1/2 of the step size. This can happen even in the absence of other noise sources, as long as the signal varies, as it would, say, when a source is modulated. In that case the measurement points will have a random relationship to the digitization levels.

This effect could conceivably even become observable as the dominant error source, if the instrument has an extremely low noise level (a favorable case) or too-large differences between A/D levels due to the A/D converter having too few bits (an unfavorable case).
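The uniform character of the digitization error is easy to check numerically. The following is a minimal sketch, not part of the original derivation: it assumes an idealized converter that rounds to the nearest level, a step size of one unit, and voltages placed at random over a hypothetical full-scale range of 1000 steps. The function names are ours. It compares the observed standard deviation of the error with the step size divided by √12, which is the standard deviation of a uniform distribution one step wide.

```python
import math
import random
import statistics

def quantize(voltage, step=1.0):
    """Return the nearest level an idealized A/D converter could assign."""
    return step * round(voltage / step)

def quantization_errors(n_samples=100_000, step=1.0, full_scale=1000.0, seed=0):
    """Digitize randomly placed voltages and return the conversion errors."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_samples):
        # The sampling point is random with respect to the converter levels.
        v = rng.uniform(0.0, full_scale)
        errors.append(quantize(v, step) - v)
    return errors

errors = quantization_errors()
print("largest |error| :", max(abs(e) for e in errors))  # bounded by step/2
print("mean error      :", statistics.fmean(errors))     # close to zero
print("observed SD     :", statistics.pstdev(errors))
print("step / sqrt(12) :", 1.0 / math.sqrt(12))
```

With a different step size the two standard deviations should simply scale together, since the error is always measured in units of the step.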

Figure 46-13 The actual voltage is a continuous, linear function. The values represented by the output of the A/D converter, however, can only take discrete levels. The double-headed arrows represent the error introduced by digitizing the continuous physical voltage at various points. (Ordinate: measured value; abscissa: applied voltage; plot labels: actual voltage, A/D step, error.)


EFFECT OF NOISE ON COMPUTED TRANSMITTANCE

Therefore it is necessary at this point to repeat the investigations we did for Normally distributed noise, but to consider the effect of range-limited, uniformly distributed, noise. We will find that investigating this special case is relatively simple compared to the previous derivations, both because the expressions we find are much simpler than the previous ones and also because we have previously derived much of what we need here, and so can simply start at an appropriate point and continue along the appropriate path. The point in our previous discussions where the distribution of the noise was found to matter was the point at which we had to introduce the distribution of the errors in the first place; all previous discussion, derivations, and so on prior to that were independent of the distribution of the errors. That point was equation 43-60 in Chapter 43 (first published as [4]), where we introduced the weighted average in order to be able to compute the expected value for the measured transmittance, under conditions where the signal was small enough to be comparable to the noise. So let us repeat our previous work, starting at the appropriate point, and investigate both the computed transmittance and the noise of the transmittance, when the noise and signal have comparable magnitudes, but the noise is now uniformly distributed:

$$X_W = \frac{\int W(x)\, f(x)\, dx}{\int W(x)\, dx} \tag{46-60}$$

In the case we investigated there, we had previously derived that the calculated transmittance for an individual reading was

$$f(x) = \frac{E_s}{E_r + \Delta E_r} \tag{46-52a}$$

and in that case, we set the weighting function W(x) to be the Normal distribution. We are now interested in what happens when the weighting function is a uniform distribution. Therefore the formula for the expected value of the mean transmittance, found by using equation 46-52a for f(x) and (1) for W(x) in the interval from −1/2 to +1/2 (and zero outside that interval), becomes

$$T_{WU} = \frac{\int_{-1/2}^{1/2} \frac{E_s}{E_r + \Delta E_r}\,(1)\, d\Delta E_r}{\int_{-1/2}^{1/2} (1)\, d\Delta E_r} \tag{46-79}$$

In equation 46-79, T_WU represents the mean computed transmittance for Uniformly distributed noise and the parenthesized (1) in both the numerator and the denominator is a surrogate for the actual voltage difference between successive values represented by the A/D steps: essentially a normalization factor for the actual physical voltages involved. In any case, if the actual voltage difference were used in equation 46-79, it would be factored out of both the numerator and the denominator integrals, and the two would then cancel. Since the denominator is unity in either case, equation 46-79 now simplifies to

$$T_{WU} = \int_{-1/2}^{1/2} \frac{E_s}{E_r + \Delta E_r}\, d\Delta E_r \tag{46-80}$$


Equation 46-80 is of reasonably simple form; indeed, the evaluation of this integral is considerably simpler than when the noise was Normally distributed. Not only is it possible to evaluate equation 46-80 analytically, it is one of the Standard Forms for indefinite integrals and can be found in integral tables in elementary calculus texts, in handbooks such as the Handbook of Chemistry and Physics and other reference books. The standard form for this integral is

$$\int \frac{1}{a + bx}\, dx = \frac{1}{b} \ln(a + bx)$$

To convert equation 46-80 to its Standard Form, we simply move Es outside the integral, whereupon equation 46-80 becomes

$$T_{WU} = E_s \int_{-1/2}^{1/2} \frac{1}{E_r + \Delta E_r}\, d\Delta E_r \tag{46-81}$$

By setting a = Er and b = 1, the integral of equation 46-81 is

$$T_{WU} = E_s \Big[ \ln(E_r + \Delta E_r) \Big]_{-1/2}^{1/2} \tag{46-82}$$

On setting Es = TEr and expanding equation 46-82 by substituting the limits of integration:

$$T_{WU} = T E_r \ln\left| E_r + \tfrac{1}{2} \right| - T E_r \ln\left| E_r - \tfrac{1}{2} \right| \tag{46-83}$$

From equation 46-83 we see that the expectation for the measured value of T_WU is proportional to the true value of T (i.e., Es/Er), multiplied by a factor that is a function of Er. Figure 46-14 presents this function. Just as the expected value for transmittance (T_W) in the case of Normally distributed noise went through a maximum, so too does the expected value for uniformly distributed noise, and the multiplier approaches unity at large values of Er, as it should. We note, however, that the value of the function at Er = 0.5 is not a valid value. When Er = 0.5, the argument of the logarithm in the second term of equation 46-83 is zero, and the value of the log becomes undefined. The function approaches an asymptote at Er = 0.5, indicating that its value is mathematically undefined there, even though an actual physical A/D converter will indeed produce one or the other value at that point.

Figure 46-14 Plot of the multiplication factor of equation 46-83 as a function of Er. Abscissa unit is the difference between digitization levels. (Ordinate: multiplication factor, 0 to 2.5; abscissa: Er, 0 to 2.4.)
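As a numerical illustration of the multiplication factor in equation 46-83, the sketch below evaluates the closed form, Er(ln|Er + 1/2| − ln|Er − 1/2|), and compares it with a direct midpoint-rule evaluation of the integral in equation 46-80 taken with T = 1. The function names and the choice of integration rule are ours, and only values of Er above 0.5 are tried, since at or below 0.5 the denominator of the integrand passes through zero inside the integration interval.

```python
import math

def multiplier_closed_form(er):
    """Factor multiplying T in equation 46-83, with Er in A/D steps."""
    return er * (math.log(abs(er + 0.5)) - math.log(abs(er - 0.5)))

def multiplier_numeric(er, n=100_000):
    """Midpoint-rule estimate of Er * integral of 1/(Er + d), d in [-1/2, 1/2]."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        d = -0.5 + (i + 0.5) * h
        total += 1.0 / (er + d)
    return er * total * h

for er in (0.6, 1.0, 2.0, 5.0, 20.0):
    print(f"Er = {er:5.1f}  closed form = {multiplier_closed_form(er):.5f}  "
          f"numeric = {multiplier_numeric(er):.5f}")
```

The two columns should agree closely, and the factor should fall toward unity as Er grows, which is the large-signal behavior described above.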

COMPUTED TRANSMITTANCE NOISE

Here again, our task is simplified by the two facts we have mentioned above: first, that we can reuse many of the results we obtained previously for the case of Normally distributed noise, and second, that the characteristics of uniformly distributed noise simplify the mathematical analysis. Our first step in this analysis starts with equation 44-71, which we derived previously in Chapter 44 (referenced as [5]) as a general description of noise behavior:

$$\mathrm{SD}(\Delta T) = \sqrt{\frac{1}{E_r^2} + \frac{E_s^2}{E_r^4}}\; \mathrm{SD}(\Delta E) \tag{44-71, from Chapter 44}$$

In our previous development, we presented a family of curves, corresponding to different values of SD(ΔE). In the case of uniformly distributed noise, which is of necessity contained within a limited range of values, the well-known fact that the standard deviation of the noise equals the range/√12 ([7], p. 146) helps us, in that it requires only one curve to display, rather than a family of curves. For this case, then, equation 44-71 becomes equation 46-84:

$$\mathrm{SD}(\Delta T) = \frac{1}{\sqrt{12}} \sqrt{\frac{1}{E_r^2} + \frac{E_s^2}{E_r^4}} \tag{46-84}$$

where the unit of measure for Es and Er is the digitization interval of the A/D converter. We forbear plotting this function since it is simply one of the family we have presented previously in Chapter 44, as Figures 44-1 and 44-3 (referenced in [5]). Similarly, in Chapter 44, we have previously derived the absorbance noise and relative absorbance noise, and presented those as equations 44-24 and 44-77, respectively.

$$\mathrm{Var}\left(\frac{\Delta A}{A}\right) = \left(\frac{1}{T \ln T}\right)^2 \mathrm{Var}\left(\frac{\Delta E_s}{E_r + \Delta E_r}\right) + \left(\frac{1}{\ln T}\right)^2 \mathrm{Var}\left(\frac{-\Delta E_r}{E_r + \Delta E_r}\right) \tag{44-77}$$

In order to evaluate equation 44-77 it is necessary to assume a distribution for the variability of Es and Er, and in the earlier chapter the distribution used was the Normal distribution; here, therefore, we want to evaluate this function for the case of a uniform distribution. We note here that much of the discussion in the earlier chapter concerning the evaluation of equation 44-77 applies now as well, so it behooves the reader to review the procedures used there, and also in Chapter 45, immediately preceding this one (first published as [6]), since we will apply those procedures again, with the difference that we will use a uniform distribution for the variability of the noise terms.

Figures 44-6 and 44-1 from our Chapter 44 (referenced as [5]) are unchanged, since they do not depend on the distribution of the errors. The figure corresponding to Figure 45-9 (which appeared in Chapter 45 [6]) that was calculated for Normally distributed noise is Figure 46-15, which presents the results of calculating the variance of the two terms of equation 44-77 for uniformly distributed noise instead. We note that while these terms follow the same trends as the Normally distributed errors, they do not become appreciable until Er has fallen below 0.6, which corresponds to the point where values of the denominator, Er + ΔEr, occur close to or less than zero. For values of Er below 0.6 the values of both terms of equation 44-77 become very large and erratic.

Figure 46-15 Values of the variance of ΔEr/(Er + ΔEr) and ΔEs/(Er + ΔEr) for various values of Er, with a uniform distribution of values for the errors. (Chart title: Variances for uniformly distributed noise; ordinate: variance, 0 to 2.0; abscissa: Er, 0 to 9.9.)

Following along the developments in Chapter 45, we find that the plot of ΔA/A depends on T, but the variance terms that depend on Er as the parameter are essentially independent of T. Therefore we expect that the plots of ΔA/A as a function of T will result in a family of curves similar to what we found in Figure 45-11, but different in the values of ΔA/A. However, Figure 45-11 shows only the net result of seeking the minimum of the function; it does not reveal the nature of the curves contributing to the erratic behavior of the minimum. Therefore, we now present a set of the curves for which the minimum can be found, in Figure 46-16. We see in Figure 46-16A that the behavior of the curve of ΔA/A is systematic when Er is large enough for the variance to remain small, while Figure 46-16B shows how the erratic behavior of the two standard deviation terms in equation 44-77 results in a set of curves that form a family, but an erratic family rather than a well-ordered and well-behaved family.

At this point we have completed our analysis of spectral noise for the case where the noise is constant (or at least independent of the signal level). Having completed this part of the analyses originally proposed in Chapter 40 (referenced as [1]) we will continue by doing a similar analysis for a complicated case.
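The behavior of the two variance terms of equation 44-77 under uniform noise can be explored with a small Monte Carlo sketch. This is not the procedure used to generate the figures in this chapter. It assumes that ΔEs and ΔEr are independent and uniformly distributed between −1/2 and +1/2 of a digitization step, and the function name, sample count and seed are arbitrary choices of ours.

```python
import random
import statistics

def term_variances(er, n_samples=50_000, seed=0):
    """Sample variances of dEs/(Er + dEr) and dEr/(Er + dEr) for uniform noise.

    dEs and dEr are drawn independently from U(-1/2, +1/2); Er is expressed
    in units of the digitization step.
    """
    rng = random.Random(seed)
    s_term, r_term = [], []
    for _ in range(n_samples):
        d_es = rng.uniform(-0.5, 0.5)
        d_er = rng.uniform(-0.5, 0.5)
        denom = er + d_er
        if denom == 0.0:
            continue  # cannot occur for Er > 0.5; guards smaller Er values
        s_term.append(d_es / denom)
        r_term.append(d_er / denom)
    return statistics.pvariance(s_term), statistics.pvariance(r_term)

for er in (10.0, 5.0, 2.0, 1.0, 0.8, 0.6):
    var_s, var_r = term_variances(er)
    print(f"Er = {er:4.1f}  Var(dEs term) = {var_s:.5f}  Var(dEr term) = {var_r:.5f}")
```

Under these assumptions both variances stay small while Er is several steps or more and rise steeply as Er approaches 0.5, where the denominator can come arbitrarily close to zero.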


(a)

         Er = 0.1      Er = 0.2      Er = 0.3      Er = 0.4      Er = 0.5      Er = 0.6
0.001    1.07737E+08   2.68976E+03   9.07148E+02   4.86867E+02   2.99824E+02   1.96293E+02
0.002    3.32775E+07   8.30808E+02   2.80198E+02   1.50383E+02   9.26091E+01   6.06308E+01
0.003    1.69267E+07   4.22594E+02   1.42524E+02   7.64928E+01   4.71060E+01   3.08401E+01
0.004    1.05393E+07   2.63126E+02   8.87422E+01   4.76280E+01   2.93304E+01   1.92025E+01
0.005    7.32527E+06   1.82886E+02   6.16802E+01   3.31038E+01   2.03861E+01   1.33467E+01
0.006    5.45604E+06   1.36219E+02   4.59413E+01   2.46567E+01   1.51842E+01   9.94105E+00
0.007    4.26147E+06   1.06395E+02   3.58831E+01   1.92585E+01   1.18598E+01   7.76459E+00
0.008    3.44565E+06   8.60277E+01   2.90140E+01   1.55718E+01   9.58951E+00   6.27823E+00
0.009    2.86035E+06   7.14152E+01   2.40858E+01   1.29268E+01   7.96068E+00   5.21184E+00
0.010    2.42412E+06   6.05245E+01   2.04128E+01   1.09555E+01   6.74670E+00   4.41706E+00
0.011    2.08898E+06   5.21577E+01   1.75910E+01   9.44109E+00   5.81408E+00   3.80647E+00
0.012    1.82508E+06   4.55692E+01   1.53690E+01   8.24853E+00   5.07967E+00   3.32566E+00
0.013    1.61296E+06   4.02735E+01   1.35830E+01   7.28997E+00   4.48937E+00   2.93919E+00
0.014    1.43948E+06   3.59426E+01   1.21224E+01   6.50605E+00   4.00662E+00   2.62314E+00
0.015    1.29549E+06   3.23479E+01   1.09100E+01   5.85540E+00   3.60593E+00   2.36081E+00

(b) (Plot; ordinate: Δ(A)/A, 0 to 50000; abscissa: T, 0.001 to 0.973.)

Figure 46-16 The behavior of the family of curves of ΔA/A. Figure 46-16a shows the systematic behavior obtained when Er is greater than 0.2 (in this case 0.2 < Er < 1). Figure 46-16b shows the erratic behavior obtained when Er is less than 0.2, in this case 0.06 < Er < 0.2.

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 15(10), 24–25 (2000).
2. Mark, H. and Workman, J., Spectroscopy 15(11), 20–23 (2000).
3. Mark, H. and Workman, J., Spectroscopy 15(12), 14–17 (2000).
4. Mark, H. and Workman, J., Spectroscopy 16(2), 44–52 (2001).
5. Mark, H. and Workman, J., Spectroscopy 16(4), 34–37 (2001).
6. Mark, H. and Workman, J., Spectroscopy 16(5), 20–24 (2001).
7. Ingle, J. D. and Crouch, S. R., Spectrochemical Analysis (Prentice-Hall, Upper Saddle River, NJ, 1988).


47

Analysis of Noise: Part 8

This chapter further continues the set of chapters 40 through 46, first published as [1–7], dealing with the rigorous derivation of the expressions relating instrument (and other) noise to its effects on the spectra we observe. Our Chapter 40 was an overview; since then we have been analyzing the effect of noise on spectra by considering the case when the noise is constant detector noise, that is, noise that is independent of the strength of the optical signal, which is the typical behavior of detectors for the IR and near-IR. As we do in each chapter in this section of the book, we take this opportunity to note that we are dealing with a continuous set of chapters, and so we again continue our discussion by continuing our equation numbering, figure numbering, use of symbols and so on as though there were no break in the chapters.

However, this chapter differs somewhat from the previous seven chapters in that, as we will see shortly, we will be performing parts of the same derivations all over again. Therefore, when we re-use previously derived equations, we will use the same equation numbers as we did for the original derivation. When we change course from the previous derivation, then we will number the equations starting with the next higher equation number from the last one we used (which we will note was equation 46-84 [7]). This procedure will also allow us to use some of our previous results to save time and space, allowing us to move along somewhat faster without sacrificing either rigor or detail.

We left off in Chapter 46 by noting that we had just about exhausted the topic of the constant-noise (and by implication, a relatively “simple”) case, with the threat to begin discussion of a complicated case. (Not completely exhausted, in fact: there is still more to be said about the constant-noise case, but that is for the future; right now it is time to move on.)
Whether in fact it is more complicated than what we have been discussing remains to be seen; the question of whether something is “complicated” and “difficult” is partially subjective, since it depends on the perceptions of the person doing the evaluating. Something that is “difficult” for one may be “easy” for another because of a better background or more familiarity with the topic. Be that as it may, having decided to move on from the constant-detector-noise case, there remained the question of what to move on TO, that is which of the ten or so types of noise we originally brought up [1] should be tackled next. Tossing a mental coin, the decision was to analyze the case of noise proportional to the square root of the signal. This, as you will recall, is Poisson-distributed noise, characteristic of the noise encountered when the limiting noise source is the shot noise that occurs when individual photons are detected and represent the ultimate sensitivity of the measurement. This is a situation that is fairly commonly encountered, since it occurs, as mentioned previously, in UV-Vis instrumentation as well as in X-ray and gamma-ray measurements. This noise source may also enter into readings made in mass spectrometers, if the detection method includes counting individual ions. We have, in


fact, discussed some general properties of this distribution quite a long time ago (see [8] or p. 175 in [9]).

Now, we are not particular experts in X-ray and gamma-ray spectroscopy (nor mass spectroscopy, for that matter), but our understanding of those technologies is that they are used mainly in emission mode. Even when the exciting source is a continuum source, such as is found when an X-ray tube is used to produce the exciting X-rays for an X-ray Fluorescence (XRF) measurement, the measurement itself consists of counting the X-rays emitted from the sample after the sample absorbs an X-ray from the source. These measurements are themselves the equivalent of single-beam measurements and will thus also be Poisson-distributed in accordance with the basic physics of the phenomenon. The interesting parts occur when we calculate the transmittance (or reflectance) or absorbance of the sample under consideration, and therefore we must take a dual-beam measurement (or, at least, the logically equivalent measurement of sample and reference readings) and compute the transmittance/reflectance or absorbance from those readings. Therefore, while the underlying physics results in the same form of noise characteristic in all those technologies, our results will be applicable mainly to UV-Vis measurements, where the quantity actually of interest is the amount of energy removed from the optical beam by absorption in the sample.

Therefore, for the mathematical development we wish to pursue, we will again assume (as we did for the constant-noise case) that we are measuring transmittance through a clear (non-scattering) solution, and that Beer’s law applies. Examining Ingle and Crouch ([10], p. 152) we find the same situation as we found for constant detector noise: the computed noise of absorbance values does not take into account the effect of the noise of the reference reading.
Hence, we can expect the results of our derivations to differ from the classic values for this situation, as they did for the constant-detector-noise case. We have recently found, however, and it is interesting to note, that in a much more obscure part of the book [10], in Table 6-2, there are expressions for absorbance noise that include terms for the noise of both sample and reference beam readings. The expressions given there are very complicated, since they include the combined effect of several different noise sources. However, since the main discussion in that book does not deal with the broader picture, and since the full expression is relegated to such an obscure part of the book, with no pointer to it in the text, that it is easily missed, we are forced to treat Poisson noise as though it, too, has not been derived for the full situation, despite our finding it in that table. Indeed, the main discussion in Chapter 5 gives expressions and results that, as we shall see, conform to the expressions obtained when the reference noise is neglected.

Also, we just received a last-minute bulletin: one of the authors of [10] has kindly pointed out a typographical error in Table 6-2, so that we might put the matter right. The T within the parentheses in the first expression for s_T should be squared; this will correct an otherwise erroneous result that might be derived from that expression (J.D. Ingle, 2001, personal communication). With this correction, the expression in Table 6-2 results in exactly the same expression we obtained in our own derivation for the constant-noise case [2].

We begin, as we did before, with the basic expression for the transmittance of a sample; since this is a repeat of previous equations we use the same numbers instead of starting with new numbering for the same equations:

$$T = \frac{E_s - E_{0s}}{E_r - E_{0r}} \tag{47-1}$$

and, with the addition of noise affecting the computation of T:

$$T + \Delta T = \frac{E_s + \Delta E_s - E_{0s} + \Delta E_{0s}}{E_r + \Delta E_r - E_{0r} + \Delta E_{0r}} \tag{47-2}$$

At this point we make a slight alteration to what we did previously. Strictly speaking we are being slightly premature here, but the gain in simplification of the equations more than compensates for the slight departure from complete rigor. Since the noise for the pure Poisson case is related to the signal, the noise at zero signal is zero; that is, the noise terms of the zero-signal readings, ΔE0s and ΔE0r, are both zero. Therefore, for this case the net noise terms reduce to the noise of the sample and reference readings alone, ΔEs and ΔEr. With this substitution, we can write equation 47-4 unchanged; however, we must keep in mind the difference in the meaning of these two terms (ΔEs and ΔEr) compared to the meaning in the previous chapters. Hence,

$$T + \Delta T = \frac{E_s - E_{0s} + \Delta E_s}{E_r - E_{0r} + \Delta E_r} \tag{47-4}$$

From this point, up to and including equation 47-17, the derivation is identical to what we did previously. To save time, space, forests and our readers’ patience we forbear to repeat all that here and refer the interested reader to Chapter 41 (referenced as [2]) for the details of those intermediate steps. Here we present only equation 47-17, which serves as the starting point for the departure to work out the noise behavior for the case of Poisson-distributed detector noise:

$$\mathrm{Var}(\Delta T) = \left(\frac{1}{E_r}\right)^2 \mathrm{Var}(\Delta E_s) + \left(\frac{-T}{E_r}\right)^2 \mathrm{Var}(\Delta E_r) \tag{47-17}$$

This is the point at which we must depart from the previous work. At this point in the previous (constant-noise) case we noted that SD(ΔEs) = SD(ΔEr) and therefore we set both of those quantities equal to SD(ΔE). We cannot make this equivalency in this case, since the noise values (or, at least, the expected noise values) will in general NOT be equal except when Es = Er, that is, when the transmittance (or reflectance) of the sample is unity. Poisson-distributed noise, however, has an interesting characteristic: for Poisson-distributed noise, the expected standard deviation of the data is equal to the square root of the expected mean of the data ([11], p. 714), and therefore the variance of the data is equal (and note, that is equal, not merely proportional) to the mean of the data. Therefore we can replace Var(ΔEs) with Es in equation 47-17 and Var(ΔEr) with Er:

$$\mathrm{Var}(\Delta T) = \left(\frac{1}{E_r}\right)^2 E_s + \left(\frac{-T}{E_r}\right)^2 E_r \tag{47-85}$$

The next transformation we are going to have to do in really tiny little baby steps, lest we be accused of doing something illegal to equation 47-85:

$$\mathrm{Var}(\Delta T) = \frac{E_s}{E_r^2} + \frac{E_r T^2}{E_r^2} \tag{47-86}$$

$$\mathrm{Var}(\Delta T) = \frac{T}{E_r} + \frac{T^2}{E_r} \tag{47-87}$$

And upon converting variance to standard deviation:

$$\mathrm{SD}(\Delta T) = \sqrt{\frac{T + T^2}{E_r}} \tag{47-88}$$
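Equation 47-88 can be compared against a direct simulation of counting data. The sketch below is illustrative only: it assumes independent Poisson-distributed sample and reference counts with means T·Er and Er, uses Knuth's multiplication method as a simple Poisson sampler (adequate for modest means such as the hypothetical Er = 100 used here), and all function names, the sample count and the seed are our own choices.

```python
import math
import random
import statistics

def sd_t_poisson(t, er):
    """Equation 47-88: SD of the computed transmittance for Poisson noise."""
    return math.sqrt((t + t * t) / er)

def poisson_sample(mean, rng):
    """Draw one Poisson-distributed count (Knuth's multiplication method)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def simulate_sd_t(t, er, n_samples=3000, seed=0):
    """SD of Es/Er over repeated simulated sample and reference readings."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_samples):
        es_counts = poisson_sample(t * er, rng)
        er_counts = poisson_sample(er, rng)
        if er_counts > 0:  # a zero reference count gives no usable ratio
            ratios.append(es_counts / er_counts)
    return statistics.pstdev(ratios)

er = 100.0
for t in (0.1, 0.5, 1.0):
    print(f"T = {t:3.1f}  eq. 47-88: {sd_t_poisson(t, er):.4f}  "
          f"simulated: {simulate_sd_t(t, er):.4f}")
```

At this signal level the simulated values should fall within a few percent of the formula; some residual difference is expected, since the derivation neglects ΔEr in comparison with Er.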

Compare equation 47-87 for Poisson noise with equation 47-18, or equation 47-88 with equation 47-19, as we derived for constant detector noise [2]. Equation 47-88 has also been previously derived by Voigtman, it turns out [12], in the course of his simulation studies. We note that now, instead of varying over a relative range of 1 to √2, the noise will vary over a range of zero to √2 as the sample transmittance varies from zero to unity. What is even more interesting is that nowhere in equation 47-88 is there a term representing the S/N (or N/S) ratio, as we found in equation 47-19. This is because the noise level of a detector with Poisson-distributed noise is predetermined by the signal level, and was implicitly introduced when we substituted Es and Er for Var(ΔEs) and Var(ΔEr) in equation 47-85. Therefore the shape of the transmittance noise curve as a function of sample transmittance is constant (as it was for the case of constant noise). However, as equation 47-88 shows, the value of the noise is scaled by the reference signal, and varies inversely with the square root of the reference signal. We present the curve of SD(ΔT) as a function of T in Figure 47-17.

From Figure 47-17 we note several ways in which the behavior of the transmittance noise for the Poisson-distributed detector noise case differs from the behavior of the constant-noise case. First we note, as we did above, that at T = 0 the noise is zero, rather than unity. This justifies our earlier step of setting the zero-signal noise terms to zero for both the sample and the reference readings. Second, we note that the curve is convex upward rather than concave upward. Third, we note that for values of T greater than roughly 0.25, the curve appears almost linear, at least to the eye.
This is a consequence of the fact that, at small values of T, the square of T inside the radical becomes negligible compared to T, causing the overall value of the curve to be roughly proportional to √T, while at large values of T, the square term dominates, causing the overall value of the curve to be roughly proportional to √(T²), or, in other words, roughly proportional to T.

Figure 47-17 Standard deviation of ΔT as a function of T. (Chart title: Poisson-distributed transmittance noise; ordinate: relative noise, 0 to 1.6; abscissa: %T, 0 to 0.99.)

Another issue to bring up is the question of units. In the case of constant noise, as expressed by equation 47-19, T was dimensionless, being a ratio of two numbers (Es and Er) with the same units, whatever those units might be, and the other term in equation 47-19, SD(ΔEr)/Er, is also a ratio of two numbers with the same units. In equation 47-88, on the other hand, T is still dimensionless, but Er is not dimensionless; since it is a measurement, it must have units. The question of the units of Er brings us to an important caveat concerning the interpretation of equation 47-88 and Figure 47-17.

First, to answer the question of units, we recall that the Poisson distribution applies to measurements for X-ray, UV, and visible detectors, and the reason that distribution applies is that it is the distribution describing the behavior of the number of discrete events occurring in a given time interval; the actual data, then, is the number of counts occurring during the measurement time. The unit of Er, then, is the absolute number of counts, and this brings us to our caveat. Equation 47-88 and Figure 47-17 are presented as describing a continuous series of values, and if Er is sufficiently large (large enough that a change of 1 count is small compared to the total number of counts), these equations and figures are a good approximation to a continuum. However, suppose Er is small. Let us pick a small number and see what happens: let us say Er is five. That means that the reference reading is five counts. Now it is immediately clear that we simply cannot have any arbitrary value of T along the X-axis of Figure 47-17. Since Es can take only integer values (0, 1, 2, 3, …), T can take only the discrete values 0, 0.2, 0.4, 0.6, 0.8, and unity, since you cannot have a fraction of a count as data.
For those values of T, Figure 47-17 will provide an accurate measure of the expected value for SD(ΔT), but not necessarily the actual value you will measure in any particular measurement. This is a result of the randomness inherent in the measurement and the discreteness of the measurement of Es as well as Er. We discussed these issues a long time ago, when our series was still called “Statistics in Spectroscopy” rather than its current appellation of “Chemometrics in Spectroscopy”; we recommend that our readers go back and reread those columns, or the book that they were collected into [9], or any good book about elementary Statistics.

Another consequence of the behavior of the Poisson distribution is that for small values of Er, the N/S ratio becomes large, to the point where values of T appreciably greater than unity may be measured. For example, if Er = 5 as we presented just above, the standard deviation of Er can be calculated as SD(ΔEr) = 2.23. Given a ±2 standard deviation range, we can expect (truncating to the nearest integer) that values of Es (when T = 1) as high as 5 + 2 × 2.23 = 5 + 4 = 9 counts will be observed, corresponding to a calculated value of T = 9/5 = 1.8.

Furthermore, one of the steps taken during the omitted sequence between equation 47-4 and equation 47-17 was to neglect ΔEr compared to Er. Clearly this step is also only valid for large values of Er, both for the case of constant detector noise and for the current case of Poisson-distributed detector noise. Therefore, from both of these considerations, it is clear that equation 47-88 and Figure 47-17 should be used only when Er is sufficiently large for the approximation to apply. Therefore our caveats. Equation 47-88 and Figure 47-17 are best reserved for cases of high signal, where the continuum approximation will be valid.
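The arithmetic of the small-count example can be laid out explicitly. The sketch below assumes, for illustration, a reference reading of five counts and a sample reading that is also a whole number of counts; the helper names are ours.

```python
import math

def allowed_transmittances(er_counts, max_es=None):
    """Transmittance values obtainable when Es and Er are whole counts."""
    if max_es is None:
        max_es = er_counts
    return [es / er_counts for es in range(max_es + 1)]

def two_sigma_upper_count(mean_counts):
    """Largest whole count within +2 SD of a Poisson mean (SD = sqrt(mean))."""
    return math.floor(mean_counts + 2.0 * math.sqrt(mean_counts))

er = 5
print(allowed_transmittances(er))      # [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
upper = two_sigma_upper_count(er)
print(upper, upper / er)               # 9 1.8
```

The spacing between obtainable values of T is 1/Er, so the continuum picture improves in direct proportion to the number of reference counts.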


Now that we have completed our expository interlude, we continue our derivation along the same lines we did previously. The next step, as it was for the constant-noise case, is to derive the absorbance noise for Poisson-distributed detector noise as we previously did for constant detector noise. As we did above in the derivation of transmittance noise, we start by repeating the definition and the previously derived expressions for absorbance [3].

$$A = -\log(T) \tag{47-20a}$$

$$A = -0.4343 \ln(T) \tag{47-20b}$$

We take the derivative

$$dA = -0.4343\, \frac{dT}{T} \tag{47-21}$$

and substitute the expressions for T (47-6) and dT, replacing the differentials by finite differences so that we can use the expression for ΔT found previously (J.D. Ingle, 2001, personal communication):

$$\Delta A = -0.4343\, \frac{\dfrac{E_r\, \Delta E_s}{E_r (E_r + \Delta E_r)} - \dfrac{E_s\, \Delta E_r}{E_r (E_r + \Delta E_r)}}{\dfrac{E_s}{E_r}} \tag{47-22}$$

Again in the interests of saving time, space, and so on, we skip over the repetition of the intermediate steps between equation 47-22 and equation 47-29:

$$\mathrm{Var}(\Delta A) = \left(\frac{-0.4343}{E_s}\right)^2 \mathrm{Var}(\Delta E_s) + \left(\frac{0.4343}{E_r}\right)^2 \mathrm{Var}(\Delta E_r) \tag{47-29}$$

And again our departure from the derivation for the constant detector noise case is to note and use the fact that for Poisson-distributed noise, Var(ΔEr) = Er and Var(ΔEs) = Es:

$$\mathrm{Var}(\Delta A) = \left(\frac{-0.4343}{E_s}\right)^2 E_s + \left(\frac{0.4343}{E_r}\right)^2 E_r \tag{47-89}$$

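And simplifying as we did above:

$$\mathrm{Var}(\Delta A) = \frac{0.4343^2\, E_s}{E_s^2} + \frac{0.4343^2\, E_r}{E_r^2} \tag{47-90}$$

$$\mathrm{Var}(\Delta A) = \frac{0.4343^2}{E_s} + \frac{0.4343^2}{E_r} \tag{47-91}$$

and since T = Es/Er, we solve for Es = TEr and substitute this into equation 47-91:

$$\mathrm{Var}(\Delta A) = \frac{0.4343^2}{T E_r} + \frac{0.4343^2}{E_r} \tag{47-92}$$

and factor out 0.4343²/Er:

$$\mathrm{Var}(\Delta A) = \frac{0.4343^2}{E_r} \left(\frac{1}{T} + 1\right) \tag{47-93}$$

and upon taking square roots:

$$\mathrm{SD}(\Delta A) = \frac{0.4343}{\sqrt{E_r}} \sqrt{\frac{1}{T} + 1} \tag{47-94, for Poisson noise}$$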

Again we can compare the expression in equation 47-94 with the equivalent expression for the constant detector noise case, which starts with equation 42-32, also equation 47-32 [3].

$$\mathrm{SD}(\Delta A) = 0.4343\, \mathrm{SD}(\Delta E) \sqrt{\frac{1}{E_r^2} + \frac{1}{E_s^2}} \tag{47-32, for constant noise}$$

It is instructive to put equation 47-32 into a form similar to that of equation 47-94 (for Poisson noise) by replacing Es with TEr:

$$\mathrm{SD}(\Delta A) = 0.4343\, \mathrm{SD}(\Delta E) \sqrt{\frac{1}{T^2 E_r^2} + \frac{1}{E_r^2}} \tag{47-95, for constant noise}$$

$$\mathrm{SD}(\Delta A) = 0.4343\, \frac{\mathrm{SD}(\Delta E)}{E_r} \sqrt{\frac{1}{T^2} + 1} \tag{47-96, for constant noise}$$

Thus, in the constant-noise case the absorbance noise is again proportional to the N/S ratio, although this is clearer now than it was in the earlier chapter; there, however, we were interested in making a different comparison. The comparison of interest here, of course, is the way the noise varies as T varies, which is immediately seen by comparing the expressions in the radicals in equations 47-94 (for Poisson noise) and 47-96 (for constant noise). Also, as equation 47-94 shows, the absorbance noise is again inversely proportional to the square root of the reference signal, as was the transmittance noise. And once again we remind our readers of the caveats under which equation 47-94 is valid. We present the variation of absorbance noise for the two cases (equations 47-94 and 47-96, corresponding to the Poisson noise and constant noise cases) in Figure 47-18. While both curves diverge to infinity as the transmittance → 0 (and the absorbance → ∞), the curve for constant detector noise clearly does so more rapidly, at all transmittance levels. Again, we continue our derivations in our next chapter.
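The comparison of the two radicals can be tabulated directly. The sketch below evaluates only the T-dependent parts of equations 47-94 and 47-96, that is √(1/T + 1) and √(1/T² + 1), leaving out the scale factors 0.4343/√Er and 0.4343·SD(ΔE)/Er; the function names are ours.

```python
import math

def poisson_shape(t):
    """Radical of equation 47-94: sqrt(1/T + 1)."""
    return math.sqrt(1.0 / t + 1.0)

def constant_shape(t):
    """Radical of equation 47-96: sqrt(1/T**2 + 1)."""
    return math.sqrt(1.0 / (t * t) + 1.0)

print("   T    Poisson  constant")
for t in (1.0, 0.8, 0.6, 0.4, 0.2, 0.1):
    print(f"{t:5.2f}  {poisson_shape(t):7.3f}  {constant_shape(t):8.3f}")
```

Both expressions equal √2 at T = 1 and grow without limit as T falls, with the constant-noise expression always the larger of the two for T below unity.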

[Figure 47-18: plot titled "Absorbance noise". Ordinate: relative absorbance noise (0 to 12). Abscissa: %T, with tick labels running from 1 down to 0. Two curves, labelled "Constant noise" and "Poisson noise".]

Figure 47-18 Comparison between absorbance noise for the constant-detector noise case and the Poisson-distributed detector noise case. Note that we present the curves only down to T = 0.1, since they both asymptotically → ∞ as T → 0, as per equations 47-94 and 47-96.

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 15(10), 24–25 (2000).
2. Mark, H. and Workman, J., Spectroscopy 15(11), 20–23 (2000).
3. Mark, H. and Workman, J., Spectroscopy 15(12), 14–17 (2000).
4. Mark, H. and Workman, J., Spectroscopy 16(2), 44–52 (2001).
5. Mark, H. and Workman, J., Spectroscopy 16(4), 34–37 (2001).
6. Mark, H. and Workman, J., Spectroscopy 16(5), 20–24 (2001).
7. Mark, H. and Workman, J., Spectroscopy 16(7), 36–40 (2001).
8. Mark, H. and Workman, J., Spectroscopy 5(3), 55–56 (1990).
9. Mark, H. and Workman, J., Statistics in Spectroscopy, 1st ed. (Academic Press, New York, 1991).
10. Ingle, J.D. and Crouch, S.R., Spectrochemical Analysis (Prentice-Hall, Upper Saddle River, NJ, 1988).
11. Hald, A., Statistical Theory with Engineering Applications (John Wiley & Sons, Inc., New York, 1952).
12. Voigtman, E., Analytical Instrumentation 21(1&2), 43–62 (1993).

48

Analysis of Noise: Part 9

We keep learning more about the history of noise calculations. It seems that the topic of the noise of a spectrum in the constant-detector-noise case was addressed more than 50 years ago [1]. Not only that, but it was done while taking into account the noise of the reference readings. The calculation of the optimum absorbance value was performed using several different criteria for “optimum”. One of these criteria, which Cole called the Probable Error Method, gives the same result that we obtained for the optimum transmittance value, 32.99%T [2].

Cole’s approach, however, had several limitations. The main one, from our point of view, is that he directed his equations to represent the absorbance noise as early as possible in his derivation. Thus his derivation, as well as virtually all the ones since then, bypassed consideration of the behavior of the noise of transmittance spectra. This, coupled with the fact that the only place we have found that presented an expression for transmittance noise had a typographical error, as we reported in our previous column [3], means that, as far as we know, the correct expression for the behavior of transmittance noise has never previously been reported in the literature. On the other hand, we do have to draw back a bit and admit that the correct expression for the optimum transmittance has been reported. Not only that, but Cole points out and laments that, at that time, other scientists were already using the incorrect formulas for noise behavior. That means that the same situation that exists now existed over 50 years ago, and in all the intervening time has not been corrected. This, perhaps, explains why the incorrect theory is still being used today. We can only hope that our efforts are more successful in persuading both the practitioners and teachers of spectroscopic theory to use the more exact formulations we have developed.
Getting back to the current state of the columns, this column is one more in the set [2–9] dealing with the rigorous derivation of the expressions relating instrument (and other) noise to its effects on the spectra we observe. The impetus for this was the realization that the previously existing theory was deficient in that the derivations extant ignored the effect of noise in the reference reading, which turns out to have appreciable effects on the nature of the derived noise behavior. Our first chapter in this set [4] was an overview; the next six examined the effects of noise when the noise was due to constant detector noise, and the last one on the list is the first of the chapters dealing with the effects of noise when the noise is due to detectors, such as photomultipliers, that are shot-noise-limited, so that the detector noise is Poisson-distributed and therefore the standard deviation of the noise equals the square root of the signal level. We continue along this line in the same manner we did previously: by finding the proper expression to describe the relative error of the absorbance, which by virtue of Beer’s law also describes the relative error of the concentration as determined by the spectrometric readings, and from that determining the value of transmittance a sample should have in order to optimize the analysis, in the sense that the relative error of the concentration is minimized.

As we do in each chapter in this section of the book, we take this opportunity to note that we are dealing with a continuous set of chapters, and so we again continue our discussion by continuing our equation numbering, figure numbering, use of symbols, and so on as though there were no break, except that when we repeat an equation or series of equations that were derived and presented previously, we retain the original number(s) for those equation(s). So let us continue.

We now wish to generate the expression for the relative error of the absorbance, $\Delta A/A$, which we again obtain by using the expression in equation 48-25 for $\Delta A$:

$$\Delta A = \frac{-0.4343\,(E_r\,\Delta E_s - E_s\,\Delta E_r)}{E_s E_r} \tag{48-25}$$

and the expression in equation 42-20b, $A = -0.4343\ln(T)$, for $A$. This results in the same expression we obtained previously, which we present, as usual, without repeating all the intermediate steps:

$$\frac{\Delta A}{A} = \frac{1}{\ln(T)}\left(\frac{\Delta E_s}{E_s} - \frac{\Delta E_r}{E_r}\right) \tag{48-36}$$

We again go through the usual sequence of steps needed to pass to the statistical domain, which we do in detail here since, looking back, we find that we had neglected to present them previously, having felt somewhat rushed. First we take the variance of both sides of equation 48-36:

$$\operatorname{Var}\left(\frac{\Delta A}{A}\right) = \operatorname{Var}\left[\frac{1}{\ln(T)}\left(\frac{\Delta E_s}{E_s} - \frac{\Delta E_r}{E_r}\right)\right] \tag{48-97}$$

$$\operatorname{Var}\left(\frac{\Delta A}{A}\right) = \operatorname{Var}\left[\frac{1}{\ln(T)}\frac{\Delta E_s}{E_s} - \frac{1}{\ln(T)}\frac{\Delta E_r}{E_r}\right] \tag{48-98}$$

Then we apply the theorem that, for independent A and B, Var(A + B) = Var(A) + Var(B):

$$\operatorname{Var}\left(\frac{\Delta A}{A}\right) = \operatorname{Var}\left[\frac{1}{\ln(T)}\frac{\Delta E_s}{E_s}\right] + \operatorname{Var}\left[\frac{-1}{\ln(T)}\frac{\Delta E_r}{E_r}\right] \tag{48-99}$$

And then we apply the theorem that, if a is a constant, then $\operatorname{Var}(aX) = a^2\operatorname{Var}(X)$:

$$\operatorname{Var}\left(\frac{\Delta A}{A}\right) = \frac{1}{\left(E_s\ln(T)\right)^2}\operatorname{Var}(\Delta E_s) + \frac{1}{\left(E_r\ln(T)\right)^2}\operatorname{Var}(\Delta E_r) \tag{48-100}$$

Again we use the property of the Poisson distribution that the variance of a value is equal to the value, so that $\operatorname{Var}(\Delta E_s) = E_s$ and $\operatorname{Var}(\Delta E_r) = E_r$:

$$\operatorname{Var}\left(\frac{\Delta A}{A}\right) = \frac{E_s}{\left(E_s\ln(T)\right)^2} + \frac{E_r}{\left(E_r\ln(T)\right)^2} \tag{48-101}$$

$$\operatorname{Var}\left(\frac{\Delta A}{A}\right) = \frac{1}{\left(\ln(T)\right)^2}\left(\frac{1}{E_s} + \frac{1}{E_r}\right) \tag{48-102}$$

and finally:

$$\operatorname{SD}\left(\frac{\Delta A}{A}\right) = \frac{1}{\ln(T)}\sqrt{\frac{1}{E_s} + \frac{1}{E_r}} \tag{48-103}$$

Interestingly, in Voigtman’s development of these equations, his expression corresponding to equation 48-103 is missing the $1/E_r$ term inside the radical, even though he arrived at the correct equation corresponding to equation 47-88, as we noted in Chapter 47 (referenced as the paper [3]). There are now two ways to proceed with equation 48-103. One way is to replace T in the denominator with $E_s/E_r$, which makes it easier to compare with equation 42-37, the corresponding equation describing the constant-noise case. Alternatively, we can replace $E_s$ in the denominator of equation 48-103 with $TE_r$, which is more convenient for plotting the expression. Since we wish to explore both phenomena, we will do both transformations of equation 48-103. First we replace T in the denominator with $E_s/E_r$:

$$\operatorname{SD}\left(\frac{\Delta A}{A}\right) = \frac{1}{\ln(E_s/E_r)}\sqrt{\frac{E_r}{E_s E_r} + \frac{E_s}{E_s E_r}} \tag{48-104}$$

$$\operatorname{SD}\left(\frac{\Delta A}{A}\right) = \frac{1}{\ln(E_s/E_r)}\sqrt{\frac{E_s + E_r}{E_s E_r}} \tag{48-105}$$

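Equation 48-105 is the closest we can come to the form of equation 42-37, and so allows us to compare the function describing the relative precision for the constant-noise case with that for the Poisson-noise case. To put equation 48-103 into a form easier to plot, we now replace $E_s$ in the denominator of equation 48-103 with $TE_r$:

$$\operatorname{SD}\left(\frac{\Delta A}{A}\right) = \frac{1}{\ln(T)}\sqrt{\frac{1}{TE_r} + \frac{1}{E_r}} \tag{48-106}$$

$$\operatorname{SD}\left(\frac{\Delta A}{A}\right) = \frac{1}{\ln(T)}\sqrt{\frac{1}{E_r}\left(\frac{1}{T} + 1\right)} \tag{48-107}$$

$$\operatorname{SD}\left(\frac{\Delta A}{A}\right) = \frac{1}{\sqrt{E_r}\,\ln(T)}\sqrt{\frac{1}{T} + 1} \tag{48-108}$$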
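As a quick check on the algebra between equations 48-103 and 48-108, the two forms can be evaluated side by side. This is a sketch under stated assumptions: the reference reading of $10^6$ counts and the transmittance values are hypothetical, chosen only for illustration, and the function names are not from the original text.

```python
import math

def rel_sd_from_energies(Es, Er):
    """Equation 48-103 with T = Es/Er: (1/ln T) * sqrt(1/Es + 1/Er)."""
    T = Es / Er
    return math.sqrt(1.0 / Es + 1.0 / Er) / math.log(T)

def rel_sd_from_transmittance(T, Er):
    """Equation 48-108: 1/(sqrt(Er) * ln T) * sqrt(1/T + 1)."""
    return math.sqrt(1.0 / T + 1.0) / (math.sqrt(Er) * math.log(T))

Er = 1.0e6  # hypothetical reference reading, in counts
for T in (0.9, 0.5, 0.1, 0.01):
    a = rel_sd_from_energies(T * Er, Er)
    b = rel_sd_from_transmittance(T, Er)
    print(f"T = {T:5.2f}   eq 48-103: {a:.6e}   eq 48-108: {b:.6e}")
```

Both forms carry the sign of ln(T), so the values come out negative for T below unity. Quadrupling the hypothetical reference reading halves the magnitude, which is the inverse-square-root scaling with $E_r$.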

Qualitatively we can note that equation 48-108 also passes through a minimum, since it will diverge as T → 0 (in the denominator of the radical) and also as T → 1, which causes ln(T) → 0. Again, we see that the actual value of the relative error is scaled inversely with the square root of the reference reading, as it was for both transmittance and absorbance noise. We verify the behavior of equation 48-108 by plotting $\frac{1}{\ln(T)}\sqrt{\frac{1}{T} + 1}$ versus T in Figure 48-19 (actually, we plot $\left|\frac{1}{\ln(T)}\sqrt{\frac{1}{T} + 1}\right|$, for reasons that will be discussed below).

[Figure 48-19: ordinate SD(ΔA)/A (0.5 to 3); abscissa %T, with tick labels running from 0.53 down to 0.]

Figure 48-19 Relative absorbance precision for Poisson-distributed detector noise.

Unsurprisingly, the optimum transmittance (roughly T = 0.11 from the data table used to plot Figure 48-19) differs appreciably from what was found for the corresponding situation when the detector noise was constant. The more interesting and important question, however, is how the value we arrived at compares with the “optimum” obtained from the previously derived expression, which neglected the effect of the noise in the reference reading. To continue, therefore, we proceed in the usual manner for finding a minimum: we take the derivative of equation 48-108 and then set the derivative equal to zero. Since equation 48-108 is complicated, and the derivative more so, we will generate the derivative in several steps:

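$$\frac{d}{dT}\operatorname{SD}\left(\frac{\Delta A}{A}\right) = \frac{1}{\sqrt{E_r}\,\ln(T)}\,\frac{d}{dT}\sqrt{\frac{1}{T} + 1} + \sqrt{\frac{1}{T} + 1}\;\frac{d}{dT}\left(\frac{1}{\sqrt{E_r}\,\ln(T)}\right) \tag{48-109}$$

$$\frac{d}{dT}\operatorname{SD}\left(\frac{\Delta A}{A}\right) = \frac{1}{\sqrt{E_r}\,\ln(T)}\cdot\frac{1}{2}\cdot\frac{1}{\sqrt{\frac{1}{T} + 1}}\,\frac{d}{dT}\left(\frac{1}{T} + 1\right) + \sqrt{\frac{1}{T} + 1}\cdot\frac{1}{\sqrt{E_r}}\,\frac{d}{dT}\left(\frac{1}{\ln(T)}\right) \tag{48-110}$$

$$\frac{d}{dT}\operatorname{SD}\left(\frac{\Delta A}{A}\right) = \frac{1}{2\sqrt{E_r}\,\ln(T)\sqrt{\frac{1}{T} + 1}}\,\frac{d}{dT}\left(\frac{1}{T} + 1\right) + \sqrt{\frac{1}{T} + 1}\cdot\frac{1}{\sqrt{E_r}}\cdot\frac{-1}{\left(\ln(T)\right)^2}\,\frac{d}{dT}\ln(T) \tag{48-111}$$

$$\frac{d}{dT}\operatorname{SD}\left(\frac{\Delta A}{A}\right) = \frac{1}{2\sqrt{E_r}\,\ln(T)\sqrt{\frac{1}{T} + 1}}\cdot\frac{-1}{T^2} + \frac{-\sqrt{\frac{1}{T} + 1}}{\sqrt{E_r}\left(\ln(T)\right)^2}\cdot\frac{1}{T} \tag{48-112}$$

$$\frac{d}{dT}\operatorname{SD}\left(\frac{\Delta A}{A}\right) = \frac{-1}{2T^2\sqrt{E_r}\,\ln(T)\sqrt{\frac{1}{T} + 1}} + \frac{-\sqrt{\frac{1}{T} + 1}}{T\sqrt{E_r}\left(\ln(T)\right)^2} \tag{48-113}$$

It will help our cause to factor out from equation 48-113 what we can:

$$\frac{d}{dT}\operatorname{SD}\left(\frac{\Delta A}{A}\right) = \frac{1}{T\sqrt{E_r}\,\ln(T)}\left[\frac{-1}{2T\sqrt{\frac{1}{T} + 1}} + \frac{-\sqrt{\frac{1}{T} + 1}}{\ln(T)}\right] \tag{48-114}$$

and then combine the terms:

$$\frac{d}{dT}\operatorname{SD}\left(\frac{\Delta A}{A}\right) = \frac{1}{T\sqrt{E_r}\,\ln(T)}\left[\frac{-\ln(T)}{2T\ln(T)\sqrt{\frac{1}{T} + 1}} + \frac{-2T\left(\frac{1}{T} + 1\right)}{2T\ln(T)\sqrt{\frac{1}{T} + 1}}\right] \tag{48-115}$$

$$\frac{d}{dT}\operatorname{SD}\left(\frac{\Delta A}{A}\right) = \frac{1}{T\sqrt{E_r}\,\ln(T)}\left[\frac{-\ln(T) - 2T\left(\frac{1}{T} + 1\right)}{2T\ln(T)\sqrt{\frac{1}{T} + 1}}\right] \tag{48-116}$$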

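Now we can set the derivative equal to zero:

$$0 = \frac{1}{T\sqrt{E_r}\,\ln(T)}\left[\frac{-\ln(T) - 2T\left(\frac{1}{T} + 1\right)}{2T\ln(T)\sqrt{\frac{1}{T} + 1}}\right] \tag{48-117}$$

and simplify the expression:

$$0 = -\ln(T) - 2T\left(\frac{1}{T} + 1\right) \tag{48-118}$$

$$0 = \ln(T) + 2T + 2 \tag{48-119}$$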

Equation 48-119 is a much simpler equation than most of the ones we have had to deal with before, including equation 42-50 (which is the corresponding equation for the constant-detector-noise case [2]); nevertheless, it is still a transcendental equation and is best solved by successive approximations. The solution to 5 decimal places is T = 0.10886, or 10.886 %T. The solution given by Ingle and Crouch for this case, which again does not take into account the variation of the reference channel, is 13.5 %T ([10], p. 153).
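One way to carry out the successive approximations is plain bisection, sketched below. The bracketing interval (0.01 to 0.5), the tolerance and the function names are assumptions made for this illustration; the original text does not say which numerical method was used.

```python
import math

def g(T):
    """Right-hand side of equation 48-119: ln(T) + 2T + 2, zero at the optimum."""
    return math.log(T) + 2.0 * T + 2.0

def bisect(f, lo, hi, tol=1e-12):
    """Plain bisection; assumes f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid < 0) == (f_lo < 0):
            lo, f_lo = mid, f_mid   # root lies in the upper half
        else:
            hi = mid                # root lies in the lower half
    return 0.5 * (lo + hi)

T_opt = bisect(g, 0.01, 0.5)        # bracket chosen by inspection (assumption)
print(f"optimum T = {T_opt:.5f}  ({100.0 * T_opt:.3f} %T)")

# For comparison: dropping the 2T term leaves ln(T) + 2 = 0, i.e. T = exp(-2)
print(f"exp(-2)   = {math.exp(-2.0):.4f}")
```

Because g(T) increases monotonically for T > 0, the bracket contains a single root. The last line is offered only as an observation: exp(-2) is about 0.135, which is numerically consistent with the 13.5 %T figure quoted above, on the assumption that that figure comes from an expression lacking the reference-noise term.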


We therefore see that in this case also, neglecting the reference channel error causes a noticeable change in the answer from the correct one.

To finish up this chapter, we discuss the use of $\left|\frac{1}{\ln(T)}\sqrt{\frac{1}{T} + 1}\right|$ as the expression we plotted in Figure 48-19. In passing from equation 48-102 to 48-103, we took the usual and intuitive step of using the positive square root of the expression in equation 48-102, which seems reasonable, since we are working with variances, which must always be positive, and standard deviations, which we also want to have positive values. However, when we come to plot the expression in equation 48-108, we find that since T is always less than unity, ln(T) is negative, and therefore the entire expression is negative. Thus, plotting this expression directly results in the curve having a maximum rather than a minimum at the point where the derivative is zero. Since this does not conform to reality, where we obtain the best precision rather than the worst, it is clear that this is an artifact of our choice of sign for the square root; the way we obtain a unique answer, and one that is in conformance with the real world, is to use the absolute value of the expression. Again, we continue our derivations in our next chapter.
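The sign issue can be seen numerically. The sketch below evaluates the T-dependent part of equation 48-108 on a grid, an illustrative choice of 0.005 to 0.995 in steps of 0.001 rather than the data table used for Figure 48-19, and locates both the maximum of the signed expression and the minimum of its absolute value.

```python
import math

def signed_expr(T):
    """(1/ln T) * sqrt(1/T + 1): the T-dependent part of equation 48-108."""
    return math.sqrt(1.0 / T + 1.0) / math.log(T)

grid = [k / 1000.0 for k in range(5, 996)]      # T = 0.005 ... 0.995 (assumed grid)
signed = [signed_expr(T) for T in grid]
absolute = [abs(v) for v in signed]

T_at_max_signed = grid[signed.index(max(signed))]
T_at_min_abs = grid[absolute.index(min(absolute))]

print("all values negative:", all(v < 0 for v in signed))
print("T at maximum of signed expression:", T_at_max_signed)
print("T at minimum of absolute value:   ", T_at_min_abs)
```

Every signed value is negative, so the point that appears as a maximum of the signed curve is the same point that appears as a minimum of the absolute value.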

REFERENCES

1. Cole, R., Journal of the Optical Society of America 41, 38–40 (1951).
2. Mark, H. and Workman, J., Spectroscopy 15(12), 14–17 (2000).
3. Mark, H. and Workman, J., Spectroscopy 16(11), 36–40 (2001).
4. Mark, H. and Workman, J., Spectroscopy 15(10), 24–25 (2000).
5. Mark, H. and Workman, J., Spectroscopy 15(11), 20–23 (2000).
6. Mark, H. and Workman, J., Spectroscopy 16(2), 44–52 (2001).
7. Mark, H. and Workman, J., Spectroscopy 16(4), 34–37 (2001).
8. Mark, H. and Workman, J., Spectroscopy 16(5), 20–24 (2001).
9. Mark, H. and Workman, J., Spectroscopy 16(7), 36–40 (2001).
10. Ingle, J.D. and Crouch, S.R., Spectrochemical Analysis (Prentice-Hall, Upper Saddle River, NJ, 1988).

49 Analysis of Noise: Part 10

This chapter is one more in the set of chapters starting at Chapter 40 and first published as [1–9], dealing with the rigorous derivation of the expressions relating instrument (and other) noise to its effects on the spectra we observe. The impetus for this was the realization that the previously existing theory was deficient in that the derivations extant ignored the effect of noise in the reference reading, which turns out to have appreciable effects on the nature of the derived noise behavior. Chapter 40 in this set (referenced as [1]) was an overview; Chapters 41–46 examined the effects of noise when the noise was due to constant detector noise (e.g., IR/NIR spectroscopy), and the last two chapters (47 and 48) began by considering the effects of noise when the noise is due to detectors, such as photomultipliers, that are shot-noise-limited, so that the detector noise is Poisson-distributed and therefore the standard deviation of the noise equals the square root of the signal level. The path we are taking pretty well follows the one we used for the constant-detector-noise case, and those two chapters derived the effects when the noise is small compared to the measured signal. Since we wish to continue following that same path, we now need to consider what happens when the optical signal falls to the point where the noise becomes an appreciable fraction of the measured signal, and the effects of the noise, such as induced nonlinearities, can no longer be neglected. And as we do in each chapter in this section of the book, we once more take this opportunity to remind our readers that we are dealing with a continuous series of chapters, and so we again continue our discussion by continuing our equation numbering, figure numbering, use of symbols, and so on as though there were no break, except that when we reuse an equation or series of equations that were derived and presented previously, we retain the original number(s) for those equation(s).

So let us continue. In Chapter 43 [4], to which the interested reader may wish to go back for a refresher, we discussed in general terms how and why the equations came about. We noted there that the point of departure for investigating what happens when the noise level becomes large enough that it can no longer be ignored was equation 49-5:

$$T + \Delta T = \frac{E_s + \Delta E_s}{E_r + \Delta E_r} \tag{49-5}$$

and we noted that in that case, that of Normally distributed noise, the expected computed value of T was

$$T = \frac{E_s}{E_r + \Delta E_r} \tag{49-52a}$$

the reason being, as we pointed out, that the other term that arose, $\Delta E_s/(E_r + \Delta E_r)$, would vanish from the expression for the expected value of T because of symmetry. In the current case, however, we cannot rely on that argument. The Poisson distribution is not symmetric around any particular value, as we will observe shortly when we present a graph of the members of the family of Poisson distributions, despite the fact that this distribution approaches the Normal distribution in the limit as the parameter λ → ∞. However, in addition to the fact that the distribution never becomes exactly Normal, our interest in this chapter is specifically to examine the effects occurring at small values of λ. Hence, in this case we must work with equation 49-5, rather than the simpler equation 49-52a:

$$T + \Delta T = \frac{E_s + \Delta E_s}{E_r + \Delta E_r} \tag{49-5}$$

We next noted that the expected value of T is computed from the general equation for an expected value:

$$T_W = \frac{\sum_i W_i\,F(X_i)}{\sum_i W_i} \tag{49-59}$$

F(X), here, is $(E_s + \Delta E_s)/(E_r + \Delta E_r)$, as we just noted. In the previous case, the weighting function was the Normal distribution. Our interest in the current development is to find out what happens when the noise is Poisson-distributed, rather than Normally distributed, since that is the distribution that applies to data whose noise is shot-noise-limited; the Poisson distribution is therefore the one we need to use for the weighting factor. Using P to represent the Poisson distribution, equation 49-59 now becomes

$$X_{WP} = \frac{\sum_i P(X_i)\,F(X_i)}{\sum_i P(X_i)} \tag{49-120}$$

Probability distributions have integrals that always equal unity (reflecting the reality that the argument must have SOME value every time it is evaluated, so that it is certain that some value will be obtained over the entire range of summation; certainty of obtaining the value a means that P(a) = 1). The denominator of equation 49-120 therefore equals unity and drops out, and equation 49-120 reduces to

$$X_{WP} = \sum_i P(X_i)\,F(X_i) \tag{49-121}$$

The Poisson distribution is actually a limiting case of the binomial distribution, a fact that is only of mild peripheral interest here, as we will not be using it. The formula for the Poisson distribution is

$$P(X) = \frac{e^{-\lambda}\lambda^{X}}{X!} \tag{49-122}$$

In our terminology, the parameter λ corresponds to $E_r$ or $E_s$, the (fixed) value of the energy to be measured, and X corresponds to $\Delta E_r$ or $\Delta E_s$, as appropriate. Therefore equation 49-122 becomes

$$P(X) = \frac{e^{-E_r}E_r^{\Delta E_r}}{\Delta E_r!} \tag{49-123}$$
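Equation 49-122 is simple to evaluate directly. The sketch below is an illustration only: the function name, the chosen values of λ and the truncation of the sum at X = 60 are assumptions. It checks two properties the text relies on, namely that the probabilities sum to unity and that the distribution is not symmetric about its mean.

```python
import math

def poisson_pmf(x, lam):
    """P(X) = exp(-lam) * lam**X / X!  (equation 49-122); x is a non-negative integer."""
    return math.exp(-lam) * lam**x / math.factorial(x)

# Normalization: a truncated sum over X = 0..60 is already unity to high precision
for lam in (0.2, 2.0, 11.0):
    probs = [poisson_pmf(x, lam) for x in range(61)]
    print(f"lambda = {lam:4.1f}   sum of P(X) = {sum(probs):.12f}   P(0) = {probs[0]:.4f}")

# Asymmetry: for lambda = 2, the counts 0 and 4 are equally far from the mean
below = poisson_pmf(0, 2.0)
above = poisson_pmf(4, 2.0)
print(f"lambda = 2:  P(0) = {below:.4f},  P(4) = {above:.4f}")
```

Note that λ is passed as a floating-point value while X must be an integer, which mirrors the distinction between the parameter and the counted readings.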

Figure 49-20 presents the Poisson distribution; Figure 49-20a shows the distribution for integer values of λ up to λ = 11, and Figure 49-20b shows this distribution for fractional values of λ up to λ = 2.

[Figure 49-20: two panels titled "Poisson distribution", each plotting P(X) against X. Panel (a): 1 ≤ λ ≤ 11, X from 0 to 19, curves labelled from λ = 1 to λ = 11. Panel (b): 0 < λ ≤ 2, X from 0 to 8, curves labelled from λ = 0.2 to λ = 2.]

Figure 49-20 Poisson distribution for several values of λ. Figure 49-20b is an expansion of Figure 49-20a, for values of λ between 0.2 and 2. (see Color Plate 16)

Now, one point in which the Poisson distribution differs from the Normal distribution is the presence of the parameter λ. As we show in Figure 49-20, different values of λ give rise to different curves. While they all share the property that their integral is unity, they also differ in several respects. The characteristic that we draw attention to first is that the curves have different shapes. This is a key difference from the Normal distribution; as we will recall, when we integrated equation 49-58, where the weighting factor was the Normal distribution, the resulting family of curves had similar shapes and differed only in their expansion along the abscissa, which then allowed describing their behavior as the same basic curve, but scaled by the standard deviation of the underlying distribution. In the case of the Poisson distribution, we would not expect that to happen, since the different curves in the family of Poisson distributions have different shapes to start with. This might be expected to give rise to a double family of curves, corresponding to different values of standard deviation and different values of λ. However, this is obviated by the fact that for the Poisson distribution, the standard deviation is “locked” to the underlying value of λ and cannot vary independently. We also note that, as we see in Figure 49-20, to the eye the Poisson distribution resembles the Normal distribution very closely at large values of λ, so the differences in the integral may not be easily seen by the eye, either, except at the very lowest values of λ. There are some other characteristics of the Poisson distribution that differ from the Normal distribution in ways that are of importance to us here. The chief one is that the Poisson distribution does not admit of negative values.
This makes intuitive sense; since the Poisson distribution is a distribution that results from a counting operation, the smallest number that you can achieve when counting objects is zero. We will be using this fact during the course of our derivations. Another point to be made is that, in fact, a value of zero is indeed a legitimate value for X. This comes from the generation of the distribution as the result of a counting operation: when counting photons in X-ray analysis, for example, if the average count in any given time interval is a small number (less than five, say), then there is a reasonable probability that in some of those time intervals there will in fact be no counts at all. Lambda (λ), however, is not restricted to integer values. Since λ represents the mean value of the data, and in fact is equal to both the mean and the variance of the distribution, there is no reason this mean value has to be restricted to integer values, even though the data itself is. We have already used this property of the Poisson distribution in plotting the curves in Figure 49-20b. To start our current derivation, we substitute the appropriate expressions for P(X) and F(X) into equation 49-121, and letting $E_s = TE_r$ we obtain the following:

$$X_{WP} = \sum_{\Delta E_r}\left[\frac{e^{-E_r}E_r^{\Delta E_r}}{\Delta E_r!}\right]\left[\frac{TE_r + \Delta E_s}{E_r + \Delta E_r}\right] \tag{49-124}$$

However, equation 49-124 is incomplete, the cause of the incompleteness being the presence of $\Delta E_s$ in the formula. As mentioned above, $\Delta E_s$ is also a random variable, is independent of $\Delta E_r$, and we do not expect its effect to cancel as it did with the Normal distribution. Therefore we must also compute the weighted sum over the (also Poisson-distributed) values of $\Delta E_s$, which, corresponding to the expression for the first term of equation 49-124, is

$$X_{WP} = \sum_{\Delta E_r}\left[\frac{e^{-E_r}E_r^{\Delta E_r}}{\Delta E_r!}\right]\left[\frac{TE_r}{E_r + \Delta E_r}\right] + \sum_{\Delta E_r}\left[\frac{e^{-E_r}E_r^{\Delta E_r}}{\Delta E_r!}\right]\left[\sum_{\Delta E_s}\frac{e^{-TE_r}(TE_r)^{\Delta E_s}}{\Delta E_s!}\,\frac{\Delta E_s}{E_r + \Delta E_r}\right] \tag{49-125}$$

To investigate the behavior of equation 49-125, we start by investigating the properties and behavior of the inner summation alone. Therefore let us break out that part of the equation and see what we have:

$$S_{WP} = \frac{1}{E_r + \Delta E_r}\sum_{\Delta E_s}\left[\frac{e^{-TE_r}(TE_r)^{\Delta E_s}}{\Delta E_s!}\right]\Delta E_s \tag{49-126}$$

where we have taken $1/(E_r + \Delta E_r)$ outside the summation (since it is not included in that summation and is therefore a constant for the summation), and we are now using the symbol $S_{WP}$ to indicate the weighted averaging to be done over the sample noise term alone. As we see, this is in itself the expected value of $\Delta E_s$, that is, the sum of the Poisson probabilities of the sample readings multiplied by the values of those readings. The values of $E_r$ and $\Delta E_r$ are constant for the summation over $\Delta E_s$ and therefore mainly act as a scaling factor; however, they do also affect the values and distribution achievable by the expected value, since the value of $E_s$ is limited to be no larger than $E_r$, or equivalently, 0 ≤ T ≤ 1. The summation over $\Delta E_s$, therefore, is still subject to the values of two parameters, $E_r$ and T. Let us take a look at the behavior of this system. Figure 49-21 shows, corresponding to Figure 43-5 [4], the Poisson distribution (for two values of λ, in Figures 49-21a and 49-21b respectively) overlaid with $P(TE_s) * F(TE_s)$, and also, as we showed in Figure 43-5, the cross-product of the two functions. The factor $1/(E_r + \Delta E_r)$ was set at unity. Since $\Delta E_s$ appears in the numerator, $F(TE_s)$ is linear with T. These figures show how each curve increases in magnitude and is shifted toward larger values of X. Figure 49-22 shows more members of the family of curves described by this function. It may be compared to Figure 49-20 to see how they relate to the original Poisson distributions.

Integrating the curves in Figure 49-22 (by performing the indicated summation) reveals that those integrals equal λ. In the limit of large values of λ this behavior is obvious, for the following reason: since the standard deviation of the distribution equals the square root of λ, at large enough values the distribution becomes essentially a “spike” of unit integral at λ, and effectively zero elsewhere; when this is multiplied by the function F = λ, the unit value is multiplied by λ, giving λ as the value of the integral. From Figures 49-20 and 49-21 it is not at all obvious that this same result is obtained at small values of λ. However, neither is it very surprising that the expected value of $\Delta E_s$ equals $E_s$, and it is gratifying to find that it is so. Given this result, we may now replace the entire inner summation in equation 49-126 by λ, which as we have seen is $E_s$, and in equation 49-125 we therefore set it equal to $TE_r$. Therefore,

$$S_{WP} = \frac{TE_r}{E_r + \Delta E_r} \tag{49-127}$$
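The result that the inner summation equals λ at every value of λ, small or large, also follows in one line from the series for the exponential. The following short derivation is offered as a supplementary check and is not part of the original development; it writes $k$ for $\Delta E_s$ and $\lambda$ for $TE_r$.

```latex
\sum_{k=0}^{\infty} k\,\frac{e^{-\lambda}\lambda^{k}}{k!}
  = e^{-\lambda}\sum_{k=1}^{\infty}\frac{\lambda^{k}}{(k-1)!}
  = \lambda\,e^{-\lambda}\sum_{j=0}^{\infty}\frac{\lambda^{j}}{j!}
  = \lambda\,e^{-\lambda}\,e^{\lambda}
  = \lambda
```

The $k = 0$ term contributes nothing, which is why the second sum starts at $k = 1$; the substitution $j = k - 1$ then recovers the exponential series. No assumption about the size of λ enters, so the identity holds equally for fractional values such as those in Figure 49-20b.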

[Figure 49-21: two panels titled "Weighted Poisson distribution", (a) λ = 1 and (b) λ = 2. Each panel shows curves labelled P(S), E_s (scaled), and P(S) × ΔE_s, plotted against ΔE_s from 0 to 10.]

Figure 49-21 Poisson distribution multiplied by $\Delta E_s$.

[Figure 49-22: plot titled "P(S) × ΔE_s". Ordinate: P(S) × ΔE_s (0.2 to 1.4). Abscissa: ΔE_s from 0 to 21. Curves labelled from λ = 1 to λ = 11.]

Figure 49-22 Family of functions of $P(S) \times \Delta E_s$ at various values of λ.

We may now also substitute this result in equation 49-125:

$$X_{WP} = \sum_{\Delta E_r}\left[\frac{e^{-E_r}E_r^{\Delta E_r}}{\Delta E_r!}\right]\left[\frac{TE_r}{E_r + \Delta E_r}\right] + \sum_{\Delta E_r}\left[\frac{e^{-E_r}E_r^{\Delta E_r}}{\Delta E_r!}\right]\left[\frac{TE_r}{E_r + \Delta E_r}\right] \tag{49-128}$$

which then simplifies to

$$X_{WP} = 2\sum_{\Delta E_r}\left[\frac{e^{-E_r}E_r^{\Delta E_r}}{\Delta E_r!}\right]\left[\frac{TE_r}{E_r + \Delta E_r}\right] \tag{49-129}$$

This is a result we could have obtained directly (and much more simply) by setting $\Delta E_s = TE_r$ in equation 49-124, but at that point we had no justification to do so. We are now interested in integrating equation 49-129; in this equation $E_r$ corresponds to λ and $\Delta E_r$ corresponds to X, the variable of integration (or summation, actually). Thus the equation has two parameters that can affect the result: $E_r$ and T. Our interest here is in the effect of $E_r$ on the nature of the computed transmittance at small values of $E_r$; therefore we consider T to be a constant as we integrate (sum) over values of $\Delta E_r$, and for the integration we take T outside the summation:

$$X_{WP} = 2T\sum_{\Delta E_r}\left[\frac{e^{-E_r}E_r^{\Delta E_r}}{\Delta E_r!}\right]\left[\frac{E_r}{E_r + \Delta E_r}\right] \tag{49-130}$$

Equation 49-130 is now exactly in the form of $X_{WP} = \sum P(X) * F(X)$ (times a scaling factor) as we started with in equation 49-121, and is now in a form that can be more easily worked with. More importantly, it is also in a form that is useful and convenient: it is in the form of T times a multiplying factor. It now remains to find out the nature and behavior of the multiplying factor. We will therefore now investigate the behavior of equation 49-130, similarly to the way we investigated equation 49-126 and, for that matter, the corresponding equation 43-62 for the case of Normally distributed noise [4]. We start by plotting the term $E_r/(E_r + \Delta E_r)$ (which we call $F(\Delta E_r)$) against $\Delta E_r$ in Figure 49-23, with $E_r$ as the parameter distinguishing the curves. While $E_r$ can in fact take non-integer values as described above, for our current discussion we will consider it as having integer values for the sake of convenience, although toward the end we will plot it using non-integer values when this serves our purpose. Therefore in Figure 49-23 we plot the values of $F(\Delta E_r)$ corresponding to integer values of the parameter $E_r$. One point we note is what we might expect from the nature of the term for $F(\Delta E_r)$ in equation 49-130: as $E_r$ assumes larger values, the term $E_r/(E_r + \Delta E_r)$ becomes less sensitive to the effect of $\Delta E_r$, becoming flatter and flatter as $E_r$ increases. This behavior is expected since, if we consider the behavior of $F(\Delta E_r)$ as $E_r$ becomes indefinitely large, $\Delta E_r$ will become negligible compared to $E_r$, thus giving the results for the large-signal situation that we obtained in the previous two chapters. At that point, with $\Delta E_r$ negligible compared to $E_r$, the expression reduces to $E_r/E_r$ which, of course, is unity. In Figure 49-24 we present the plots of $P(\Delta E_r)$, $F(\Delta E_r)$, and their cross-product (as we previously did for $P(\Delta E_s)$, $F(\Delta E_s)$, and their cross-product), as functions of $\Delta E_r$.

[Figure 49-23: plot titled "F(ΔEr)". Ordinate: F(ΔEr), 0 to 1. Abscissa: ΔEr, 1 to 91. Curves labelled from Er = 1 to Er = 11.]

Figure 49-23 $F(\Delta E_r)$ at various values of $E_r$.

[Figure 49-24: two panels titled "P(ΔEr) × F(ΔEr)", (a) Er = 1 (λ = 1) and (b) Er = 2 (λ = 2). Each panel shows the curves F, P and their product, plotted against ΔEr from 1 to 10.]

Figure 49-24 Terms for $P(\Delta E_r)$, $F(\Delta E_r)$, and their product.

[Figure 49-25: two panels. Panel (a): 1 ≤ λ ≤ 11, titled "Family of terms of P(X) × F(X)", abscissa running from 1 to 21, curves labelled from λ = 1 to λ = 11. Panel (b): 0.2 ≤ λ ≤ 2, abscissa running from 0 to 7, curves labelled from λ = 0.2 to λ = 2.]

Figure 49-25 Family of terms for $P(\Delta E_r) \times F(\Delta E_r)$. Figure 49-25b is an expansion of Figure 49-25a, for small values of λ. (see Color Plate 17)

In Figure 49-25 we present the family of cross-products, for various values of the parameter $E_r$, again corresponding to our treatment of $\Delta E_s$. Figure 49-25 presents this family in two parts: Figure 49-25a presents the family for integer values of $E_r$ up to 11, while Figure 49-25b concentrates on the family members corresponding to values of $E_r$ less than 2.0. It becomes clear that when $E_r$ becomes small enough, the inflation of the value of the function at small values of $\Delta E_r$ can become indefinitely large.

[Figure 49-26: plot titled "Multiplier factor for T". Ordinate: multiplier (0.2 to 1.8). Abscissa: Er, from 0 to 9.8.]

Figure 49-26 Multiplying factor for T from equation 49-130.

Finally, Figure 49-26 presents the multiplying factor of T from integrating the terms of equation 49-130:

$$\text{Multiplying factor} = 2\sum_{\Delta E_r}\left[\frac{e^{-E_r}E_r^{\Delta E_r}}{\Delta E_r!}\right]\left[\frac{E_r}{E_r + \Delta E_r}\right]$$

as a function of $E_r$. As we might have expected at this point, the multiplying factor takes values above unity as $E_r$ → 0, and approaches unity as $E_r$ grows large. The behavior of noise data following the Poisson distribution differs from the behavior of that following the Normal distribution that we observed previously, in that the multiplying factor obtained from the Poisson distribution does not go through a maximum and then approach zero as $E_r$ → 0, which was the behavior we observed for the Normal distribution. The reason for this difference is clear, and is due to one of the characteristics of the Poisson distribution we noted above: the Poisson distribution does not admit of negative values, while the Normal distribution does. Thus, when data following the Normal distribution is averaged, including these negative values in the averaging process reduces the average that is computed, and the computed mean therefore approaches zero as $E_r$ → 0, at which point the data contains as many negative values as positive values. Since data following the Poisson distribution has no negative values, this effect cannot occur, and therefore in this case the multiplying factor stays above unity as $E_r$ → 0 rather than falling toward zero. As we noted, as $E_r$ grows large, the multiplying factor approaches unity, as it must in order that T approach its defined value of $E_s/E_r$ for the large-signal situation.

DISCUSSION

Equation 49-130, and the plot of the multiplying factor presented in Figure 49-26, seem pretty straightforward, but in fact there is a significant problem attendant on their application to the real world, that is, to actual measurements. We were able to generate that equation and figure based on the fact that Er can, in fact, take non-integer values. Since ΔEr can be zero, equation 49-130 is prevented from diverging only by the fact that Er is non-zero, even if non-integer. In a sense, however, that is only a mathematical fiction, since in a real-world measurement we do not know the value of Er. If we did, we would not need to make the measurement.

The only quantities we know in a real-world measurement are the individual readings, for which the Poisson distribution effectively provides us with the probability of obtaining each possible value (0, 1, 2, 3, ...) for a given value of Er (i.e., λ). This represents a key difference between the Poisson and the Normal distributions. As we discussed at the appropriate point in our derivations dealing with the Normal distribution, a value of exactly zero is never obtained in that case [4]. When we make an actual, real-world measurement from data following the Poisson distribution, the reading we obtain will be one of the values 0, 1, 2, 3, ... each time we make the measurement. Some of the time, with a probability that depends on the value of Er, the reading will be exactly zero, a situation which would not occur if the Normal distribution were the operative distribution. For example, if λ = 0.5, we will never obtain an actual reading of 0.5 counts; what will actually happen is that roughly 6/10 of the measurements will contain zero counts, roughly 3/10 will contain one count, and the remaining stragglers (about 1/10) will contain more than one count: only the average number of counts from many measurements will be 0.5.
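The split of readings for a mean of 0.5 counts follows directly from the Poisson formula. The short sketch below simply evaluates that formula; the variable names are illustrative.

```python
import math

lam = 0.5  # mean count per reading, as in the example in the text

# Poisson probability of exactly k counts: exp(-lam) * lam**k / k!
p_zero = math.exp(-lam) * lam**0 / math.factorial(0)
p_one = math.exp(-lam) * lam**1 / math.factorial(1)
p_two_or_more = 1.0 - p_zero - p_one

print(f"P(0 counts)   = {p_zero:.4f}")
print(f"P(1 count)    = {p_one:.4f}")
print(f"P(>=2 counts) = {p_two_or_more:.4f}")
```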
In this case, putting even a single value of zero for Er into equation 41-6 [2], unless the measurement of the corresponding Es is also zero (which will give a computed value for T that is undefined, in both the mathematical and the real-world senses), the computed value of T for that reading will be infinite. Clearly, averaging together an infinite value with any number of finite values will still result in a computed average whose value is also infinite. What can we make of this situation? If we knew Er we could deal with the real-world case. In principle we can find out Er by measuring sufficiently many times and averaging together many readings (some of which may still be zero, but that’s OK in this case). To make those measurements, however, will take a longer time and if we are willing to spend the time to make the measurements, we can simply do that at the start, and let the counts accumulate so that we can work in a regime farther removed from the Er → 0 situation that is causing all this trouble in the first place. That is certainly one solution. To measure for many short time intervals and average together the readings certainly is, in principle, another solution, but one that we cannot find a justification for. Perhaps some of our readers knows of, or can do a thought experiment to come up with a scenario that would require many separate short data collection sessions that would provide data that could be averaged as we describe, but does not allow for a single protracted measurement. The bottom line is that the underlying reason for the problem we ran into is the fundamental difference between a continuous (the Normal case) and a discrete (Poisson) distribution. 
In the first case, values of exactly zero will never be obtained. A value may come arbitrarily close to zero, and the difference from zero may be unmeasurable by a particular instrument, but we can argue that even then the measurement of an exact zero value is an artifact of the discrete measurement levels inherent in the use of A/D converters. As we will see, however, the solution to this difficulty is the same as the solution to the creation of distortions we found when operating at low signal-to-noise levels when the Normal distribution is the operative one.

In any case, using single readings of Er when the individual values equal zero is not an option, due to the generation of the infinity. However, as long as no single reading comes up with a value of zero, there is nothing wrong with making the short-time measurements and averaging together the computed values of transmittance. We simply need to make sure that in any given series of readings the probability of obtaining a value of zero for any reading is small enough that it does not actually occur during our series of readings. Toward this end we present these probabilities in Table 49-3; they were computed directly from the formula for the Poisson distribution, for several values of λ, for X = 0. From this table, and some elementary probability theory which can be found in virtually any book on elementary Statistics, or in our early chapters (collected in [10]), the interested reader can pick a value for λ which will virtually always give high enough counts that no reading will ever be zero.

In the practical matter of performing the summations indicated for the various formulas that must be evaluated, the question arises as to how many terms need to be included; this question is analogous to the need to decide the limits of integration that was implicit in evaluating the analogous expressions for the Normal distribution. In the case of the Poisson distribution this is one decision that is actually easier to make. The reason is

Table 49-3 Probability of obtaining a reading of zero, for various values of the parameter λ

Lambda    Poisson probability at X = 0
1         0.367879441
2         0.135335283
3         0.049787068
4         0.018315639
5         0.006737947
6         0.002478752
7         0.000911882
8         0.000335463
9         0.000123410
10        0.000045400
11        0.000016702
12        0.000006144
13        0.000002260
14        0.000000832
15        0.000000306
16        0.000000113
17        0.000000041
18        0.000000015
19        0.000000006
20        0.000000002


twofold. The first reason is that we are in fact doing a summation. In the case of the Normal distribution, the summation that was done was an approximation to an integral, and therefore engendered questions as to how closely the summation we performed approximated that integral, a question that was affected by the size of the interval used for the summations. The Poisson distribution, as equation 49-122 shows, is defined directly as a summation, and the question of approximating an integral does not arise.

The second reason is that the expression in the denominator of equation 49-122 contains the factorial of the term number. This factorial increases much faster than any of the expressions in the numerator, and therefore successive terms fairly quickly become very small once the term number exceeds the value of λ. Hence it requires only a relatively small number of terms for the summation to converge to a point such that the sum of the remaining terms is less than the precision of the computer; with standard double-precision number representation, this is approximately 10⁻¹⁶. Inspection of the values of individual terms reveals that for λ up to 10, this point is reached at the 46th term in the worst case. To gain some margin, however, all computations were done using 100 terms of the summation expressed in equation 49-122. Again in the worst case (i.e., λ = 10), the value of the 100th term is 4.86 × 10⁻⁶³.

From Table 49-3, we see that for values of λ greater than about 5, the probability of obtaining a value of zero becomes very small. From Figure 49-26 (or, strictly speaking, from the table of values from which Figure 49-26 was plotted) the value of F(Er)P(Er) at Er = 5 is 1.052283, so that gives us an upper limit of approximately 5% as the amount of distortion that we can expect to be realized in an actual measurement situation. Again, we continue our derivations in our next chapter.
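The "elementary probability theory" referred to above, and the term-count argument, can both be sketched in a few lines. This is an illustrative sketch only: it assumes the readings in a series are independent, and the function names, the series length, and the acceptable risk are hypothetical choices, not values from the text.

```python
import math

def prob_any_zero(lam, n_readings):
    """Probability that at least one of n independent readings is zero."""
    return 1.0 - (1.0 - math.exp(-lam)) ** n_readings

def min_lambda(n_readings, max_risk):
    """Smallest mean count for which prob_any_zero does not exceed max_risk."""
    return -math.log(1.0 - (1.0 - max_risk) ** (1.0 / n_readings))

def poisson_term(k, lam):
    """Single Poisson term exp(-lam) * lam**k / k!, computed via logs."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def first_negligible_term(lam, eps=1e-16):
    """First term number beyond lam whose Poisson term falls below eps."""
    k, term = 0, math.exp(-lam)
    while k <= lam or term >= eps:
        k += 1
        term *= lam / k  # recurrence: term_k = term_(k-1) * lam / k
    return k

# Hypothetical scenario: 1000 readings, at most a 0.1% chance of any zero.
print("required mean count:", round(min_lambda(1000, 0.001), 2))
print("first term below 1e-16 for lambda = 10:", first_negligible_term(10.0))
print("term number 100 for lambda = 10:", poisson_term(100, 10.0))
```

The factorial in the denominator is what makes the terms collapse so quickly once the term number passes λ, which is why a fixed cutoff of 100 terms leaves ample margin for λ up to 10.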

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 15(10), 24–25 (2000).
2. Mark, H. and Workman, J., Spectroscopy 15(11), 20–23 (2000).
3. Mark, H. and Workman, J., Spectroscopy 15(12), 14–17 (2000).
4. Mark, H. and Workman, J., Spectroscopy 16(2), 44–52 (2001).
5. Mark, H. and Workman, J., Spectroscopy 16(4), 34–37 (2001).
6. Mark, H. and Workman, J., Spectroscopy 16(5), 20–24 (2001).
7. Mark, H. and Workman, J., Spectroscopy 16(7), 36–40 (2001).
8. Mark, H. and Workman, J., Spectroscopy 16(11), 36–40 (2001).
9. Mark, H. and Workman, J., Spectroscopy 16(12), 23–26 (2001).
10. Mark, H. and Workman, J., Statistics in Spectroscopy, 1st ed. (Academic Press, New York, 1991).


50

Analysis of Noise: Part 11

This chapter is a continuation of a series of chapters starting with Chapter 40 up to 49 [1–10] dealing with the rigorous derivation of the expressions relating the effect of instrument (and other) noise to their effects to the spectra we observe. As we do in each chapter in this section of the book, we again take this opportunity to remind our readers that we are dealing with a continuous series of chapters, and so we again continue our discussion by continuing our equation numbering, figure numbering, use of symbols, and so on as though there were no break, except that when we repeat an equation or series of equations that were derived and presented previously, we retain the original number(s) for those equation(s). We forego summarizing all the previous work, except to note that so far we have treated similarly the cases of detector noise following both the Normal and the Poisson distributions, finding expressions for the noise of transmittance readings, then the noise of absorbance readings, followed by finding the transmittance value at which the optimum analytical accuracy can be obtained (defined as the transmittance corresponding to the minimum relative absorbance S/N), that was followed by the derivation of the expected value of transmittance for the case where the signal falls so low as to be comparable to the detector noise level. We are currently at the point in the treatment of Poisson-distributed noise where, to continue following the procedure set up for the case of Normally distributed noise, we wish to derive the value of the expected noise of transmittance readings when the signal falls so low that the optical signal level is comparable to the detector noise. So we are ready to continue. 
Before doing so, however, let us remind ourselves of one of the key points we learned during our examination of the properties of the expression for the transmittance of samples when the reference energy is low: since the Poisson distribution is a discrete distribution, when the reference energy is low there is a reasonably high probability that a reading containing zero counts will be obtained. A reading of exactly zero will effectively never occur when a continuous distribution is the governing distribution, so we have a situation that we have not run into before: a high likelihood that a divide-by-zero computation will occur when computing the transmittance. This will give rise to a computed value of infinity for the transmittance. It remains to be seen whether there will also be a similar effect on the computed noise.

As we start on this next piece of the pie, we remind ourselves that we wish to build on the work we have done previously, so as not to have to repeat the derivations of already-derived expressions. Hence we will note the high points and provide the references to where the interested reader can review the pertinent mathematical steps. We start by following the derivation of the transmittance noise for the constant-noise situation,


which we presented in [5]. We began with the definition of transmittance T according to equation 50-6:

$$T = \frac{E_s}{E_r} \tag{50-6}$$

We then applied the Propagation of Uncertainties expression:

$$\Delta F(C, D) = \frac{\partial F(C, D)}{\partial C}\,\Delta C + \frac{\partial F(C, D)}{\partial D}\,\Delta D \tag{50-64}$$

where C = Es and D = Er, to obtain

$$\Delta T = \frac{\Delta E_s}{E_r} + \frac{-E_s\,\Delta E_r}{E_r^2} \tag{50-66}$$

and after taking the variance of equation 50-66 and applying the two statistical theorems that allow us to simplify the expressions, we obtained

$$\mathrm{Var}(\Delta T) = \left(\frac{1}{E_r}\right)^2 \mathrm{Var}(\Delta E_s) + \left(\frac{-E_s}{E_r^2}\right)^2 \mathrm{Var}(\Delta E_r) \tag{50-70}$$

Previously, in the case of constant detector noise, we then set Var(ΔEs) and Var(ΔEr) equal to the same value. This is the point at which we must now depart from the previous derivation, since in the case of Poisson-distributed noise the sample and reference noise levels will rarely, if ever, be the same. However, we are fortunate in this case that Poisson-distributed noise has a unique and very useful property that we have indeed previously made use of: the variance of Poisson-distributed noise is equal to the mean signal value. Hence we can substitute Es for Var(ΔEs) and Er for Var(ΔEr):

$$\mathrm{Var}(\Delta T) = \left(\frac{1}{E_r}\right)^2 E_s + \left(\frac{-E_s}{E_r^2}\right)^2 E_r \tag{50-131}$$

$$\mathrm{Var}(\Delta T) = \frac{E_s}{E_r^2} + \frac{E_s^2}{E_r^3} \tag{50-132}$$

and setting Es = TEr:

$$\mathrm{Var}(\Delta T) = \frac{T}{E_r} + \frac{T^2}{E_r} \tag{50-133}$$

$$SD(\Delta T) = \sqrt{\frac{T}{E_r} + \frac{T^2}{E_r}} \tag{50-134}$$

Figure 50-27 plots the transmittance noise as a function of Er according to equation 50-134, for several values of the transmittance. As we observed from inspecting equation 50-134, at all values of Er, the noise increases with T, while for all values of T, the transmittance noise decreases inversely with the square root of the reference signal level. However, we remind our readers that, as we discussed in the previous chapter,

[Plot: "Transmittance noise from Poisson distribution"; ordinate: Noise; abscissa: Er (0 to 4.85); curves labeled T = 0.1 and T = 1.]

Figure 50-27 Transmittance noise for Poisson-distributed data as a function of Er at different values of parameter T , from equation 50-134.

values of Er less than 5 provide only a mathematical expectation, and a mathematical fiction, since any value of Er that is small enough to result in an actual zero reading will give an infinite value for the transmittance, and for the noise level. Nevertheless, equation 50-134 is valid for all values of Er, and therefore, while the plot we constructed includes values that cannot be achieved in reality, both the plot and the equation are valid in the range that can be actually measured.

It is also interesting to compare equation 50-134 with equation 50-72, which is the corresponding equation that describes the transmittance noise when the detector noise level is constant [5]:

$$SD(\Delta T) = \frac{SD(\Delta E)}{E_r}\sqrt{1 + T^2} \tag{50-72}$$
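To make the comparison concrete, the two expressions can be evaluated side by side. This is a minimal sketch: equation 50-134 is taken as SD(ΔT) = sqrt((T + T²)/Er), and the constant-noise form is assumed to be SD(ΔT) = (SD(ΔE)/Er)·sqrt(1 + T²), the expression listed for constant detector noise in Table 52-4. The function names and the sample values of Er and SD(ΔE) are illustrative.

```python
import math

def sd_t_poisson(t, er):
    """Transmittance noise for Poisson-distributed noise (equation 50-134)."""
    return math.sqrt((t + t * t) / er)

def sd_t_constant(t, er, sd_e):
    """Transmittance noise for constant detector noise (assumed form of 50-72)."""
    return (sd_e / er) * math.sqrt(1.0 + t * t)

sd_e = 1.0  # hypothetical constant detector noise, in the same units as Er
for er in (5.0, 20.0, 80.0):
    for t in (0.1, 0.5, 1.0):
        print(
            f"Er = {er:5.1f}  T = {t:3.1f}  "
            f"Poisson: {sd_t_poisson(t, er):.4f}  "
            f"constant: {sd_t_constant(t, er, sd_e):.4f}"
        )
```

The Poisson expression falls as the inverse square root of Er, while the constant-noise expression falls as the inverse of Er itself, so quadrupling the reference signal halves the former and quarters the latter.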

As usual, we will continue in the next chapter, where we will discuss the various aspects of absorbance noise that are of concern.

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 15(10), 24–25 (2000).
2. Mark, H. and Workman, J., Spectroscopy 15(11), 20–23 (2000).
3. Mark, H. and Workman, J., Spectroscopy 15(12), 14–17 (2000).
4. Mark, H. and Workman, J., Spectroscopy 16(2), 44–52 (2001).
5. Mark, H. and Workman, J., Spectroscopy 16(4), 34–37 (2001).
6. Mark, H. and Workman, J., Spectroscopy 16(5), 20–24 (2001).
7. Mark, H. and Workman, J., Spectroscopy 16(7), 36–40 (2001).
8. Mark, H. and Workman, J., Spectroscopy 16(11), 36–40 (2001).
9. Mark, H. and Workman, J., Spectroscopy 16(12), 23–26 (2001).
10. Mark, H. and Workman, J., Spectroscopy 17(1), 42–49 (2001).


51

Analysis of Noise: Part 12

This chapter is one more in the set of Chapters 40 through 50 [1–11] dealing with the rigorous derivation of the expressions relating instrument (and other) noise to its effects on the spectra we observe. As we do in each chapter in this section of the book, we again take this opportunity to remind our readers that we are dealing with a continuous series of chapters, and so we again continue our discussion by continuing our equation numbering, figure numbering, use of symbols, and so on as though there were no break, except that when we repeat an equation or series of equations that were derived and presented previously, we sometimes retain the original number(s) for those equation(s).

We can also report that work similar to that in the first few chapters of this "noise" subseries has been reported in Applied Spectroscopy [12]. This paper derives the expressions for the constant-detector-noise case using a calculus-based approach rather than the algebraic approach used in these chapters [2, 3]. It also includes experimental data that verify the correctness of the theoretical development reported in these chapters.

In the previous chapter, we examined the effect of noise on the computed transmittance. Now we wish to examine the behavior of the absorbance for Poisson-distributed noise when the reference signal is small. Our starting point for this is equation 51-24, which we derived previously [3] for the case of constant detector noise, but at the point we take it up the equations have not yet had any approximations applied, or any special assumptions made relating to the noise behavior:

$$\Delta A = \left(\frac{-0.4343\,E_r}{E_s}\right)\left(\frac{E_r\,\Delta E_s - E_s\,\Delta E_r}{E_r\,(E_r + \Delta E_r)}\right) \tag{51-24}$$

Our equation numbering system now causes us to jump from equation 51-24 to 51-135 for our next equation number:

$$\Delta A = \left(\frac{-0.4343\,E_r}{E_s}\right)\left(\frac{E_r\,\Delta E_s}{E_r\,(E_r + \Delta E_r)}\right) + \left(\frac{0.4343\,E_r}{E_s}\right)\left(\frac{E_s\,\Delta E_r}{E_r\,(E_r + \Delta E_r)}\right) \tag{51-135}$$

$$\Delta A = \left(\frac{-0.4343\,E_r}{E_s}\right)\frac{\Delta E_s}{E_r + \Delta E_r} + \frac{0.4343\,\Delta E_r}{E_r + \Delta E_r} \tag{51-136}$$

and upon taking the variance of ΔA and applying the theorem for the variance of a sum:

$$\mathrm{Var}(\Delta A) = \mathrm{Var}\left(\frac{-0.4343\,E_r}{E_s}\cdot\frac{\Delta E_s}{E_r + \Delta E_r}\right) + \mathrm{Var}\left(\frac{0.4343\,\Delta E_r}{E_r + \Delta E_r}\right) \tag{51-137}$$


and upon applying the theorem for the variance of a constant times a random variable:

$$\mathrm{Var}(\Delta A) = \left(\frac{-0.4343\,E_r}{E_s}\right)^2 \mathrm{Var}\left(\frac{\Delta E_s}{E_r + \Delta E_r}\right) + 0.4343^2\,\mathrm{Var}\left(\frac{\Delta E_r}{E_r + \Delta E_r}\right) \tag{51-138}$$

$$SD(\Delta A) = \sqrt{\left(\frac{-0.4343}{T}\right)^2 \mathrm{Var}\left(\frac{\Delta E_s}{E_r + \Delta E_r}\right) + 0.4343^2\,\mathrm{Var}\left(\frac{\Delta E_r}{E_r + \Delta E_r}\right)} \tag{51-139}$$

Here we again have the problem we previously encountered [4], of not being able to separate the individual variances out of the formulas, because ΔEr occurs in the denominator along with Er. There is no help for it but to calculate the individual terms ΔEs/(Er + ΔEr) and ΔEr/(Er + ΔEr) for all meaningful values of the distributions of ΔEs and ΔEr, and then compute their variance. There are several programming issues involved here, which we discuss a bit later in this chapter. The results are presented in Figure 51-28.

Continuing on to ascertain the value of transmittance corresponding to the optimum relative noise, we find that in [5] we demonstrated that Var(ΔA/A) was given by the following expression, which is still completely general:

$$\mathrm{Var}\left(\frac{\Delta A}{A}\right) = \left(\frac{1}{T \ln T}\right)^2 \mathrm{Var}\left(\frac{\Delta E_s}{E_r + \Delta E_r}\right) + \left(\frac{1}{\ln T}\right)^2 \mathrm{Var}\left(\frac{-\Delta E_r}{E_r + \Delta E_r}\right) \tag{51-77}$$

The problem we had then, which is the same problem we have now, is that again due to the presence of Er in the denominator of each term in the variance calculations, we cannot further separate the terms, to extract the variances of the sample and reference signals by mathematical analysis. Our solution to this problem previously was to use a Monte-Carlo numerical computer simulation to examine the performance of the noise described by these equations, since we could not do a numerical integration.

[Plot: "Absorbance noise"; ordinate: SD(ΔA) (0.00 to 2.00); abscissa: T (0.01 to 0.96).]

Figure 51-28 Absorbance noise for Poisson-distributed data at low values of the reference signal.


In the case of Poisson-distributed noise, we can do a systematic numerical calculation. The reasons we can do this now, when we could not do it for the Normal distribution, are the ones we have discussed previously:

1) The Poisson distribution is discrete, and so is more amenable to numerical computation.
2) ΔEr is never negative.
3) ΔEr always occurs in the denominator summed together with Er. Together with point 2, this means that the denominator is never zero as long as the reference energy Er is non-zero.

Therefore all terms to be included in the computation are finite. Again we repeat our reminder that the results of these computations are mathematical expectations; in a real measurement situation denominators of zero can be expected to occur when Er is less than approximately five.

Equation 51-77 was programmed in MATLAB, using the Poisson distribution for both ΔEr and ΔEs; the actual distribution used corresponded to the value of Er and Es, respectively. The computations were done for 0.01 ≤ T ≤ 0.99, and for values of 1 ≤ Er ≤ 10.

The computation is not straightforward (neither was the one for evaluating equation 51-139). The terms whose variances are to be computed have to themselves be computed. For the first term of equation 51-77, this means that all possible combinations of values of ΔEr and ΔEs have to be generated, the terms corresponding to each combination computed, and then each term weighted by its frequency according to the Poisson distribution with appropriate arguments. Since the Poisson distribution gives fractional values of the probabilities of occurrence of each value in the distribution, these probability values have to be multiplied by a number that will then provide an integer count of the terms that would have their variance computed. The first attempt created the actual full lists of the terms in equation 51-77.
A few short runs with small values of the multipliers (about 100 for each of Er and Es, giving 10,000 terms total) were quickly found to be unsatisfactory: the resulting plots were very ragged and uneven. The number of terms was increased to 5 × 10⁵, using a multiplier of 500 for Es and a multiplier of 1,000 for Er. It was found that using a larger number of terms than that, although desirable because it made the curve smoother, caused "out of memory" problems in MATLAB. At this number of computation points, although smoother than with fewer points, the curves were still visibly ragged to the eye.

The attempt to create the full lists of terms was abandoned. Instead, one term of each combination of values of Es and Er was computed, and the program kept track of how many times that term would appear in the full list. This allowed the programming to use the computation of weighted averages and variances, the weighting factors being the number of times a given term would appear in the full list of terms. While more complicated to program, this scheme allowed the computation of the results for the equivalent of very large lists indeed. The actual results presented here are based on the use of 10,000 values to represent the Poisson distribution for Er and the same number for Es, providing a result equivalent to a list of 10⁸ terms.

Another issue that must be kept in mind when setting up the program is that, despite appearances, the computation of the variance terms is not independent of the value of T that appears in the coefficients of the variance terms in equations 51-139 and 51-77. The reason is that Er determines the distribution that must be used for ΔEr, and T and Er together determine Es and hence the distribution of ΔEs that must be used in the variance computation.

The resulting plot is presented in Figure 51-29. From the plot, and from examining the list of values from which the plot was made, there appears to be no shift in the transmittance corresponding to the optimum value of relative absorbance as the reference reading varies.

As usual, we will continue in the next chapter; we will now start on the derivations of formulas relating to the effects of what we have previously called "scintillation noise", and which is also called "flicker noise", "source noise", and other labels. Basically this

[Plot, panels (a) and (b): "Relative absorbance noise"; ordinate: ΔA/A; abscissa: T (0.01 to 0.96); curve labels Er = 1 and Er = 10 in panel (a), Er = 3 and Er = 10 in panel (b).]

Figure 51-29 Relative absorbance noise for Poisson-distributed data, determined by numerical computation using equation 51-77. Figure 51-29b is an ordinate expansion of Figure 51-29a. (see Color Plate 18)


refers to noise caused by effects that cause the variations of the signal to be proportional to the signal. Are we having fun yet?
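The weighted-variance scheme described in this chapter for evaluating equation 51-77 can be sketched compactly. The following is an illustrative sketch, not the MATLAB program used for the figures. It assumes that ΔEr and ΔEs are independent Poisson variates with means Er and Es = T·Er, and it weights each combination directly by its joint probability rather than by an integer count; the function names and the 100-term cutoff are hypothetical.

```python
import math

def poisson_pmf_list(lam, k_max):
    """Poisson probabilities P(k; lam) for k = 0..k_max, built by recurrence."""
    probs = [math.exp(-lam)]
    for k in range(1, k_max + 1):
        probs.append(probs[-1] * lam / k)
    return probs

def weighted_variance(pairs):
    """Variance of the values in (value, weight) pairs, using the weights."""
    total_w = sum(w for _, w in pairs)
    mean = sum(v * w for v, w in pairs) / total_w
    return sum(w * (v - mean) ** 2 for v, w in pairs) / total_w

def relative_absorbance_variance(t, er, k_max=100):
    """Var(dA/A) from equation 51-77 under the assumptions stated above."""
    es = t * er  # T and Er together fix the sample-beam mean
    p_r = poisson_pmf_list(er, k_max)
    p_s = poisson_pmf_list(es, k_max)
    # One entry per combination of dEs and dEr, weighted by joint probability.
    term_s = [
        (ds / (er + dr), p_s[ds] * p_r[dr])
        for ds in range(k_max + 1)
        for dr in range(k_max + 1)
    ]
    term_r = [(-dr / (er + dr), p_r[dr]) for dr in range(k_max + 1)]
    ln_t = math.log(t)
    return (1.0 / (t * ln_t)) ** 2 * weighted_variance(term_s) + (
        1.0 / ln_t
    ) ** 2 * weighted_variance(term_r)

for t in (0.1, 0.3, 0.5, 0.7, 0.9):
    sd = math.sqrt(relative_absorbance_variance(t, er=10.0))
    print(f"T = {t:.1f}   SD(dA/A) = {sd:.4f}")
```

Computing each combination once and carrying its weight avoids building the full list of terms, which is the memory problem the full-list approach ran into.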

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 15(10), 24–25 (2000).
2. Mark, H. and Workman, J., Spectroscopy 15(11), 20–23 (2000).
3. Mark, H. and Workman, J., Spectroscopy 15(12), 14–17 (2000).
4. Mark, H. and Workman, J., Spectroscopy 16(2), 44–52 (2001).
5. Mark, H. and Workman, J., Spectroscopy 16(4), 34–37 (2001).
6. Mark, H. and Workman, J., Spectroscopy 16(5), 20–24 (2001).
7. Mark, H. and Workman, J., Spectroscopy 16(7), 36–40 (2001).
8. Mark, H. and Workman, J., Spectroscopy 16(11), 36–40 (2001).
9. Mark, H. and Workman, J., Spectroscopy 16(12), 23–26 (2001).
10. Mark, H. and Workman, J., Spectroscopy 17(1), 42–49 (2001).
11. Mark, H. and Workman, J., Spectroscopy 17(6), 24–25 (2002).
12. Mark, H.L. and Griffiths, P.R., Applied Spectroscopy 56(5), 633–639 (2002).


52 Analysis of Noise: Part 13

This chapter is a continuation of the set of Chapters 40 to 51 [1–12] dealing with the rigorous derivation of the expressions relating the effect of instrument (and other) noise to their effects to the spectra we observe. As we do in each chapter in this section of the book we again take this opportunity to remind our readers that we are dealing with a continuous series of chapters, and so we again continue our discussion by continuing our equation numbering, figure numbering, use of symbols, and so on as though there were no break, except that when we repeat an equation or series of equations that were derived and presented previously, we retain the original number(s) for those equation(s). We have now gone through the analysis of two cases pretty thoroughly. It should be apparent to the reader what our approach is, and how the analysis of these situations is attacked. Hopefully, therefore, we can now go a little faster than we have been. In the previous chapter, we pretty much finished up our discussion of noise that was Poisson-distributed. In one sense Poisson-distributed noise is a special case since, for example, when we analyzed the effects of noise that was constant, we did not, until it became pertinent, consider the noise to have any particular distribution. However, since Poisson noise arises naturally out of a particular noise mechanism, and it is one that occurs in several different spectral regions and in conjunction with different technologies, it was appropriate to consider it as an entity unto itself. The next noise source we consider is what we originally called “scintillation noise” and which, as we noted in the previous chapter, is also called by several other labels: source noise, flicker noise, and other labels. The defining characteristic of this noise is that the variability is directly proportional to the intensity of the signal. One way this can arise is through a mechanical vignetting of an optical beam. 
If a piece of metal were to block, say, 1% of a homogeneous beam, then the intensity of the beam would be reduced by 1% of its total intensity, so that the absolute reduction of the signal from an intense beam would be greater than for a weak beam. Another way this can arise is if, for example, a photoresistive detector is in use, then the detector current will be proportional to the detector voltage as well as the intensity of radiation impinging on it. Then if the detector voltage varies, the change in detector current for a given voltage change will again be proportional to the optical intensity, but with a sensitivity proportional to the detector voltage. In either case, if the amount of beam blockage or the detector voltage is random, then this sensitivity change becomes a source of random variation proportional to the signal intensity. Another characteristic of scintillation noise is that, since it represents the amount of energy in the optical beam, it can never attain a negative value. In this respect it is similar to the Poisson distribution, which also can never attain a negative value. On the other hand, since it is a continuous distribution it will behave the same way as the constant-noise case in regard to achieving an actual zero: any given reading can become


infinitesimally close to zero, but there is zero probability of actually achieving an exact value of zero, except in the case of a complete absence of signal. It differs from the Poisson case in two respects, however. First, the distribution of the variations is not predetermined, but depends on the nature of the changes causing the signal variation. Secondly, the magnitude of the changes is not predetermined, but depends on the amount of variation of the cause. At an appropriate point we will have to accommodate this by introducing a constant representing the magnitude of the variations. We will go a bit farther in characterizing the various types of noise sources we consider, and in Table 52-4 we list, for comparison purposes, the corresponding characteristics of the three types of noise we have considered or are considering.

So let us begin our analysis. As we did for the analysis of shot (Poisson) noise [8], we start with equation 52-17, wherein we had derived the expression for the variance of the transmittance without having introduced any special assumptions except that the noise was small compared to the signal, and that is where we begin our analysis here as well. For the derivation of this equation, we refer the reader to [2]. So, for the case of noise proportional to the signal level, but small compared to the signal level, we have

$$\mathrm{Var}(\Delta T) = \left(\frac{1}{E_r}\right)^2 \mathrm{Var}(\Delta E_s) + \left(\frac{-T}{E_r}\right)^2 \mathrm{Var}(\Delta E_r) \tag{52-17}$$

In the derivation of the transmittance noise in the case of Poisson-distributed noise, at this point we noted that the variances of ΔEr and ΔEs were proportional to Er and Es respectively. In the current case, the corresponding relationship is that the standard

Table 52-4 Comparisons between noise characteristics, including the expressions for low-noise behavior

Type of noise | Constant detector noise | Shot noise | Scintillation noise
Relation to signal | Independent of signal | Square root | Proportional
Continuous | Yes | No | Yes
Variance locked to signal level? | No | Yes | No
Distribution | Not predetermined | Poisson | Not predetermined
Negativity | Negative values possible | Non-negativity constraint | Non-negativity constraint
Probability of zero value | Zero | Finite | Zero
Expression for transmittance noise | (SD(ΔE)/Er)·√(1 + T²) | √((T + T²)/Er) | √2·kT
Expression for relative absorbance noise, SD(ΔA)/A | SD(ΔE)·√(Es² + Er²) / (Es·Er·ln(Es/Er)) | (1/ln T)·√(1/Es + 1/Er) | √2·k / ln T


deviation of the noise on Er and Es is proportional to Er and Es, with a proportionality factor, k, that is related to the magnitude of the physical cause of the noise, i.e.:

$$SD(\Delta E_r) = kE_r \qquad SD(\Delta E_s) = kE_s$$

The variances of ΔEr and ΔEs, then, are k²Er² and k²Es² respectively, and substituting these values in equation 52-17 gives

$$\mathrm{Var}(\Delta T) = \left(\frac{1}{E_r}\right)^2 k^2 E_s^2 + \left(\frac{-T}{E_r}\right)^2 k^2 E_r^2 \tag{52-140}$$

$$\mathrm{Var}(\Delta T) = \frac{k^2 E_s^2}{E_r^2} + k^2 T^2 \tag{52-141}$$

$$\mathrm{Var}(\Delta T) = 2k^2 T^2 \tag{52-142}$$

$$SD(\Delta T) = \sqrt{2}\,kT \tag{52-143}$$
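The result SD(ΔT) = √2·kT can be checked with a small simulation. The sketch below is illustrative only: since the distribution of scintillation noise is not predetermined, it assumes Gaussian fluctuations with a small proportionality factor k applied independently to the sample and reference beams. The function name, sample size, and seed are hypothetical.

```python
import math
import random

def simulated_sd_t(t, k, er=1000.0, n=100_000, seed=1):
    """Standard deviation of simulated transmittance readings Es/Er."""
    rng = random.Random(seed)
    es = t * er
    total = 0.0
    total_sq = 0.0
    for _ in range(n):
        # Noise proportional to the signal: SD of each reading is k * signal.
        es_read = es * (1.0 + k * rng.gauss(0.0, 1.0))
        er_read = er * (1.0 + k * rng.gauss(0.0, 1.0))
        ratio = es_read / er_read
        total += ratio
        total_sq += ratio * ratio
    mean = total / n
    return math.sqrt(total_sq / n - mean * mean)

k = 0.01
for t in (0.1, 0.5, 1.0):
    predicted = math.sqrt(2.0) * k * t
    print(f"T = {t:.1f}  simulated: {simulated_sd_t(t, k):.6f}  predicted: {predicted:.6f}")
```

Changing the `er` argument leaves the simulated value essentially unchanged, since the reference energy cancels in the ratio when the noise is proportional to the signal.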

Given the simplistic nature of this relationship, we forbear to plot the function, although we note that there is again a family of functions, corresponding to the various values of k. We note that, in contrast to the previous two cases, the transmittance noise depends on the magnitude of the effect and on the transmittance of the sample, but does not depend on the energy of the reference beam; in other words, whereas in the previous two cases the signal-to-noise level of the reference beam was a key factor in determining the behavior of the transmittance noise, here it is not. This conforms to intuition, since when we state that the noise superimposed on the signal is proportional to the signal, the implicit consequence is that the signal-to-noise (or noise-to-signal) ratio is constant.

Let us now, as we normally do, continue by deriving the expressions for absorbance noise. Again referring to our previous chapter [8], we can start with equation 52-29:

$$\mathrm{Var}(\Delta A) = \left(\frac{-0.4343}{E_s}\right)^2 \mathrm{Var}(\Delta E_s) + \left(\frac{0.4343}{E_r}\right)^2 \mathrm{Var}(\Delta E_r) \tag{52-29}$$

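Again substituting $k^2E_r^2$ and $k^2E_s^2$ for the two variances in equation 52-29:

$$\mathrm{Var}(\Delta A) = \left(\frac{-0.4343}{E_s}\right)^2 k^2E_s^2 + \left(\frac{0.4343}{E_r}\right)^2 k^2E_r^2 \tag{52-144}$$

$$\mathrm{Var}(\Delta A) = 2 \times 0.4343^2\, k^2 \tag{52-145}$$

$$\mathrm{SD}(\Delta A) = \sqrt{2} \times 0.4343\, k \tag{52-146}$$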
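The cancellation in equations 52-144 to 52-146 can be seen numerically with a small sketch that evaluates equation 52-144 term by term for arbitrary beam energies. The function name and the energy values are illustrative assumptions; the constant 0.4343 is computed as $\log_{10} e$.

```python
import math

LOG10_E = math.log10(math.e)  # 0.4343..., the factor converting ln to log10


def sd_absorbance(e_s, e_r, k):
    """First-order absorbance noise from equation 52-144 with SD(dE) = k * E."""
    var_es = (k * e_s) ** 2
    var_er = (k * e_r) ** 2
    var_a = (-LOG10_E / e_s) ** 2 * var_es + (LOG10_E / e_r) ** 2 * var_er
    return math.sqrt(var_a)


k = 0.02  # hypothetical proportionality factor
for e_s, e_r in ((0.1, 1.0), (5.0, 10.0), (300.0, 400.0)):
    print(e_s, e_r, sd_absorbance(e_s, e_r, k), math.sqrt(2.0) * LOG10_E * k)
```

Whatever energies are supplied, the two printed values agree to within rounding, since each energy cancels within its own term.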


Here, in the low-noise case of scintillation noise, the absorbance noise is again independent of the reference signal level. It is now independent of the sample characteristics as well, and depends only on the magnitude of the external noise source.

In conformance with our regular pattern, we now derive the behavior of the relative absorbance noise for the low-noise case. Here we start with equation 52-100, the derivation of which is found in [9]:

$$\mathrm{Var}\left(\frac{\Delta A}{A}\right) = \frac{1}{(E_s \ln T)^2}\,\mathrm{Var}(\Delta E_s) + \frac{1}{(E_r \ln T)^2}\,\mathrm{Var}(\Delta E_r) \tag{52-100}$$

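And once more substituting $k^2E_r^2$ and $k^2E_s^2$ for the two variance terms:

$$\mathrm{Var}\left(\frac{\Delta A}{A}\right) = \frac{1}{(E_s \ln T)^2}\,k^2E_s^2 + \frac{1}{(E_r \ln T)^2}\,k^2E_r^2 \tag{52-147}$$

$$\mathrm{Var}\left(\frac{\Delta A}{A}\right) = \frac{2k^2}{(\ln T)^2} \tag{52-148}$$

$$\mathrm{SD}\left(\frac{\Delta A}{A}\right) = \frac{\sqrt{2}\,k}{\ln T} \tag{52-149}$$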

Equation 52-149 presents a minor difficulty, one that is easily resolved, so let us do so. The difficulty actually arises in the step between equations 52-148 and 52-149, the taking of the square root of the variance to obtain the standard deviation; conventionally we take the positive square root. However, T takes values from zero to unity; that is, it is always less than unity. The logarithm of a number less than unity is negative, hence under these circumstances the denominator of equation 52-149 would be negative, which would lead to a negative value of the standard deviation. But a standard deviation must always be positive; clearly then, in this case we must use the negative square root of the variance to compute the standard deviation of the relative absorbance noise. In Figure 52-30 we plot the function $-1/\ln T$ to complete this part of the analysis.

We note that there is no minimum to the curve, and the noise from this source continually improves as the transmittance decreases; in this case the previous, conventional derivations agree with our results, although they do not indicate the $\sqrt{2}$ factor. Noting the transitions from equation 52-140 to 52-142 (and the corresponding portions of the derivation for absorbance noise and relative absorbance noise), we see that this factor arises from the equal noise contributions of the sample and reference channels; therefore we conclude that in this case also, the missing factor is due to the neglect of the reference channel noise contribution. The noise also rises ever more steeply as T increases (not surprising for a logarithmic function!), so that working at transmittance values less than, say, 0.7 or 0.8 is prudent. Of course, we must also remember that our derivations are idealizations, and as Ingle and Crouch point out ([13], p. 153), in a real measurement situation, at some point another noise source would become dominant and limit the actual noise observed.
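The behavior described above can be tabulated with a few lines of code. This is only a sketch: the function name and the value of k are illustrative assumptions, and the absolute value of $\ln T$ is used so that the standard deviation comes out positive, as argued in the text.

```python
import math


def relative_absorbance_noise(t, k):
    """SD(dA/A) = sqrt(2) * k / |ln T|, equation 52-149 taken as a positive quantity."""
    if not 0.0 < t < 1.0:
        raise ValueError("transmittance must lie strictly between 0 and 1")
    return math.sqrt(2.0) * k / abs(math.log(t))


k = 0.01  # hypothetical proportionality factor
transmittances = (0.05, 0.2, 0.5, 0.8, 0.95)
noise = [relative_absorbance_noise(t, k) for t in transmittances]
for t, value in zip(transmittances, noise):
    print(f"T = {t:4.2f}   SD(dA/A) = {value:.4f}")
```

The tabulated values rise monotonically with T and climb steeply as T approaches unity, with no minimum anywhere in the range.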

Figure 52-30 (a) Plot of $-1/\ln T$ (ordinate, 0 to 100) against transmittance T (abscissa, 0 to 0.96). (b) Ordinate expansion of Figure 52-30a (ordinate 0 to 25, T from 0 to 0.52).

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 15(10), 24–25 (2000).
2. Mark, H. and Workman, J., Spectroscopy 15(11), 20–23 (2000).
3. Mark, H. and Workman, J., Spectroscopy 15(12), 14–17 (2000).
4. Mark, H. and Workman, J., Spectroscopy 16(2), 44–52 (2001).
5. Mark, H. and Workman, J., Spectroscopy 16(4), 34–37 (2001).
6. Mark, H. and Workman, J., Spectroscopy 16(5), 20–24 (2001).
7. Mark, H. and Workman, J., Spectroscopy 16(7), 36–40 (2001).
8. Mark, H. and Workman, J., Spectroscopy 16(11), 36–40 (2001).
9. Mark, H. and Workman, J., Spectroscopy 16(12), 23–26 (2001).
10. Mark, H. and Workman, J., Spectroscopy 17(1), 42–49 (2001).
11. Mark, H. and Workman, J., Spectroscopy 17(6), 24–25 (2002).
12. Mark, H. and Workman, J., Spectroscopy 17(12), 38–41, 56 (2002).
13. Ingle, J.D. and Crouch, S.R., Spectrochemical Analysis (Prentice-Hall, Upper Saddle River, NJ, 1988).

53

Analysis of Noise: Part 14

This chapter is a continuation of chapters 40 to 52 [1–13] dealing with the rigorous derivation of the expressions relating instrument (and other) noise to its effects on the spectra we observe. As we do in each chapter in this section of the book, we again take this opportunity to remind our readers that we are dealing with a continuous series of chapters, and so we continue our equation numbering, figure numbering, use of symbols, and so on as though there were no break, except that when we repeat an equation or series of equations that were derived and presented previously, we retain the original number(s) for those equation(s).

In the previous chapter we analyzed the effect of scintillation noise, that is, noise that is proportional to the signal, for the case of low noise (i.e., noise small compared to the signal level). Now we wish to analyze, as we have done previously, the situation when the noise is no longer negligible compared to the signal level. Here again we enter territory where extra care is needed. In the previous cases, we were able to assume that all conditions of the measurement were constant except that the reference energy was reduced until it was of comparable magnitude to the noise level. In the case of scintillation noise, however, that is not an option. As we noted earlier, as the signal level is reduced (in either channel), the noise is reduced correspondingly, leaving the N/S ratio (i.e., the inverse of the S/N; in our notation, $\mathrm{SD}(\Delta E_r)/E_r$) constant. Therefore we cannot consider reducing the S/N by reducing the reference signal level. The only way we can change the signal-to-noise ratio is by changing the proportionality parameter k, which expresses the noise as a fraction of the signal. As we did for the low-noise case, we will introduce this parameter at the appropriate point in the derivation.
Following our usual sequence, our next step here then, as it was for the previous two cases we treated (constant detector noise and Poisson-distributed noise), is to ascertain the effect of this noise on the expected value of computed transmittance. To do this we start with equation 53-5 (reference [2]):

$$T + \Delta T = \frac{E_s + \Delta E_s}{E_r + \Delta E_r} \tag{53-5}$$

In the low-noise case we were able to justify separating equation 53-5 into two terms and setting $\Delta T$ equal to $\Delta E_s/(E_r + \Delta E_r)$. Here we cannot do that, for two reasons:

1) In the large-noise case, $\Delta E_r$ in the denominator of equation 53-5 is non-negligible and therefore induces an asymmetry that will prevent it from vanishing upon integration.
2) An even larger asymmetry is introduced by a fact we have discussed previously: the physical causes of the error source under consideration preclude both the numerator and the denominator from becoming negative. Thus when we evaluate equation 53-5 to ascertain the expected values, we cannot continue integration below zero; the integration must be truncated at that point.

For a physical picture to describe the situation, we can imagine an optical beam, and some opaque component vibrating randomly into and out of the beam. A schematic picture of this is shown in Figure 53-31a. Clearly, the further into the beam the obstruction intrudes, the more the beam is blocked and the less energy reaches the detector. The instantaneous blockage depends on two factors: the average position of the obstruction and the magnitude of the vibration. For our purposes we will assume that the position of the obstruction varies around its central location in such a manner that the distribution of energy in the optical beam varies according to a Normal distribution. Anyone who wants to calculate the actual distribution for an optical beam of interest to them is certainly free to do so and to follow through on the calculation of the distribution of noise that the actual optical geometry will cause. We will certainly appreciate hearing about any efforts in that direction, and the results obtained.

For our purposes, however, we simply wish to point out that as the center of the obstruction's motion moves further into the beam, more and more of the beam is blocked. Also, if the vibrational amplitude of the obstruction's motion is constant, then the blockage will represent larger and larger fractions of the beam's energy. However, there is a limit to that: if the obstruction moves so that it completely blocks the beam, then the instantaneous energy transmitted will be zero. As the obstruction continues to move further into the optical beam, complete blockage can occur more and more often, or equivalently, for larger and larger fractions of the time, but at no time can the energy transmitted become less than zero.
From Figure 53-31a we can also see a corollary: that if the amplitude of the obstruction's vibration is large enough, it will move completely out of the optical beam, resulting in truncation of the Normal distribution due to the fact that there will also be a maximum possible value for the energy, corresponding to the situation when the beam is completely unblocked. For our current analysis, however, we will consider only the case where the average position of the obstruction is within the beam and so the beam is always at least partially blocked. Then the vibrations of the obstruction cause it to block varying amounts of energy, down to complete blockage, but not to complete passage.
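One way to get a feel for this physical picture is a small simulation. The sketch below is an assumption-laden toy model, not the authors' calculation: it draws the instantaneous energy from a Normal distribution and clamps any negative draw to zero (complete blockage), ignoring the upper limit just as the text chooses to do. The function name, mean, standard deviation, sample count and seed are all hypothetical.

```python
import math
import random


def simulate_blocked_beam(mean_energy, sd_energy, n=100_000, seed=1):
    """Return (fraction of readings fully blocked, mean transmitted energy)."""
    rng = random.Random(seed)
    blocked = 0
    total = 0.0
    for _ in range(n):
        # The transmitted energy can never be less than zero.
        energy = max(0.0, rng.gauss(mean_energy, sd_energy))
        if energy == 0.0:
            blocked += 1
        total += energy
    return blocked / n, total / n


fraction_blocked, mean_transmitted = simulate_blocked_beam(1.0, 1.0)
# Probability that a Normal deviate falls more than mean/SD below its mean.
expected_fraction = 0.5 * (1.0 + math.erf(-1.0 / math.sqrt(2.0)))
print(fraction_blocked, expected_fraction, mean_transmitted)
```

With the standard deviation equal to the mean energy, the beam is fully blocked for a noticeable fraction of the readings, and the average transmitted energy is pulled above the nominal mean because the negative excursions are lost.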

Figure 53-31a An obstruction vignetting the beam randomly can affect the signal level, but cannot do more than block the entire beam, reducing the energy to zero. (Schematic: side view and end view of the obstruction intruding into the optical beam.)

Figure 53-31b Once the optical beam is completely blocked, no less light can pass through the optical system. The average light that then can pass is the integral of the shaded area. (Abscissa: Energy, with ticks from −2.09 to 3.73; ordinate: 0 to 1.)

The effect of this on the beam energy is indicated in Figure 53-31b. If the distribution of energies is Normal, then the part below the lower limiting edge is truncated, since it is not possible to have less than zero energy. As we did in the analysis of Poisson-distributed noise, we compute the expected value of T as the weighted sum of the transmittance described by equation 49-5 (reference [10]):

$$T_W = \frac{\sum_i W_i F(X_i)}{\sum_i W_i} \tag{53-59}$$

The evaluation of equation 53-59 for the current case of scintillation noise carries with it its own set of difficulties and cautions, just as the previous cases did. Some of them, caused by the physical limitation of not allowing the energy to go below zero, were mentioned above. Others mirror the two cases we have discussed in the past several chapters; the case of scintillation noise seems to combine some of the more difficult aspects of both. As with the Poisson distribution, the value of the function, and therefore of the integration, does not go below zero, mirroring the physical fact that the actual optical energy cannot go below zero. Unlike the Poisson distribution (which is discrete), on the other hand, the values of the energy form a continuum, as does the Normal distribution we are assuming that the noise follows. Therefore we cannot simply add together the relatively small number of discrete values that the function can assume, but must perform a numerical integration over the range of values that make appreciable contributions to the result.

Another consequence of not following the Poisson distribution is that the noise level is not locked to the energy level. Rather, the value of k, which determines the N/S ratio, is independent of the energy. This precludes any simplification of the equations such as we were able to apply to the Poisson-distributed noise case. On the other hand, neither can we apply some of the simplifications we used in the case of constant detector noise, particularly the fact that the integral of the Normal distribution is unity. Since in the case we deal with now the distribution is truncated, we must perform numerical integrations of the distribution corresponding to the amount of truncation, in order to ascertain the behavior of this situation.

This creates another complication in the analysis. While at first glance this limitation of not allowing the signal to go below zero seems like a benefit because it gives us a hard limit for the computation of the integrals, we also have to consider the effect of this limitation on the denominator of equation 53-59, as well as on the numerator. In the previous cases we have considered, the weighting function was a well-behaved mathematical probability function, either the Normal or the Poisson distribution. Both of these distributions integrate (or sum) to unity over the range of interest, and therefore we could replace the denominator with unity and ignore it thereafter. Now we wish to use the Normal distribution to describe the behavior of the error contribution we are evaluating, but cannot consider evaluating the integral from $-\infty$ to $+\infty$, since we have seen that there is a lower limit to the integral. Furthermore, the lower limiting value remains at zero regardless of the value corresponding to the maximum of the distribution.
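To make the role of the denominator in equation 53-59 concrete, the following sketch evaluates the weighted average on a grid of energies, using Normal-density weights that are simply not evaluated below zero energy. It is a minimal illustration under stated assumptions: the function names, the grid step, the upper limit of six standard deviations, and the choice F(X) = X (the energy itself) are mine, not the authors'.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of the Normal distribution with the given mean and SD."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def weighted_mean(values, weights):
    """Equation 53-59: sum(W_i * F(X_i)) / sum(W_i)."""
    return sum(w * f for w, f in zip(weights, values)) / sum(weights)


def truncated_mean_energy(mean, sd, step=0.001, upper_sds=6.0):
    """Weighted mean energy, and weight integral, with the weights cut off at zero energy."""
    n = round((mean + upper_sds * sd) / step)
    energies = [(i + 0.5) * step for i in range(n)]   # cell midpoints, all >= 0
    weights = [normal_pdf(e, mean, sd) for e in energies]
    return weighted_mean(energies, weights), sum(weights) * step


mean_energy, weight_integral = truncated_mean_energy(mean=1.0, sd=1.0)
print(mean_energy, weight_integral)
```

In this example the weights integrate to roughly 0.84 rather than unity, so dropping the denominator, as was permissible for the untruncated cases, would bias the result.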

PRELIMINARY STEPS

The evaluation of equation 53-59, therefore, starts with the evaluation of the truncated Normal distribution. This is the value obtained by integrating the Normal distribution not between $-\infty$ and $+\infty$, but between the lower cutoff value, whatever that is, and $+\infty$. The integral of the Normal distribution, which is closely related to the error function, is well known not to be expressible in closed form; numeric approximations are therefore needed to ascertain the value. Indeed, it has been computed to high accuracy and the values are available in tables; see for example ([14], p. 3). It is necessary, however, for us to be able to perform these computations ourselves, so that we can also use them in evaluating the weighted averages specified by equation 53-59. This is similar to the computations we performed previously for the case of constant (detector) noise, but differs in that the previous computation was done over the full significant range of the Normal distribution, instead of the truncated distribution.

As a test, then, the computation was written in MATLAB (Mathworks, Natick, Mass.). The result of the computation, for a continuum of values of the point of truncation, is shown in Figure 53-32. The accuracy of the integration was evaluated by comparing the values computed from the MATLAB program to the tables available ([14], p. 3) at several selected values of X (where X represents the number of standard deviations at which the distribution was truncated) as a function of the integration interval. The results are presented in Table 53-5.

We also inspect the nature of the function that we will be integrating. In the picture corresponding to the small-noise case, the underlying energy of the optical beam is effectively constant over the range of variation of energy; indeed, this is the definition of "small noise". For a Normally distributed vibration, the energy would thus also be Normally distributed.
For the large-noise case, however, the energy varies appreciably over the range of vibration, and the variation increases with k, distorting the shape of the curve. This behavior, which is shown in Figure 53-33, corresponds to the plot in Figure 43-5 (Chapter 43) for the case of constant detector noise (reference [4]). Since the nature of the curve will vary as the degree of truncation varies, this also represents a family of curves. Figure 53-34 presents this family of curves, for various values of the point of truncation with respect to the Normal distribution curve.

Figure 53-32 The integral of the truncated Normal distribution for values of the point of truncation between −3 and +3 standard deviations. In all cases the integration was continued to +4 standard deviations. (Abscissa: point of truncation, −3 to 2.94; ordinate: 0 to 1.2.)

Table 53-5 Accuracy of the integral of the truncated Normal curve for different values of the integration interval

Integration interval | Error of integral
0.1 | 0.019
0.01 | 0.0019
0.001 | 0.00017
0.0001 | 0.000019

Figure 53-33 The relation between the Normal distribution (truncated at −1 SD), the energy variation and the product of the two curves. (Curves labeled: Energy, Truncated normal distribution, Product.)

Figure 53-34 Family of curves of the Energy-Distribution product, corresponding to various truncation points. The numbers indicate the truncation point of the Normal distribution, as the number of standard deviations from the peak of the Normal distribution. (Curves labeled −1, −0.6, −0.3, 0 and +0.3.)

There is a point that we have implied in the foregoing discussion but have not made explicit, so let us correct that oversight now: in the previous discussions of the mathematics behind the analysis of scintillation noise, we pointed out that, since the noise decreases with the signal, changes in S/N cannot be accomplished by changing the reference signal energy (or, for that matter, the sample energy), since the noise will be reduced proportionately. Therefore, the noise level must be expressed as a multiplier, which we called k, times the signal level. This parameter, k, expresses the standard deviation of the noise as a fraction of the signal energy. Thus, the value at which the Normal distribution becomes truncated can be expressed as a function of k. For example, if k = 1 (the standard deviation of the energy due to movement of the obstruction equals the energy at the average position of the obstruction), then 95% of the time more energy will be present than the value E − 2k and the cutoff will be at −2k (strictly speaking, at −1.98k, but we will use the common approximation of 2 since it will be simpler to deal with other cases. Anyone concerned about the discrepancy can adjust the probability levels to compensate). If k = 2, then the corresponding cutoff value will be at −1k. This relates the mathematical quantity k to the properties of the Normal distribution that we will be working with in the evaluations of the integrals.

Indeed, there is a "gotcha" to watch out for. In the picture of Figure 53-31a we show an obstruction obscuring part of the optical beam.
If the physical amplitude of the vibrations of the obstruction is small, then k will increase as the average position of the obstruction moves closer and closer to the center of the beam, thus obscuring more of it, and reducing the energy by leaving a smaller and smaller crescent of the beam available. This movement of the obstruction corresponds to larger and larger values of k. Assuming that the distribution of positions of the obstruction is Normal, the value of k varies inversely with the average size of the crescent left available. When the average position of the obstruction corresponds to just being at the edge of the beam, then truncation occurs at 0 SD from the maximum of the Normal distribution. But there is nothing to prevent the obstruction from moving even further into obscuring the beam; in such a case light would be passing less than half the time, and the truncation point will have passed the center of the Normal distribution. This behavior is indicated in Figure 53-34, shown as the change in sign of the number of standard deviations corresponding to the point of truncation of the distribution in that figure. The "gotcha" is that it would require k to assume an infinite value in order to express the situation where the average position of the obstruction coincided with the edge of the optical beam, a situation which is physically reasonable but mathematically intractable. Therefore our evaluations of the integrals will be based on specifying the truncation point in terms of the standard deviation of the position of the obstruction, rather than in terms of the obscuration of the optical beam.
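The accuracy test described in the preliminary steps can be imitated with a short sketch. The authors' program was written in MATLAB and is not reproduced in the text, so what follows is an assumption: a plain left-rectangle sum of the standard Normal density from the truncation point up to +4 standard deviations, compared against the value obtained from the error function. The function names and the choice of truncation point (0 SD) are illustrative.

```python
import math


def standard_normal_pdf(z):
    """Density of the Normal distribution with zero mean and unit SD."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def truncated_normal_area(cutoff, step, upper=4.0):
    """Left-rectangle sum of the standard Normal density from cutoff to upper."""
    n = round((upper - cutoff) / step)
    return step * sum(standard_normal_pdf(cutoff + i * step) for i in range(n))


def reference_area(cutoff, upper=4.0):
    """The same integral obtained from the error function."""
    def cdf(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return cdf(upper) - cdf(cutoff)


cutoff = 0.0  # truncate at the peak of the distribution
errors = {}
for step in (0.1, 0.01, 0.001, 0.0001):
    errors[step] = abs(truncated_normal_area(cutoff, step) - reference_area(cutoff))
    print(f"interval = {step:<7}  error = {errors[step]:.6f}")
```

With this simple summation rule the error shrinks roughly in proportion to the integration interval, which is the same trend that Table 53-5 shows. A midpoint or trapezoidal rule would converge faster, at the cost of slightly more code.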

EVALUATION OF THE FUNCTION

We are now ready to evaluate the expressions in equation 53-59 and substitute them into equation 53-5. We will use the same value of k for both sample and reference beams. By having k the same, the results will be independent of the transmittance of the sample, as discussed previously. It also eases our task, since we will not have to compute a family of curves, but only one curve representing the change in computed transmittance as k varies. Evaluating it this way also eliminates the need to perform a double integration; we can simply keep the sample transmittance constant at unity, and plot the variation in computed transmittance. As described above, we do not compute the integral as a function of k directly. Rather, we compute it as a function of the point of truncation of the Normal distribution, which we allow to vary from +3 SDs to −3 SDs as the parameter. Figure 53-35 shows how the multiplication factor varies, for the various places where the center of the obstruction is. When the noise is small the multiplication factor approaches unity, as we would expect. As we have seen for the previous two types of noise we considered, the nonlinearity in the computation of transmittance causes the expected value of the computed transmittance to increase as the energy approaches zero, and then decrease again. For the type of noise we are currently considering, however, the situation is complicated by the truncation of the distribution, as we have discussed, so that when only the tail of the distribution is available (i.e., when the distribution is cut off at +3 standard deviations), the character changes from that seen when most of the distribution is used.

Figure 53-35 Transmittance multiplication factor as a function of the lower cutoff limit of the Normal distribution, for varying values of the center of the distribution. (Ordinate: transmittance multiplication factor, 0 to 2.0; abscissa: lower cutoff limit, −2.8 to 2.9; curves run from Center = −3 to Center = 3.)

Noise

To derive the transmittance noise for the case of large scintillation noise, we begin at a somewhat earlier point than we did for the low-noise case, with equation 41-14 [2]:

$$\mathrm{Var}(\Delta T) = \mathrm{Var}\left(\frac{E_r\,\Delta E_s}{E_r(E_r + \Delta E_r)}\right) + \mathrm{Var}\left(\frac{-E_s\,\Delta E_r}{E_r(E_r + \Delta E_r)}\right) \tag{53-14}$$

Attempting to solve this equation for the scintillation noise situation raises the same difficulties as the previous investigations of noise in the low-signal (high-noise) regime: the inseparability of the Er and Es terms in the denominator, and the generation of infinities in the integrals when attempting to evaluate it. In this case, however, we cannot make the infinities go away or ignore their existence. In the case of constant detector noise, we assumed the infinity away by making the assumption that no measurement would ever coincide with an exact zero value of the noise, since the probability of that would be infinitesimally small. In the case of Poisson noise we were also able to assume the infinity away: since the Poisson distribution is a discrete distribution, it could in fact take the value of exactly zero, but if this occurred in the denominator of the transmittance computation the user would reject that reading, and it would not be included with the data. Therefore we were justified in rejecting readings with zero in the denominator from our calculations.

In the case of scintillation noise, however, we cannot do either of those things. By the physical picture we set up to describe the situation, it can in fact happen that the obstruction completely blocks the optical beam and allows zero energy through, yet since the energy represents a continuum of values we do not see a justification for arbitrarily rejecting those readings. Therefore we cannot see a clear path to determining the noise performance of such a system, since it will inevitably come out as infinite in all cases. This seems to be a good stopping point.
The title of this book is "Chemometrics in Spectroscopy" and for the past several chapters we have departed somewhat from that general topic to discuss in some detail the very specialized question of noise in spectra. While not outside the range of interest covered by the book's intent, it is somewhat near the edges of what might be considered the mainstream purview of the book, and it is time to return to a more mainstream discussion, or at least one closer to the center of the topic.


In creating chemometric calibrations, it is common to transform the spectrum, for any of various reasons, from the measured format, which is usually absorbance, into a different format. One common, widely used transformation is to compute a derivative of the spectrum. First ($dA/d\lambda$) and second ($d^2A/d\lambda^2$) derivatives are often used. Hence, in our next few chapters we will be discussing the properties and behavior of derivatives.

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 15(10), 24–25 (2000).
2. Mark, H. and Workman, J., Spectroscopy 15(11), 20–23 (2000).
3. Mark, H. and Workman, J., Spectroscopy 15(12), 14–17 (2000).
4. Mark, H. and Workman, J., Spectroscopy 16(2), 44–52 (2001).
5. Mark, H. and Workman, J., Spectroscopy 16(4), 34–37 (2001).
6. Mark, H. and Workman, J., Spectroscopy 16(5), 20–24 (2001).
7. Mark, H. and Workman, J., Spectroscopy 16(7), 36–40 (2001).
8. Mark, H. and Workman, J., Spectroscopy 16(11), 36–40 (2001).
9. Mark, H. and Workman, J., Spectroscopy 16(12), 23–26 (2001).
10. Mark, H. and Workman, J., Spectroscopy 17(1), 42–49 (2001).
11. Mark, H. and Workman, J., Spectroscopy 17(6), 24–25 (2002).
12. Mark, H. and Workman, J., Spectroscopy 17(12), 38–41, 56 (2002).
13. Mark, H. and Workman, J., Spectroscopy 17(12), 123–125 (2002).
14. Owen, D.B., Handbook of Statistical Tables (Addison-Wesley Publishing Co., Inc., Reading, MA, 1962).


54 Derivatives in Spectroscopy: Part 1 – The Behavior of the Derivative

THE BEHAVIOR OF THEORETICAL DERIVATIVES

Derivatives of spectra ($dT/d\lambda$ or $dA/d\lambda$, and their wavenumber equivalents in FTIR) have been known and used in spectroscopy for a long time. Both first derivatives and second derivatives ($d^2T/d\lambda^2$ or $d^2A/d\lambda^2$) are in common use in modern spectroscopy, particularly in NIR spectroscopy. They also enjoy widespread use in some nonoptical spectroscopic techniques, such as NMR and ESR spectroscopies. The mathematics and behavior of the derivative are independent of the particular spectroscopic technique to which it is applied. But since our own backgrounds are in optical spectroscopy, where pertinent we will discuss it in terms of the spectroscopy we are familiar with.

Studies of the application of derivatives to spectroscopy go back at least as far as 1953 [1–3]. A more recent paper contains a good bibliography of the work prior to its appearance [4]. Since NIR spectroscopy became a popular analytical technique, the routine use of derivative spectra has burgeoned along with the applications of this method of spectroscopic analysis. Along with the increased applicability, interest has grown in the background and behavior of derivatives. Dave Hopkins especially has led the way in understanding the behavior of first and second derivatives, particularly their computation using Savitzky-Golay convolution functions [5, 6]. We do not plan to deal with that aspect too extensively at this time, however.

The application of derivatives is not without problems, especially when the concern is to accurately represent the derivative of a given data spectrum. Therefore understanding the nature of the problems encountered, so that the proper decisions can be made regarding how the derivative should be calculated, is crucial to obtaining optimum results. Figure 54-1 illustrates some of the problems of derivatives.
This figure also illustrates some of the basic behaviors underlying the use of derivatives for spectroscopic analysis. The top curve in Figure 54-1 represents a synthetic spectrum, with two Gaussian (Normal) bands, one of 20 nm bandwidth and one of 60 nm bandwidth. Spectroscopic band shapes are conventionally considered to be either Gaussian or Lorentzian; in this chapter we will concentrate on Gaussian band shapes, therefore all our figures are based on Gaussian-shaped bands. We will, however, treat Lorentzian bands at appropriate points. In Figure 54-1 the spacing between wavelength points is 1 nm, a number that will become important later on. The middle curve represents the first "derivative" and the bottom curve the second "derivative" of the absorbance band.

We have been putting the term "derivative" in quotes because these curves are, in fact, not true derivatives. The definition of a derivative includes the step of taking a limit as differences approach zero. In the real world, with real data we can never calculate a true derivative, since we must compute the differences between finite data points, and these must be taken over finite intervals, so that computed derivatives are approximations to the actual derivative. The absorbance spectrum in Figure 54-1 is made from synthetic data, but mimics the behavior of real data in that both are represented by data points collected at discrete and (usually) uniform intervals. Therefore the calculation of a "derivative" from actual data is really the computation of finite differences, usually between adjacent data points, although, as we shall see, data points that are more widely spread are often used. We will now remove the quotation marks from around the term, and simply call all the finite-difference approximations a derivative. If the data points are sufficiently close together, then the approximation to the true derivative can be quite good. Nevertheless, a true derivative can never be measured when real data is involved.

Figure 54-1 Two Gaussian absorbance bands and their respective first and second "derivatives" (finite differences). The top spectrum represents a synthetic Gaussian absorbance spectrum, the middle a first "derivative" and the bottom a second "derivative". Note that the ordinate of the first "derivative" has been expanded by a factor of 10 and the second "derivative" by another factor of 10. The wavelength spacing between data points is 1 nm. The narrow band has a bandwidth (FWHH) of 20 nm, the broad one is 60 nm. (Abscissa: wavelength, 1100 to 1499 nm; insets: expanded views of each band marking ΔX and ΔY.)

Figure 54-1, however, still shows a number of characteristics that reveal the behavior of derivatives. First of all, we note that the first derivative crosses the X-axis at the wavelength where the absorbance peak has a maximum, and has maximum values (both positive and negative) at the points of maximum slope of the absorbance bands. These characteristics, of course, reflect the definition of the derivative as a measure of the slope of the underlying curve. For Gaussian bands, the extrema of the first derivative also fall one standard deviation on either side of the peak of the underlying spectral curve.
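The finite-difference behavior described above can be reproduced with a short sketch. The band positions (1200 nm and 1400 nm), unit peak heights and function names are assumptions chosen for illustration; only the 1 nm spacing and the 20 nm and 60 nm bandwidths come from the text.

```python
import math

# For a Gaussian band, sigma = FWHH / (2 * sqrt(2 * ln 2)), about FWHH / 2.3548
FWHH_TO_SIGMA = 1.0 / (2.0 * math.sqrt(2.0 * math.log(2.0)))


def gaussian_band(wavelengths, center, fwhh, height=1.0):
    """Synthetic Gaussian absorbance band sampled at the given wavelengths."""
    sigma = fwhh * FWHH_TO_SIGMA
    return [height * math.exp(-0.5 * ((w - center) / sigma) ** 2) for w in wavelengths]


def first_difference(y):
    """Adjacent-point differences; element i lies midway between points i and i + 1."""
    return [y[i + 1] - y[i] for i in range(len(y) - 1)]


def second_difference(y):
    """Second differences; element j is centered on point j + 1."""
    return [y[i + 1] - 2.0 * y[i] + y[i - 1] for i in range(1, len(y) - 1)]


wavelengths = list(range(1100, 1500))            # 1 nm spacing
narrow = gaussian_band(wavelengths, 1200, 20)    # hypothetical band position
broad = gaussian_band(wavelengths, 1400, 60)     # hypothetical band position

d1_narrow, d1_broad = first_difference(narrow), first_difference(broad)
d2_narrow, d2_broad = second_difference(narrow), second_difference(broad)

# Position of the positive first-difference maximum and of the second-difference minimum
slope_peak = wavelengths[d1_narrow.index(max(d1_narrow))] + 0.5
curvature_peak = wavelengths[d2_narrow.index(min(d2_narrow)) + 1]
print("narrow band: maximum slope at", slope_peak, "nm; second-difference minimum at", curvature_peak, "nm")
print("first-difference ratio, narrow/broad:", max(d1_narrow) / max(d1_broad))
print("second-difference ratio, narrow/broad:", min(d2_narrow) / min(d2_broad))
```

For bands of equal height the first difference of the narrow band comes out about three times larger than that of the broad band, and the second difference about nine times larger, in line with the threefold difference in bandwidth.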


The second derivative, in contrast, has its maximum value at the same wavelength as the underlying peak, although in the negative-going direction. The second derivative crosses the X-axis at the points of maximum slope of the underlying absorbance band (the extrema of the first derivative), and because of that presents a much sharper-appearing band than the underlying absorbance band does. The problem arises, however, that this “sharpening” effect is accompanied by the creation of two artifact peaks, the two positive-going peaks that flank the negative-going portion of the second derivative. In complicated spectra, therefore, it can sometimes be difficult to distinguish true spectral features from the artifacts created by the second derivative calculation.

Finally, we note that the magnitude of both the first and the second derivatives of the narrow absorbance band is considerably greater than the corresponding magnitudes for the wider absorbance band. This characteristic is a consequence of the fact that the slope of the narrower band really is greater than that of the broader band of the same height, as can be seen in the expanded views of the two absorbance bands in Figure 54-1. For the same ΔX, the narrow absorbance band has a much larger value of ΔY than the broad absorbance band does, therefore ΔY/ΔX (the derivative) is larger for that band. A similar situation is true for the second derivative as well.

There is an additional consideration as well, however: the mathematical definition of a Normal curve includes a premultiplying factor of 1/(σ(2π)^(1/2)), which makes the area under the Normal curve equal to unity. Therefore, the wider the bandwidth, the smaller the maximum value of the curve will be, further reducing the slope as compared to a narrower band. It is interesting and useful to consider this quantitatively. The expression for the Normal distribution is [7]:

$$Y = \frac{1}{\sigma(2\pi)^{1/2}}\, e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}} \tag{54-1a}$$

The corresponding expression for the Lorentzian distribution is [8] (see p. 211):

$$Y = \frac{2}{\pi\sigma} \times \frac{1}{1 + \left[\frac{2(\mu - X)}{\sigma}\right]^{2}} \tag{54-1b}$$

where σ is the measure of bandwidth (and equals the standard deviation for the Normal curve); and μ is the wavelength corresponding to the peak center. We note parenthetically here that equation 54-1a includes the premultiplying factor for constant area. The expression for a Normal curve of constant maximum height (of unity) will be simply:

$$Y = e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}} \tag{54-2}$$

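The first derivative of the Normal distribution, from the expression in equation 54-1a, then, is

$$\frac{dY}{dX} = \frac{1}{\sigma(2\pi)^{1/2}}\, e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}}\, \frac{d}{dX}\left[-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}\right] \tag{54-3}$$

$$\frac{dY}{dX} = \frac{1}{\sigma(2\pi)^{1/2}}\, e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}} \left(-\frac{1}{2\sigma^{2}}\right) \frac{d}{dX}(X-\mu)^{2} \tag{54-4}$$

$$\frac{dY}{dX} = \frac{1}{\sigma(2\pi)^{1/2}}\, e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}} \left(-\frac{1}{2\sigma^{2}}\right) 2(X-\mu) \tag{54-5}$$

$$\frac{dY}{dX} = \frac{-(X-\mu)}{\sigma^{3}(2\pi)^{1/2}}\, e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}} \tag{54-6a}$$

Equation 54-6a is derived from the constant-area expression for the Normal curve; from the constant-height expression we obtain

$$\frac{dY}{dX} = \frac{-(X-\mu)}{\sigma^{2}}\, e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}} \tag{54-6b}$$

The origin of the features seen qualitatively in Figure 54-1 can be observed in either of equations 54-6a or 54-6b. When X = μ, the derivative is zero, and the sign of the derivative changes from positive when X < μ to negative when X > μ. The presence of the negative exponential term ensures that the derivative will asymptotically approach zero as X approaches infinity in both directions.

Similarly, from equation 54-6a we can derive the expression for the second derivative of the Normal distribution:

$$\frac{d^{2}Y}{dX^{2}} = \frac{-(X-\mu)}{\sigma^{3}(2\pi)^{1/2}}\, e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}}\, \frac{d}{dX}\left[-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}\right] + e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}}\, \frac{d}{dX}\left[\frac{-(X-\mu)}{\sigma^{3}(2\pi)^{1/2}}\right] \tag{54-7}$$

$$\frac{d^{2}Y}{dX^{2}} = \frac{-(X-\mu)}{\sigma^{3}(2\pi)^{1/2}}\, e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}} \left(-\frac{1}{2\sigma^{2}}\right) 2(X-\mu) + e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}}\, \frac{-1}{\sigma^{3}(2\pi)^{1/2}} \tag{54-8}$$

$$\frac{d^{2}Y}{dX^{2}} = \left[\frac{(X-\mu)^{2}}{\sigma^{5}(2\pi)^{1/2}} - \frac{1}{\sigma^{3}(2\pi)^{1/2}}\right] e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}} \tag{54-9a}$$

And from equation 54-6b we similarly obtain

$$\frac{d^{2}Y}{dX^{2}} = \left[\frac{(X-\mu)^{2}}{\sigma^{4}} - \frac{1}{\sigma^{2}}\right] e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}} \tag{54-9b}$$

For the Lorentzian distribution, from equation 54-1b the first derivative is

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{-1}{\left\{1 + \left[\frac{2(\mu - X)}{\sigma}\right]^{2}\right\}^{2}}\, \frac{d}{dX}\left\{1 + \left[\frac{2(\mu - X)}{\sigma}\right]^{2}\right\} \tag{54-10}$$

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{8\sigma^{2}(\mu - X)}{\left[\sigma^{2} + 4(\mu - X)^{2}\right]^{2}} \tag{54-11}$$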


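And then the second derivative of the Lorentzian distribution is

$$\frac{d^{2}Y}{dX^{2}} = \frac{2}{\pi\sigma} \times \left\{ \frac{\left[\sigma^{2} + 4(\mu - X)^{2}\right]^{2} \frac{d}{dX}\left[8\sigma^{2}(\mu - X)\right]}{\left[\sigma^{2} + 4(\mu - X)^{2}\right]^{4}} - \frac{8\sigma^{2}(\mu - X)\, \frac{d}{dX}\left[\sigma^{2} + 4(\mu - X)^{2}\right]^{2}}{\left[\sigma^{2} + 4(\mu - X)^{2}\right]^{4}} \right\} \tag{54-12}$$

$$\frac{d^{2}Y}{dX^{2}} = \frac{2}{\pi\sigma} \times \left\{ \frac{-8\sigma^{2}\left[\sigma^{2} + 4(\mu - X)^{2}\right]^{2}}{\left[\sigma^{2} + 4(\mu - X)^{2}\right]^{4}} - \frac{8\sigma^{2}(\mu - X) \times 2\left[\sigma^{2} + 4(\mu - X)^{2}\right] \frac{d}{dX}\left[\sigma^{2} + 4(\mu - X)^{2}\right]}{\left[\sigma^{2} + 4(\mu - X)^{2}\right]^{4}} \right\} \tag{54-13}$$

$$\frac{d^{2}Y}{dX^{2}} = \frac{2}{\pi\sigma} \times \left\{ \frac{-8\sigma^{2}\left[\sigma^{2} + 4(\mu - X)^{2}\right]^{2}}{\left[\sigma^{2} + 4(\mu - X)^{2}\right]^{4}} - \frac{16\sigma^{2}(\mu - X)\left[\sigma^{2} + 4(\mu - X)^{2}\right]\left[\frac{d}{dX}\sigma^{2} + \frac{d}{dX}4(\mu - X)^{2}\right]}{\left[\sigma^{2} + 4(\mu - X)^{2}\right]^{4}} \right\} \tag{54-14}$$

$$\frac{d^{2}Y}{dX^{2}} = \frac{2}{\pi\sigma} \times \left\{ \frac{-8\sigma^{2}\left[\sigma^{2} + 4(\mu - X)^{2}\right]^{2}}{\left[\sigma^{2} + 4(\mu - X)^{2}\right]^{4}} - \frac{16\sigma^{2}(\mu - X)\left[\sigma^{2} + 4(\mu - X)^{2}\right]\left[-8(\mu - X)\right]}{\left[\sigma^{2} + 4(\mu - X)^{2}\right]^{4}} \right\} \tag{54-15}$$

$$\frac{d^{2}Y}{dX^{2}} = \frac{16}{\pi} \times \left\{ \frac{12\sigma(\mu - X)^{2} - \sigma^{3}}{\left[\sigma^{2} + 4(\mu - X)^{2}\right]^{3}} \right\} \tag{54-16}$$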
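A closed form of this length is easy to get wrong, so a numerical cross-check is worthwhile. The sketch below compares the expression in equation 54-16 with a central second difference taken at a very small step on the Lorentzian band of equation 54-1b. The function names, the parameter values and the step size are our own choices for illustration.

```python
import math

def lorentzian(x, mu, sigma):
    """Constant-area Lorentzian band, equation 54-1b (sigma = bandwidth)."""
    return (2.0 / (math.pi * sigma)) / (1.0 + (2.0 * (mu - x) / sigma) ** 2)

def lorentzian_second_derivative(x, mu, sigma):
    """Closed-form second derivative, equation 54-16."""
    d = mu - x
    numerator = 12.0 * sigma * d * d - sigma ** 3
    denominator = (sigma ** 2 + 4.0 * d * d) ** 3
    return (16.0 / math.pi) * numerator / denominator

def central_second_difference(f, x, h=0.01):
    """Numerical second derivative from three closely spaced points."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

mu, sigma = 100.0, 20.0   # illustrative values only
checks = []
for x in (70.0, 95.0, 100.0, 130.0):
    numeric = central_second_difference(lambda t: lorentzian(t, mu, sigma), x)
    analytic = lorentzian_second_derivative(x, mu, sigma)
    checks.append((x, numeric, analytic))
    print(x, numeric, analytic)
```

The two columns should agree closely at every wavelength tried. At the band center the closed form reduces to -16/(πσ³).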

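Going back to equations 54-6 and 54-11, how do the magnitudes of the derivatives change with σ? Since the maximum first derivative occurs when X − μ = σ, let us substitute σ for X − μ in equation 54-6a; for the Normal distribution we get:

$$\left.\frac{dY}{dX}\right|_{MAX} = \frac{-\sigma}{\sigma^{3}(2\pi)^{1/2}}\, e^{-\frac{1}{2}\left(\frac{\sigma}{\sigma}\right)^{2}} = \frac{-e^{-1/2}}{\sigma^{2}(2\pi)^{1/2}} \tag{54-17}$$

and, substituting σ for μ − X in equation 54-11, for the Lorentzian distribution:

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{8\sigma^{2}(\sigma)}{\left[\sigma^{2} + 4(\sigma)^{2}\right]^{2}} = \frac{2}{\pi\sigma} \times \frac{8\sigma^{3}}{(5\sigma^{2})^{2}} = \frac{16}{25\pi\sigma^{2}} \tag{54-18}$$

(For the Lorentzian band the steepest slope actually lies at |X − μ| = σ/(2√3), but evaluating the derivative at any fixed multiple of σ shows the same dependence on the bandwidth.)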

For the Normal distribution, the exponential term has become a constant, and we see that the maximum magnitude of the derivative is inversely proportional to σ² (for the constant area expression) or inversely as σ (for the constant height expression). This confirms our observation from Figure 54-1. For the Lorentzian distribution, we see that the derivative decreases with the second power of the bandwidth. Similarly, the maximum second derivative occurs when X = μ, so inserting this equality into equation 54-9a for the Normal distribution gives us:

$$\left.\frac{d^{2}Y}{dX^{2}}\right|_{MAX} = \left[\frac{(\mu-\mu)^{2}}{\sigma^{5}(2\pi)^{1/2}} - \frac{1}{\sigma^{3}(2\pi)^{1/2}}\right] e^{-\frac{1}{2}\left(\frac{\mu-\mu}{\sigma}\right)^{2}} = \left[0 - \frac{1}{\sigma^{3}(2\pi)^{1/2}}\right] e^{0} = \frac{-1}{\sigma^{3}(2\pi)^{1/2}} \tag{54-19}$$

And substituting X − μ = 0 into equation 54-16 gives us the corresponding value for the Lorentzian distribution:

$$\left.\frac{d^{2}Y}{dX^{2}}\right|_{MAX} = \frac{16}{\pi} \times \frac{12\sigma(0)^{2} - \sigma^{3}}{\left[\sigma^{2} + 4(0)^{2}\right]^{3}} = \frac{16}{\pi} \times \frac{-\sigma^{3}}{\sigma^{6}} = \frac{-16}{\pi\sigma^{3}} \tag{54-20}$$

The negative sign in equations 54-19 and 54-20 reflects the fact that the maximum second derivative is a negative value, which also agrees with Figure 54-1. These equations also tell us that the magnitude of the second derivative decreases inversely as the cube of σ for both the Normal and the Lorentzian band shapes (in their constant-area forms), that is, as the bandwidth of the absorbance band increases. This explains why the derivatives of the broad absorbance band decrease with respect to the narrow absorbance band as we see in Figure 54-1, and more so as the derivative order increases.
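The dependence on bandwidth can be illustrated by evaluating the derivative expressions for two bands whose bandwidths differ by a factor of three, like the 20 nm and 60 nm bands of Figure 54-1. This is a sketch only: it uses the constant-area forms (equations 54-6a, 54-9a, 54-11 and 54-16), evaluates the first derivatives at X − μ = σ as in the text, and the function names are ours.

```python
import math

ROOT_2PI = math.sqrt(2.0 * math.pi)

def normal_d1(x, mu, s):
    """Equation 54-6a."""
    z = (x - mu) / s
    return -(x - mu) / (s ** 3 * ROOT_2PI) * math.exp(-0.5 * z * z)

def normal_d2(x, mu, s):
    """Equation 54-9a."""
    z = (x - mu) / s
    return ((x - mu) ** 2 / s ** 5 - 1.0 / s ** 3) / ROOT_2PI * math.exp(-0.5 * z * z)

def lorentz_d1(x, mu, s):
    """Equation 54-11."""
    d = mu - x
    return (2.0 / (math.pi * s)) * 8.0 * s * s * d / (s * s + 4.0 * d * d) ** 2

def lorentz_d2(x, mu, s):
    """Equation 54-16."""
    d = mu - x
    return (16.0 / math.pi) * (12.0 * s * d * d - s ** 3) / (s * s + 4.0 * d * d) ** 3

mu = 0.0
narrow, broad = 20.0, 60.0   # bandwidths in nm, a ratio of 3

ratios = {
    "normal first": normal_d1(mu + narrow, mu, narrow) / normal_d1(mu + broad, mu, broad),
    "normal second": normal_d2(mu, mu, narrow) / normal_d2(mu, mu, broad),
    "lorentzian first": lorentz_d1(mu + narrow, mu, narrow) / lorentz_d1(mu + broad, mu, broad),
    "lorentzian second": lorentz_d2(mu, mu, narrow) / lorentz_d2(mu, mu, broad),
}
for name, value in ratios.items():
    print(name, round(value, 6))
```

With these expressions the narrow-to-broad ratio comes out as 3² = 9 for the first derivatives and 3³ = 27 for the second derivatives, for both band shapes.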

THE BEHAVIOR OF COMPUTED DERIVATIVES

Now, equations 54-6 and 54-9 are mathematically exact. But we observed when discussing Figure 54-1 that a representation of a derivative based on finite differences is only an approximation. How good is this approximation, and how quickly does it get bad? That depends somewhat on how the derivative is calculated. We made a point of noting that the derivative in Figure 54-1 was calculated from synthetic data, with abscissa (wavelength) spacing of 1 nm. This value of spacing was chosen so that the two methods of calculation would default to the same result. We noted above that the definition of a derivative includes the operation of division by ΔX (or by dX, in the mathematically exact case). Some computer programs that purport to calculate derivatives do not include the step of performing that division, while others do. The results will vary considerably in the two cases.

We will begin our discussion by considering the simpler case, where we do not divide by ΔX. This provides the numerator term for the derivative definition, and also for the approximation; this allows us to examine the behavior of that term in isolation. In some cases, this is all that is used or needed: it provides a qualitative observation of the overall shape of the spectrum that is of interest, for example. Sometimes it is done this way when the data is used for quantitative or qualitative analysis, and the spectral data from the “unknown” samples (the samples which are to be analyzed on a routine basis) are treated the same way as the calibration data. Indeed, since the numerator term differs from the correct derivative approximation only by a scaling factor, it can be difficult to tell just from looking at the derivative curve whether it is a correctly calculated derivative or not, especially if the scale is not present.

On the other hand, computing only the numerator term is not recommended when results are to be compared between different instruments or laboratories. It is also not recommended when theoretical studies are of interest, or when the results of experiments are to be compared to theoretical expectations, since it does not, in general, reflect the actual value of the true derivative. Given the minor computational burden, however, the proper computation, including the division, should always be done. Here we start with the examination of the numerator term alone for its pedagogical value.

The question arises: since the definition of the derivative specifies taking a limit as differences approach zero, would not the best results be obtained from using the smallest possible differences? The answer is “yes, but ...”. The “but” reflects the fact that while synthetic data is noise-free, real data contains noise. In this chapter we consider only the noise-free synthetic data we create, but it is clear that with real data, containing real and irreducible noise, computing smaller and smaller differences will eventually bring us to the point where the differences equal and then become less than the noise level. Derivative calculations are indeed known to be fraught with noise problems. In the interest of examining the behavior of the derivative, however, we are going to ignore the effect of the noise in this chapter, although we will eventually return to that question.

One way to minimize noise effects is to exaggerate the differences, by computing finite differences at larger and larger wavelength intervals, and this is often done in practice. Figure 54-2 illustrates an example of this. In Figure 54-2 we present the results of computing finite difference approximations to a derivative (for the Normal case), using different spacings (i.e., the wavelength difference between the data points we compute the finite difference between; we will sometimes call this ΔX and freely intermix the two terms). For the derivatives in Figure 54-2, the underlying absorbance curve is the narrower one from Figure 54-1, having a 20 nm bandwidth.

We see from Figure 54-2 that, in contrast to the mathematically ideal behavior of a true derivative, the behavior of a finite difference depends on how it is calculated. As Figure 54-2a shows, at small spacings, the shape of the computed difference curve closely mimics the true derivative, and has a magnitude that is proportional to the spacing. Figure 54-2b shows that as the spacing increases, several changes occur.

1) The relationship between the difference spacing and the magnitude of the derivative departs from the degree of proportionality we observe at smaller spacings. As the spacing increases, the maximum value of the computed difference asymptotically approaches the value of unity.
2) There is a shift in the wavelength corresponding to the maximum value of the derivative.
3) Close examination of Figure 54-2b will reveal a decrease in the slope of the difference curve at the point it crosses the X-axis, even though we are not using the denominator term of the derivative calculation.

Figure 54-2c shows that at sufficiently large spacing values, the concept of this being a derivative breaks down entirely. The derivative curve has separated into two features, each of them appearing to be a Normal curve, although one of them is negative. As the spacing continues to increase, the two features move further and further apart.

Figure 54-2 First differences calculated using different spacings between the data points used to calculate the finite difference for the numerator term only, as an approximation to the derivative. The underlying curve is the 20 nm bandwidth absorbance band in Figure 54-1, with data points every nm. Figure 54-2a: Difference spacings = 1–5 nm; Figure 54-2b: Spacings = 5–40 nm; Figure 54-2c: Spacings = 40–90 nm. Each panel plots the first difference against wavelength. (see Colour Plate 19)

Figure 54-3 Showing a “derivative” computed over a very large spacing explains how the difference approximation to the derivative breaks down. With a large spacing, one point used for the difference is on the baseline, while the other traces over the shape of the curve. (Plot of absorbance against wavelength, with the differences labeled d1 to d4.)
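The way the numerator term grows and then saturates can be reproduced in a few lines. The sketch below is not the calculation behind Figure 54-2. It assumes a constant-height Normal band (equation 54-2) centered at 100 nm with the 20 nm bandwidth used as σ, and it searches a 1 nm grid; these choices and the function names are ours.

```python
import math

MU, SIGMA = 100.0, 20.0   # assumed band center and bandwidth (as sigma), nm

def band(x):
    """Constant-height Normal absorbance band, equation 54-2."""
    return math.exp(-0.5 * ((x - MU) / SIGMA) ** 2)

def max_first_difference(spacing, lo=-200, hi=400):
    """Largest value of Y(x + spacing) - Y(x) on a 1 nm grid.

    Numerator term only: there is no division by the spacing.
    """
    return max(band(x + spacing) - band(x) for x in range(lo, hi + 1))

results = {s: max_first_difference(s) for s in (1, 2, 5, 40, 90, 200)}
for spacing, value in results.items():
    print(spacing, round(value, 4))
```

Under these assumptions doubling a small spacing very nearly doubles the maximum difference, while at spacings much larger than the bandwidth the maximum levels off just below unity, because one of the two points then sits on the baseline while the other reaches the top of the band.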

Figure 54-3 shows how this occurs. When the spacing is very wide, that is, wider than the breadth of the absorbance band near the baseline, one of the points used to compute the difference is always on the baseline, while the other point “rides” over the peak and traces its shape. As the point of the “derivative” slides along the X-axis, eventually the two points exchange roles, and the other feature is traced out, but with the opposite sign.

Now we look at the second derivative similarly. Some of this has been presented previously in the literature [9, 10], although in less detail than we do here. Figures 54-4a to 54-4c present second derivatives calculated using the same spacings as for the differences in Figure 54-2. In Figure 54-4 we see that the second derivative is subject to some of the same effects as the first derivative:

• Linear (proportional) change in amplitude at small spacings
• Nonlinear change in amplitude at large spacings

On the other hand, there is no shift in the wavelength of the central maximum, although Figures 54-4b and 54-4c show that the artifact peaks do change their wavelength. Replacing the shift in wavelength, however, is a broadening of the central peak.

Figure 54-4 Second differences calculated using different spacings between the data points used to calculate the finite difference for the numerator term only, as an approximation to the derivative. The underlying curve is the 20 nm bandwidth absorbance band in Figure 54-1, with data points every nm. Figure 54-4a: Difference spacings = 1–5 nm; Figure 54-4b: Spacings = 5–40 nm; Figure 54-4c: Spacings = 40–90 nm. Each panel plots the second difference against wavelength. (see Colour Plate 20)

We noted above that one characteristic of the second derivative is the narrowing of this peak compared to the underlying absorbance band. As the spacing over which the derivative is computed increases, however, this resolution enhancement effect decreases and eventually disappears. The reason is similar to that for the first derivative, as shown in Figure 54-3; at very large spacings the points used to compute the derivative eventually wind up simply tracing over the underlying absorbance band, with the result that, since second derivatives are essentially computed from three points, three copies of the underlying absorbance band are produced, albeit with different signs.

In Figure 54-5 we show the variation of the computed derivatives as determined by the spacing used in the computation. Another feature that can be seen in Figure 54-5, which is also observable in Figure 54-4 albeit with some difficulty, is that at small spacing the maximum second derivative value is not simply proportional to the spacing but changes faster than proportionately to the spacing; the overall curve of calculated derivative value versus spacing is sigmoidal.

We continue in our next chapter by examining the behavior of the derivative calculation when the ΔY term is divided by the ΔX term, to form an approximation to the true derivative.

Figure 54-5 Maximum computed derivative magnitude determined by the spacing of the points used in the computation. Note that the sign of the second derivative has been reversed to simplify comparison with the first derivative behavior. (Plot of derivative value against spacing for the first and second derivatives.)
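The sigmoidal behavior of the second-difference numerator can be seen by evaluating it at the band center for a range of spacings. As before this is only a sketch under our own assumptions: a constant-height Normal band with the 20 nm bandwidth taken as σ, and “spacing” taken to mean the distance between adjacent points of the three used.

```python
import math

MU, SIGMA = 100.0, 20.0   # assumed band center and bandwidth (as sigma), nm

def band(x):
    """Constant-height Normal absorbance band, equation 54-2."""
    return math.exp(-0.5 * ((x - MU) / SIGMA) ** 2)

def second_difference_at_center(spacing):
    """Numerator-only second difference centered on the peak."""
    return band(MU - spacing) - 2.0 * band(MU) + band(MU + spacing)

spacings = list(range(1, 91))
magnitudes = [-second_difference_at_center(s) for s in spacings]  # sign reversed

for s in (1, 2, 5, 20, 40, 90):
    print(s, round(magnitudes[s - 1], 5))
```

Under these assumptions the magnitude at a spacing of 2 nm is very nearly four times that at 1 nm (growth as the square of the spacing), while at large spacings it levels off at 2, the weight of the central point, since the two outer points then lie on the baseline.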

REFERENCES

1. Singleton, F. and Collier, G.L., Britain 760, 729 (1953).
2. Singleton, F. and Collier, G.F., (London), 1519 (1955).
3. Giese, A.T. and French, C.S., 9, 78 (1955).
4. Low, M.J.D. and Mark, H., 241, 129–130 (1970).
5. Hopkins, D., NIR News 12(3), 3–5 (2001).
6. Hopkins, D., Near Infrared Analysis 2(1–13), (2001).
7. Mark, H. and Workman, J., Spectroscopy 2(9), 37–43 (1987).
8. Ingle, J.D. and Crouch, S.R., Spectrochemical Analysis (Prentice-Hall, Upper Saddle River, NJ, 1988).
9. Ritchie, G.E. and Mark, H., NIR News 13(1), 4–6 (2002).
10. Ritchie, G.E. and Mark, H., NIR News 13(2), 3–5 (2002).

55

Derivatives in Spectroscopy: Part 2 – The “True” Derivative

We continue where we left off in Chapter 54 [1], and we start with some discussion regarding the observations we made concerning the change in the magnitude of the computed values of the derivatives (first and second) as the wavelength spacing over which they are computed is changed. As we normally do when continuing a subseries, we continue the equation numbering and figure numbering from where we left off in the previous chapter.

To recap, we noted that at small spacings, the numerator of the computed approximation to the derivative was a close approximation to the shape of the true derivative, and the magnitude increased as the spacing increased, linearly for the first derivative and faster than linearly for the second derivative. In fairly short order, however, in both cases the rate of increase of the derivative magnitude started falling off as the spacing continued to increase. The falloff in the rate of increase was accompanied by some secondary effects: wavelength shifts of the peak derivative value, and various kinds of distortion of the shape of the derivative. At very large spacings (larger than the bandwidth of the peak) the “derivative” was replaced by what was essentially a tracing of the shape of the underlying peak, a double tracing for the first derivative and triple for the second derivative.

At small spacing values, however, it now becomes clear why increasing the spacing is desirable. Since in real data the noise of the measured spectrum is constant (because the underlying spectrum from which the various derivative approximations are calculated is the same spectrum each time), increasing the spacing of the derivative computation increases the “signal” part of the signal-to-noise (S/N) ratio, thereby improving the S/N ratio. As we saw, however, too-large spacings were deleterious, both for distorting the shape of the peak and for producing inaccurate approximations to the derivative numerator.

So now the question arises, what is the nature of the way the magnitude increases with spacing? In Figure 55-6a we show an expanded view of the region of the first derivative of the Normal curve, around the region of the maximum of the underlying (Normal) absorbance band. Similarly, Figure 55-6b shows the corresponding view for the second derivative. The first derivative is well-approximated by a straight line in this region. The second derivative is seen to be approximated by a parabola; a not unexpected result when considering that this represents the result obtained from a truncated Taylor series approximation of the curve. Therefore, for a first derivative, as the ΔX spacing increases, the magnitude of the calculated “derivative” increases proportionately. In the case of the second derivative, increasing the magnitude of the ΔX spacing causes the magnitude of the calculated derivative to increase as the square of the spacing; this is the source of the initial upward curvature we noted in Figure 54-5 (reference [1]) for the second derivative.
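The Taylor-series argument can be written out explicitly. The expansions below are a sketch that assumes the differences are taken symmetrically about the point X, with ΔX the distance between the two points of the first difference and between adjacent points of the second difference; that convention is our assumption for illustration.

$$Y\left(X + \tfrac{\Delta X}{2}\right) - Y\left(X - \tfrac{\Delta X}{2}\right) = Y'(X)\,\Delta X + \frac{Y'''(X)}{24}\,\Delta X^{3} + \cdots$$

$$Y(X + \Delta X) - 2Y(X) + Y(X - \Delta X) = Y''(X)\,\Delta X^{2} + \frac{Y''''(X)}{12}\,\Delta X^{4} + \cdots$$

Truncating each series after its leading term gives a numerator proportional to ΔX for the first difference and to ΔX² for the second. The neglected higher-order terms are what eventually bend both curves away from these simple laws as the spacing grows.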

Figure 55-6 Expansions of the first and second derivative curves. Figure 55-6a: The region around the zero-crossing of the first derivative can be approximated with a straight line (first difference plotted against wavelength). Figure 55-6b: The region around the peak of the second derivative can be approximated with a parabola (response plotted against wavelength, showing the second derivative and the parabola).

BETTER DERIVATIVE APPROXIMATIONS

Now we will examine the behavior of the derivative approximation when both the numerator and the denominator terms are used. In Figure 55-7, we present the curves of this computation of the derivative corresponding to the numerator-only computation presented in Figure 54-2 of Chapter 54 [1]. Here we note several differences between Figure 55-7 and Figure 54-2. In Figure 55-7a we see that there is virtually no difference between any of the five curves; they are all producing essentially the same values, in contrast to Figure 54-2, in which the differences were increasing with spacing. The reason is that for the range of spacings used, all the derivative approximations calculated are reasonably good approximations to the true derivative. Therefore, since they all estimate the same true value, they are all essentially equal to each other.

Figure 55-7 First derivatives calculated using different spacings for the finite difference approximation to the true derivative. The underlying curve is the 20 nm bandwidth absorbance band in Figure 54-1, with data points every nm. Figure 55-7a: Difference spacings = 1–5 nm; Figure 55-7b: Spacings = 5–40 nm; Figure 55-7c: Spacings = 40–90 nm. Each panel plots the first difference against wavelength. (see Color Plate 21)

In Figure 55-7b, we notice even more differences from the corresponding part of Figure 54-2. The first thing we notice is one characteristic that is the same: the wavelengths of the maximum values of the derivative curves shift as the spacing increases. But another difference, one that is at least as prominent, is that the maximum value decreases as the spacing increases; this is exactly opposite to the behavior we noticed in the numerator term, where the maximum increased with the spacing. A third difference we notice from the corresponding part of Figure 54-2 is that at the point where the first derivative crosses the X-axis, the slope of the derivative also decreases with increasing spacing, while in Figure 54-2b the slope increased with the spacing except for the largest values of spacing included in that plot. Similarly, in Figure 55-7c both the maximum value of the derivative and the slope at the zero-crossing decrease, whereas in Figure 54-2c the maximum of the calculated derivative remained constant, although the slope at the zero-crossing decreased.

In the three parts of Figure 55-8, we see that the second derivative behaves similarly, except that it starts out smaller than the first derivative does, by almost an order of magnitude. Figure 55-9 confirms this: the second derivative is smaller than the first (remember, all this is for the Normal distribution; other distributions may behave differently). Figure 55-9 also shows how the correct computation of the derivative differs from the computation of the numerator only, which we saw in Chapter 54 (reference [1]). The “derivative” computed from the numerator term only increased and then leveled off as the spacing increased, whereas Figure 55-9 shows that the correct computation starts out with an (almost) constant value of the derivative, which then decreases, with an asymptotic approach to zero.

Can we explain all these effects? Of course we can, and in fact the explanation is almost obvious. When spacings are small, the computed derivative is a good approximation to the true derivative. As long as this is the case, the exact value of ΔX used to compute the derivative is unimportant, because as we saw in Figure 54-5, the first difference ΔY increases almost linearly with ΔX; therefore all values of ΔX give the same result for the computation, because ΔY/ΔX is constant regardless of spacing.

As we observed from Figure 54-5, however, as ΔX continues to increase, ΔY no longer increases proportionately. Strictly speaking, this happens immediately when ΔX becomes finite, and the question of whether the amount is noticeable is a matter of degree: how much difference it makes in a particular application. Nevertheless, whatever point that is, the initial increase in ΔX carries a corresponding increase in ΔY, and beyond that point it is no longer proportional. At that point, the computed value of the estimate of the true derivative starts to decrease.
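The contrast between the numerator alone and the full quotient ΔY/ΔX can be made concrete with a short calculation. This sketch makes the same assumptions as before and is not the authors' program: a constant-height Normal band centered at 100 nm, the 20 nm bandwidth used as σ, a 1 nm grid, and function names of our own choosing.

```python
import math

MU, SIGMA = 100.0, 20.0   # assumed band center and bandwidth (as sigma), nm

def band(x):
    """Constant-height Normal absorbance band, equation 54-2."""
    return math.exp(-0.5 * ((x - MU) / SIGMA) ** 2)

def max_numerator(spacing, lo=-200, hi=400):
    """Largest first difference Y(x + spacing) - Y(x) on a 1 nm grid."""
    return max(band(x + spacing) - band(x) for x in range(lo, hi + 1))

def max_quotient(spacing):
    """Largest value of the full approximation, delta Y / delta X."""
    return max_numerator(spacing) / spacing

quotients = {s: max_quotient(s) for s in (1, 5, 40, 90)}
true_maximum = math.exp(-0.5) / SIGMA   # from equation 54-6b at X = MU - SIGMA

for spacing, value in quotients.items():
    print(spacing, round(value, 5))
print("true derivative maximum", round(true_maximum, 5))
```

Under these assumptions the quotient stays within about one percent of the true maximum for spacings up to 5 nm, then falls. At 90 nm the numerator has saturated near unity, so the quotient is close to 1/90.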

Figure 55-8 Second derivatives calculated using different spacings for the finite difference approximation to the true derivative. The underlying curve is the 20 nm bandwidth absorbance band in Figure 54-1, with data points every nm. Figure 55-8a: Difference spacings = 1–5 nm; Figure 55-8b: Spacings = 5–40 nm; Figure 55-8c: Spacings = 40–90 nm. Each panel plots the second difference against wavelength. (see Color Plate 22)

Figure 55-9 Maximum magnitudes of first and second derivative approximations as the spacing is varied. (Plot of derivative magnitude against spacing for the first and second derivatives.)

Furthermore, as we also noted last time, at sufficiently large spacings (ΔX) the numerator term ceased to increase. As we noted before, at this point the various points used for the computation are each individually tracing out the shape of the underlying curve. However, as ΔX in the denominator continues to increase, we can expect that the quotient, ΔY/ΔX, will decrease, and this is the behavior we observe.

One final point to note: we see from Figure 55-9 that, as we noted before, the true value of the second derivative of a Normal curve (at its maximum) is roughly an order of magnitude smaller than the first derivative (or at least, the largest value of the first derivative). In the presence of noise, therefore, the S/N ratio will be degraded by this factor, from this cause alone.

We also have noted before that adding or subtracting noisy data causes the variance to increase as the number of data points added together [2]. The noise of the first derivative, therefore, will be larger than that of the underlying absorbance band by a factor of the square root of two. We also showed previously that if a random variable (i.e., a measurement contaminated with noise) is multiplied by a constant (c, say), then the variance of the product is increased by a factor of c² [3]. A second derivative calculation is equivalent to using coefficients 1, −2, 1 to multiply three data points spaced at the desired ΔX-spacing. The variance of the spectrum, then, is multiplied by 1² + 2² + 1² = 6. Therefore the standard deviation of the noise contribution to a first derivative is √2 times the noise of the spectrum, while the noise contribution to the second derivative is √6 times the noise of the spectrum. Therefore the noise of the second derivative is √3, or roughly 1.7, times that of the first derivative.

So from both aspects, the S/N ratio of the second derivative is worse than that of the first. The increase of the noise is clearly the lesser of the contributions, compared to the full order of magnitude reduction of the “signal” part of the S/N ratio. Second derivatives have become de rigueur as a data treatment of choice for spectral data, and there are reasons for that, which we have discussed. But they also carry with them the burden of a severely reduced S/N ratio compared to first derivatives. When selecting a data treatment, therefore, one should know the disadvantages as well as the benefits of each one. While derivative treatments have been in long use for analysis of spectroscopic data, the quantitative study of the derivative transform has not previously been widely disseminated, but is worth having. There may be times when a second derivative transform is not giving adequate results, and in some of those cases, using a first derivative transform may be preferable.
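The noise multipliers quoted above follow directly from the difference coefficients, and they can also be checked by simulation. The sketch below assumes that the noise at different wavelengths is independent and of equal variance, which is the assumption implicit in the argument; the function name, the seed and the sample size are ours.

```python
import math
import random
import statistics

def noise_gain(coefficients):
    """Standard-deviation multiplier for a weighted sum of data points
    carrying independent noise of equal variance."""
    return math.sqrt(sum(c * c for c in coefficients))

first_gain = noise_gain([-1, 1])       # first difference
second_gain = noise_gain([1, -2, 1])   # second difference

# Monte Carlo check on pure noise, seeded so that the run is repeatable.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(20000)]
spacing = 5   # for independent noise the gain does not depend on the spacing

d1 = [noise[i + spacing] - noise[i] for i in range(len(noise) - spacing)]
d2 = [noise[i + 2 * spacing] - 2.0 * noise[i + spacing] + noise[i]
      for i in range(len(noise) - 2 * spacing)]

sd = statistics.pstdev(noise)
simulated_first = statistics.pstdev(d1) / sd
simulated_second = statistics.pstdev(d2) / sd

print("first difference :", first_gain, simulated_first)
print("second difference:", second_gain, simulated_second)
print("ratio            :", second_gain / first_gain)
```

The analytic gains are √2 and √6, so their ratio is √3; the simulated values should scatter closely around the same numbers.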

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 18(4), 32–37 (2003).
2. Workman, J. and Mark, H., Spectroscopy 3(3), 40–42 (1988).
3. Mark, H. and Workman, J., Spectroscopy 3(8), 13–15 (1988).


56

Derivatives in Spectroscopy: Part 3 – Computing the Derivative

In Chapters 54 and 55 [1, 2], we discussed the theoretical aspects of using derivatives in the analysis of spectroscopic data. Here we consider some of the practical aspects. The first one we will consider is, in the presence of some arbitrary but more-or-less (presumably) constant amount of noise, what is the optimum spacing of data at which to compute a difference to give the highest signal-to-noise ratio (S/N)? In the face of constant noise, this obviously reduces to the question: what is the spacing (for a Normal distribution) that gives the largest value for the numerator term? Note that the criterion for “best” has changed from our previous discussions, where “best” was considered to be the closest approximation to the true derivative. We have noted that the largest value of the true first derivative occurs when X − = . Therefore the largest differences between two points will occur when they are varied from + (or − ) by some amount , the spacing, which we need to determine. Therefore we need to determine the largest difference of

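$$
e^{-\frac{1}{2}\left(\frac{(\mu + \sigma - \Delta X/2) - \mu}{\sigma}\right)^{2}} - e^{-\frac{1}{2}\left(\frac{(\mu + \sigma + \Delta X/2) - \mu}{\sigma}\right)^{2}} \tag{56-21}
$$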

The first question we need to ask is whether there is, in fact, a maximum value. That there is can be seen by noting that the Normal absorbance band approaches zero as X approaches infinity in both directions. Therefore if $\Delta X \to \infty$ the difference will approach zero. At small values of $\Delta X$ the difference will be finite, while as $\Delta X \to 0$ the difference will again approach zero; therefore there must be a maximum somewhere between 0 and $\infty$. To get some idea of where that maximum is, in Figure 56-10 we show a plot of the difference as a function of $\Delta X$, for the Normal absorbance band of 20 nm bandwidth we have shown in Figure 54-1. For a more precise result we must solve equation 56-21 for its maximum, but since the resulting equation is transcendental, we must solve it by successive approximations. The result of doing so is $\Delta X_{max}$ = 34.28 nm. Since the bandwidth of the underlying absorbance band is 20 nm, the spacing needed for maximizing the first derivative S/N for any Normal absorbance band is therefore 34.28/20 = 1.714 times the bandwidth. However, this analysis is based on considering a single peak in isolation; as we will see for the second derivative, at some point it becomes necessary to take into account the presence and nature of whatever other materials exist in the sample.

Figure 56-10 The difference between the ordinates of two points equally spaced around $\mu + \sigma$ as a function of the spacing. In this figure the underlying absorbance curve has a bandwidth of 20 nm. [Plot of Difference (0 to 0.9) against Spacing (0 to 100).]

The second derivative is both simpler and more complicated to deal with. As we saw, the second derivative is maximum at the wavelength of the peak of the underlying absorbance curve, and we noted previously that the numerator term at that point increases monotonically with the spacing (see Figures 54-4 and 54-5 in [1]). Therefore we expect the S/N of the second derivative to improve continually as the spacing becomes larger and larger.

While the “signal” part of the second derivative increases with the spacing used, the noise of the computed second derivative is independent of the spacing. It is, however, larger than the noise of the underlying spectrum. As we have shown [3], from elementary statistical considerations multiplying a random variable X by a constant A causes the variance of the product AX to be multiplied by $A^2$ compared to the variance of X itself. Now, regardless of the spacing of the terms used to compute the second derivative, the operative multipliers for the data at the three wavelengths used are 1, −2, 1. Therefore the multiplier for the variance of the derivative is $1^2 + 2^2 + 1^2 = 6$, and the standard deviation of the derivative is therefore $\sqrt{6}$ times the standard deviation of the spectrum, but nevertheless independent of the derivative spacing. The signal-to-noise ratio of the second derivative is therefore determined solely by the magnitude of the computed numerator value, which as we have seen, increases with spacing.

In real samples, however, the wider the spacing the more likely it becomes that one of the points used for the derivative computation will be affected by the presence of other constituents in the sample, and the question of the optimum spacing for the derivative computation becomes dependent on the nature of the sample in which it is contained.
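The behavior of the second derivative described above, a numerator that keeps growing with the spacing while the noise multiplier stays fixed, can be illustrated numerically. This is a minimal sketch for an isolated Normal band; the band parameter `sigma`, the range of spacings and the function names are hypothetical choices made only for the illustration.

```python
import numpy as np

def normal_band(x, mu=0.0, sigma=1.0):
    """Normal (Gaussian) absorbance band in the constant-area form."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def second_difference_at_peak(spacing, mu=0.0, sigma=1.0):
    """Numerator of the three-point second derivative (coefficients 1, -2, 1)
    centred on the band maximum, for a given spacing between points."""
    return (normal_band(mu - spacing, mu, sigma)
            - 2.0 * normal_band(mu, mu, sigma)
            + normal_band(mu + spacing, mu, sigma))

sigma = 10.0                                  # hypothetical band parameter, nm
spacings = np.arange(1.0, 61.0, 1.0)          # spacings of 1 to 60 nm
signal = np.abs(second_difference_at_peak(spacings, sigma=sigma))
noise_factor = np.sqrt(1 ** 2 + 2 ** 2 + 1 ** 2)   # sqrt(6) at every spacing
relative_sn = signal / noise_factor           # S/N relative to unit spectral noise
```

For this isolated band `signal`, and therefore `relative_sn`, rises with every increase in spacing. The sketch says nothing about interfering constituents, which is the consideration that limits the usable spacing in real samples.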

METHODS OF COMPUTING THE DERIVATIVE

The method we have used until now for estimating the derivative, simply calculating the difference between absorbance values of two data points spaced some distance apart (and dividing by that $\Delta X$, of course), is probably the simplest method available. As we discussed in our previous chapter [4], however, there is a disadvantage associated with this method. This method causes a decrease in the S/N as compared with the underlying absorbance band, and this decrease has two sources. The lesser source is the increase in the noise level due to the addition of variances that occurs when numbers are added or subtracted. The far larger effect is that due to the fact that the derivatives are much smaller than the absorbance, and the second derivative is much smaller (by an order of magnitude) than the first. The net result is that, the closer the theoretical approximation to the true derivative is, the noisier the actual computed derivative becomes.

Several methods have been devised to circumvent this characteristic of the process of taking derivatives. One of the very common methods is to reduce the initial noise of the spectrum by computing averages: averaging the spectral data over some number of wavelengths before estimating the derivative by calculating the difference between the resulting averages. This process is sometimes called “smoothing” since it smooths out the noise of the spectrum. However, since we are not discussing smoothing, we will not consider this any further here.

The next common method of computing derivatives is the use of Savitzky–Golay convolution functions. The application to spectroscopy is based on what is one of the most often-cited papers in the literature [5]. This classic paper presents the concept underlying this method for computing derivatives (including the zero-order derivative, which reduces to what is basically a smoothing operation); Figure 56-11a shows this diagrammatically. The assumption is that the mathematical nature of the underlying spectral curve is unknown, but can be represented over some finite region by a polynomial; “polynomial” is used here in the general sense, and includes straight lines. If the equation for the polynomial is known, then the derivative of the spectrum can be calculated from the properties of the fitted polynomial.
The key to all this is the fact that the nature of the polynomial can be calculated from the spectral data, by doing a least-squares fit of the polynomial to the data in the region of interest, as shown in Figure 56-11b. Figure 56-11a shows that various polynomials may be used to approximate the derivative curve at the point of interest, and Figure 56-11b shows that when the derivative curve is based on data that has error, the polynomials can be computed using a least-squares fit to the data. At the point for which the derivative is computed, all three lines in Figure 56-11 are tangent to each other.

The Savitzky–Golay approach provides for the use of varying numbers of data points in the computation of the fitting polynomial. We will discuss the effect of changing the number of data points shortly. The steps that Savitzky and Golay took to create their classic paper were as follows:

1) Fit a polynomial curve of the desired type (degree) to the data, using least-squares curve fitting.
2) Compute the desired order of derivative of that polynomial.
3) Evaluate the expression for the derivative of that polynomial at the point for which the derivative is to be computed. In the Savitzky–Golay paper, this is the central point of the set used to fit the data. As we shall see, in general this need not be the case, although doing so simplifies the formulas and computations.
4) Convert those formulas into a set of coefficients that can be used to multiply the data spectrum by, to produce the value of the derivative according to the specified polynomial fit, at the point of the center of the set of data.

As we shall see, however, their paper ignores some key points.

Figure 56-11 The Savitzky–Golay method of computing derivatives is based on a least-squares fit of a polynomial to the data of interest. In both parts of this figure the underlying second derivative curve is shown as the black line, while the linear (first degree) and quadratic (second degree) polynomials are shown as mauve and blue lines respectively. Figure 56-11a: here we show linear and quadratic fits to a Normal spectral curve. Figure 56-11b: an expansion of Figure 56-11a shows how the polynomials are determined using a least-squares fit to the actual data in the region where the derivative is computed, when the data is contaminated with noise. Red dots represent the actual data. (see Color Plate 23) [Both panels plot Response against an X-axis labeled Concentration; panel (a) is titled “Second derivative”.]

And finally, while this work was all of very important theoretical interest, Savitzky and Golay took one more step that turned the theory into a form that could be easily put to practical use.


5) For a good number of sets of derivative orders, fitting polynomials and numbers of data points, they calculated and printed in their paper tables of the coefficients needed for the cases considered. Thus the practicing chemist needed to be neither a heavy-duty theoretician nor more than a minimal computer programmer in order to make use of the results produced.

Unfortunately there are also several caveats that have to go along with the use of the Savitzky–Golay results. The most important and also the best-known caveat is that there are errors in the tables in their paper. This was pointed out by Steinier [6] in a paper that is invariably cited along with the original Savitzky–Golay paper, and which should be considered a “must read” along with the original paper by anyone taking an interest in the Savitzky–Golay approach to computation of derivatives.

The Savitzky–Golay coefficients provide a simplified form of computation for the derivative of the desired order at a single point. To produce a derivative spectrum the coefficients must be applied successively to sets of spectral data, each set offset from the previous one by a single wavelength increment. This is known as the convolution of the two functions. Having done that, the result of all the theoretical development and computation is that the derivative spectrum so produced is simultaneously based on a smoothed version of the spectrum. The amount of smoothing depends on the number of data points used to compute the least-squares fit of the polynomial to the data; use of more data points is equivalent to performing more smoothing. Using higher-degree polynomials as the fitting function, on the other hand, is equivalent to using less smoothing, since high-order polynomials can twist and turn more to follow the details of the data.
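The procedure described above (fit, differentiate, evaluate at the central point, then slide along by one data point) can be sketched directly with a least-squares polynomial fit in each window. This is a deliberately unoptimized illustration for equally spaced data, not the tabulated-coefficient form of the original paper; the function name, its parameters and the test data are our own assumptions.

```python
import numpy as np

def sg_derivative_by_fitting(y, window=5, degree=2, order=1, delta_x=1.0):
    """Derivative spectrum from a sliding least-squares polynomial fit.

    window  : odd number of data points used for each fit
    degree  : degree of the fitting polynomial
    order   : order of the derivative to evaluate at the window centre
    delta_x : spacing between adjacent data points, in X units
    """
    if window % 2 == 0 or window <= degree or order > degree:
        raise ValueError("need an odd window, window > degree, order <= degree")
    y = np.asarray(y, dtype=float)
    half = window // 2
    index = np.arange(-half, half + 1)        # e.g. -2..2 for a 5-point window
    result = np.full(y.shape, np.nan)         # end points are left undefined
    for centre in range(half, len(y) - half):
        segment = y[centre - half:centre + half + 1]
        poly = np.polyfit(index, segment, degree)    # step 1: least-squares fit
        deriv = np.polyder(poly, order)              # step 2: differentiate
        result[centre] = np.polyval(deriv, 0.0)      # step 3: evaluate at centre
    return result / delta_x ** order                 # express in true X units

# A noise-free quadratic is reproduced exactly by a quadratic fit.
x = np.arange(10.0)
d1 = sg_derivative_by_fitting(3 + 2 * x + 0.5 * x ** 2, window=5, degree=2, order=1)
```

In this sketch the first and last `window // 2` points of `d1` are left as NaN, since no complete window can be centred there. Refitting at every point is wasteful; the tabulated coefficients give the same values with one multiply-and-sum per point.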

LIMITATIONS OF THE SAVITZKY–GOLAY METHOD

The publication of the Savitzky–Golay paper (augmented by the Steinier paper) was a major breakthrough in data analysis of chemical and spectroscopic data. Nevertheless, it does have some limitations, and some more caveats that need to be considered when using this approach.

One limitation is that the method as originally described is applicable only to computations using odd numbers of data points. This was implied earlier when we discussed the fact that a derivative (of any order) is computed at the central point (wavelength) of the set used.

Another limitation is that, also because of the computation being applicable to the central data point, there is an “end effect” to using the Savitzky–Golay approach: it does not provide for the computation of derivatives that are “too close” to the end of the spectrum. The reason is that at the end of the spectrum there is no spectral data to match up to the coefficients on one side or the other of the central point of the set of coefficients; therefore the computation at or near the ends of the spectrum cannot be performed.

Of course, an inherent limitation is the fact that only those combinations of parameters (derivative order, polynomial degree and number of data points) that are listed in the Savitzky–Golay/Steinier tables are available for use. While those cover what are likely to be the most common needs, anyone wanting to use a set of parameters beyond those supplied is out of luck.


A caveat to the use of the Savitzky–Golay tables is that, even after Steinier’s corrections, they apply only to a special case of data, and do not, in general, produce the correct value of the true derivative. The reason for this is similar to the problem we pointed out in our first chapter dealing with computation of derivatives [1]: applying the Savitzky–Golay coefficients to a set of spectral data is equivalent to assuming that the data is separated by unit X distance, and therefore is equivalent to computing only the numerator term of a finite difference computation, without taking into account the $\Delta X$ (spacing) to which the computed $\Delta Y$ corresponds. Therefore, in order to compute the Savitzky–Golay estimate of a true derivative, the value computed using the Savitzky–Golay coefficients must be divided by $(\Delta X)^n$, where n is the order of the derivative.

Another limitation is perhaps not so much a limitation as a strange characteristic, albeit one that can catch the unwary. To demonstrate, we consider the simplest S–G derivative function, that for the first derivative using a 5-point quadratic fitting function. The convolution coefficients (after including the normalization factor) are

−0.2, −0.1, 0, 0.1, 0.2

Suppose we compute a second derivative by applying this first derivative function twice? The effect is easily shown to be equivalent to applying the convolution coefficients:

0.04, 0.04, 0.01, −0.04, −0.1, −0.04, 0.01, 0.04, 0.04

a collection of nine coefficients that produces a second derivative, based on the S–G first derivative coefficients. However, this collection of convolution coefficients appears nowhere in the S–G tables.

The nine-point S–G second derivative with a Quadratic or Cubic polynomial fit has the coefficients:

0.0606, 0.0152, −0.0173, −0.0368, −0.0433, −0.0368, −0.0173, 0.0152, 0.0606

And the nine-point S–G second derivative with a Quartic or Quintic polynomial fit has the coefficients:

−0.8811, 2.5944, 1.0559, −1.4755, −2.5874, −1.4755, 1.0559, 2.5944, −0.8811

The original S–G paper [5] describes how to compute other S–G convolution coefficients from given ones; these other coefficients are also functions that follow the basic concepts of the S–G procedure: the derivative of a least-squares, best-fitting polynomial function. Since they do not produce the convolution coefficients we generated by applying the S–G first derivative coefficients twice, however, we are forced to the conclusion that even though the coefficients for the first derivative follow the S–G concepts, applying them twice (or multiple times) in succession does not produce a set of convolution coefficients that is part of the S–G collection of convolution functions. This seems to be generally true for the S–G convolution coefficients as a whole.
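The nine coefficients obtained by applying the five-point first derivative twice can be checked with a single self-convolution. The sketch below uses only the coefficient values quoted in the text; the variable names are ours.

```python
import numpy as np

# Five-point S-G first-derivative coefficients (quadratic fit), from the text
first_deriv = np.array([-0.2, -0.1, 0.0, 0.1, 0.2])

# Applying the same filter twice is equivalent to a single pass with the
# convolution of the coefficient set with itself (nine terms).
applied_twice = np.convolve(first_deriv, first_deriv)

# Nine-point S-G second derivative (quadratic/cubic fit), as quoted in the text
sg_nine_point = np.array([0.0606, 0.0152, -0.0173, -0.0368, -0.0433,
                          -0.0368, -0.0173, 0.0152, 0.0606])

# Both sets respond to the parabola y = x**2 (true second derivative 2)
# with a value close to 2, yet they are different filters.
index = np.arange(-4, 5)
response_twice = applied_twice @ index ** 2
response_table = sg_nine_point @ index ** 2
same_filter = np.allclose(applied_twice, sg_nine_point, atol=1e-3)
```

Both responses come out close to 2, so each set is a legitimate second-derivative estimator for unit spacing, but `same_filter` is false: the twice-applied set weights the data differently from the tabulated nine-point set.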


EXTENSIONS TO THE SAVITZKY–GOLAY METHOD

Several extensions have been developed to the original concept. First we will consider those that do not change the fundamental structure of the Savitzky–Golay approach, but simply make it easier to use. The main development along this line is the elimination of the tables. On the one hand, tables of coefficients are easy to deal with conceptually, because they can be applied mechanically – just copy down the entries and use them to multiply the data by. In fact, our initial foray into the world of Savitzky–Golay involved writing just such a program. The task was tedious, but having done it once and verified the numbers it should never be necessary to do it again. However, as noted above, this approach has the inherent limitation of including only those conditions that are listed in the Savitzky–Golay tables; extensions to the derivative order, polynomial degree, or number of data points used are excluded.

Therefore an extension of this idea was presented in a paper by Hannibal Madden [7]. Instead of presenting the already-worked-out numbers, Madden derived formulas from which the coefficients could be computed, and presented a table of those formulas in his paper. This is definitely a step up, since it confers several advantages:

1) Through the use of these formulas, Savitzky–Golay convolution coefficients could be computed for a convolution function using any odd number of data points for the convolution.
2) Since the coefficients are being computed by the computer, there is no chance for typographical errors occurring in the coefficients.

Madden’s paper, however, also has limitations:

1) The paper contains formulas for only those derivative orders and degrees of polynomials that are contained in the original Savitzky–Golay paper; therefore we are still limited to those derivative orders and polynomial degrees.
2) The coefficients produced still contain the implicit assumption that the value of $\Delta X$ = 1. Therefore to produce correct derivatives, it will still be necessary to divide the results from the formulas by $(\Delta X)^n$, as above.
3) The formulas are at least as complicated, difficult and tedious to enter as the tables they replace, and as fraught with the possibility of typos during their entry. This is exacerbated by the fact that, being a formula in a computer program, everything must be just so, and all the parentheses, and so on, must be in the right places, which, for formulas as complicated as those are, is not easy to do. Nevertheless, as with the tables, once it is done correctly it need not be done again (but make sure you back up your work!). However, for the real kick in the pants, see the next item on this list.
4) There is an error in one of the formulas! While writing the program to implement the formulas in Madden’s paper, despite the tedium, most of the formulas (ten of the eleven given) in the program were working correctly in fairly short order – “correctly” in this case meaning that the coefficients agree with those of Savitzky–Golay or of Steinier, as appropriate. There was a problem with one of the formulas, however; the one for the third derivative using a quintic (fifth degree) polynomial
fitting function. The coefficients produced were completely unreasonable, as well as being wrong. The coding of the formula was checked a couple of ways. First that formula was rewritten again, starting from scratch and using a different scheme to convert the printed formula to computer code; the same wrong answers were obtained both times. Then a buddy (Dave Hopkins), who was working with me on a project, was asked to check the coding; he reported not being able to find any discrepancies between the printed formula and what was coded. This left two possibilities: either the printed formula was wrong or the corresponding Steinier table was wrong. We first tried to contact Hannibal Madden since the paper gave his affiliation as Sandia National Laboratory, but he was no longer there and the Human Resources department had no information as to his current whereabouts. Finally the problem was posted to an on-line discussion group (the discussion group for the International Chemometrics Society), asking if anybody had information relating to this problem. Fortunately, Premek Lubal, one of the members of the group, had run into this problem previously, while checking the derivations in Madden’s paper, and knew the solution (P. Lubal, 2002, private communication). To save grief on the part of anybody who might want to code these formulas for themselves, here is the solution: in the formula for the case involved, the quintic fitting function for the third derivative, the term (50 ∗ m) has the wrong sign. The sign in the printed formula is negative (−), and it should be positive (+). After changing the sign of that term, the program produced the correct coefficients.

So now the question presents itself: is there a more general method of computing coefficients for any arbitrary set of combinations of derivative order, polynomial degree and number of data points to fit? That is, is there an automated method for computing Madden’s formulas, or at least the Savitzky–Golay convolution coefficients? The answer turns out to be Yes. From the same on-line discussion that produced the solution to the problem in the Madden paper, Chris Brown pointed out some pertinent literature citations [8, 9], and summarized them into the general solution that we discuss below (C. Brown, 2002, private communication).

Is the solution as “simple” as the tables in Savitzky–Golay/Steinier or the formulas in Madden? This is a matter of perception. Had this general solution been presented to the chemical/spectroscopic community in 1964 (at the time of the original Savitzky–Golay paper), it would have been considered far beyond what a “mere” chemist would be expected to know, and would never have gained the popularity it currently enjoys. With the advent of modern software tools, however, tools such as MATLAB and even the older language, APL, matrix operations can be coded directly from the matrix-math expressions, and then it becomes near-trivial to create and solve the matrix equations on-the-fly, so to speak, and calculate the coefficients for any derivative using any desired polynomial, and computed over any odd number of data points.

Wentzell and Brown [9] present this scheme in a very clear way, the same way that Chris Brown gave it to me. We start by creating a matrix. This matrix is based on the index of coefficients that are to be ultimately produced. Savitzky and Golay labeled the coefficients in relation to the central data point of the convolution; therefore a three-term set of coefficients is labeled −1, 0, 1. A five-term set is labeled −2, −1, 0, 1, 2 and so forth.


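The matrix (M) is set up like this table (this, of course, is only one example, for expository purposes):

$$
\mathbf{M} = \begin{bmatrix}
1 & -3 & 9 & -27 \\
1 & -2 & 4 & -8 \\
1 & -1 & 1 & -1 \\
1 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 \\
1 & 2 & 4 & 8 \\
1 & 3 & 9 & 27
\end{bmatrix} \tag{56-22}
$$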

What are the key characteristics of this matrix that we need to know? The first one is that each column of the matrix contains the set of index numbers raised to the n − 1 power, where n is the column number in the table. Thus the first column contains the zeroth power, which is all 1s, the second column contains the first power, which is the set of index numbers themselves, and the rest of the columns are the second and third powers of the index numbers.

What determines the number of rows and columns? The number of rows is determined by the number of coefficients that are to be calculated. In this example, therefore, we will compute a set (sets, actually, as we will see) of seven coefficients. The number of columns is determined by the degree of the polynomial that will be used as the fitting function (it is one more than the degree). The number of columns also determines the maximum order of derivative that can be computed. In our example we will use a third-power fitting function and we can produce up to a third derivative. As we shall see, coefficients for lower-order derivatives are also computed simultaneously. The matrix M is then used as the argument for the following matrix equation:

$$
\text{Coefficients} = (\mathbf{M}^T \mathbf{M})^{-1} \mathbf{M}^T \tag{56-23}
$$

where, by convention, the boldface M refers to the matrix we produced, the superscript T refers to the transpose of the matrix and the superscript −1 means the matrix inverse of the argument. Let us evaluate this expression. The matrix M is given above, as equation 56-22. The transpose, then, is

$$
\mathbf{M}^T = \begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 \\
-3 & -2 & -1 & 0 & 1 & 2 & 3 \\
9 & 4 & 1 & 0 & 1 & 4 & 9 \\
-27 & -8 & -1 & 0 & 1 & 8 & 27
\end{bmatrix} \tag{56-24}
$$

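We then need to multiply these two matrices together to form $\mathbf{M}^T \mathbf{M}$ (rules for matrix multiplication are given in many books, including [10]):

$$
\mathbf{M}^T \mathbf{M} = \begin{bmatrix}
7 & 0 & 28 & 0 \\
0 & 28 & 0 & 196 \\
28 & 0 & 196 & 0 \\
0 & 196 & 0 & 1588
\end{bmatrix} \tag{56-25}
$$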


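Then we compute the matrix inverse of equation 56-25 (in MATLAB, this is just inv(m)):

$$
(\mathbf{M}^T \mathbf{M})^{-1} = \begin{bmatrix}
0.333333 & 0 & -0.0476190 & 0 \\
0 & 0.262566 & 0 & -0.0324074 \\
-0.0476190 & 0 & 0.01190476 & 0 \\
0 & -0.0324074 & 0 & 0.00462962
\end{bmatrix} \tag{56-26}
$$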

Finally, multiplying equation 56-26 by equation 56-24 gives

$$
(\mathbf{M}^T \mathbf{M})^{-1} \mathbf{M}^T = \begin{bmatrix}
-0.09523 & 0.142857 & 0.285714 & 0.333333 & 0.285714 & 0.142857 & -0.09523 \\
0.087301 & -0.2658730 & -0.230158 & 0 & 0.230158 & 0.2658730 & -0.087301 \\
0.0595238 & 0 & -0.0357142 & -0.0476190 & -0.0357142 & 0 & 0.0595238 \\
-0.0277777 & 0.0277777 & 0.0277777 & 0 & -0.0277777 & -0.0277777 & 0.0277777
\end{bmatrix} \tag{56-27}
$$

Equation 56-27 contains scaled coefficients for the zeroth through third derivative convolution functions, using a third degree polynomial fitting function. The first row of equation 56-27 contains the coefficients for smoothing, the second row contains the coefficients for the first derivative, and so forth. Equation 56-27 gives the coefficients, but there is a scaling factor missing. Therefore there is one more final computation that needs to be performed to create the correct coefficients; each row must be multiplied by the scaling factor. The scaling factor is (p − 1)! where p is the row number. Therefore the scaling factors for the first two rows are unity, since 0! and 1! are both unity, the scaling factor for the third row is two and for the fourth row is six. The final set of coefficients therefore is

$$
(\mathbf{M}^T \mathbf{M})^{-1} \mathbf{M}^T \ \text{(corrected for scaling)} = \begin{bmatrix}
-0.09523 & 0.142857 & 0.285714 & 0.333333 & 0.285714 & 0.142857 & -0.09523 \\
0.087301 & -0.265873 & -0.230158 & 0 & 0.230158 & 0.265873 & -0.087301 \\
0.119047 & 0 & -0.0714285 & -0.095238 & -0.0714285 & 0 & 0.1190476 \\
-0.166666 & 0.166666 & 0.166666 & 0 & -0.166666 & -0.166666 & 0.166666
\end{bmatrix} \tag{56-28}
$$
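The whole computation of equations 56-22 through 56-28 fits in a few lines of matrix code. The following is a minimal NumPy sketch rather than the MATLAB or APL code mentioned above; the function name and the optional `delta_x` argument (which applies the division by $(\Delta X)^n$ discussed earlier) are our own additions.

```python
import numpy as np
from math import factorial

def savgol_coefficient_table(n_points=7, degree=3, delta_x=1.0):
    """Return a (degree + 1) x n_points array of convolution coefficients.

    Row 0 holds the smoothing (zeroth derivative) coefficients, row 1 the
    first derivative, and so on up to the derivative of order `degree`.
    """
    if n_points % 2 == 0 or n_points <= degree:
        raise ValueError("n_points must be odd and larger than the degree")
    half = n_points // 2
    index = np.arange(-half, half + 1)            # -3 .. 3 for seven points
    # Columns are the index raised to the powers 0, 1, ..., degree (eq. 56-22)
    M = np.vander(index, degree + 1, increasing=True).astype(float)
    unscaled = np.linalg.inv(M.T @ M) @ M.T       # eqs. 56-23 and 56-27
    # Row p (counting from zero) is scaled by p! and divided by delta_x**p
    scale = np.array([factorial(p) / delta_x ** p for p in range(degree + 1)])
    return unscaled * scale[:, None]              # eq. 56-28 when delta_x = 1

table = savgol_coefficient_table(7, 3)
```

With the default arguments, the rows of `table` should agree with equation 56-28 to within rounding. Changing `n_points` or `degree` produces coefficient sets that are not restricted to the published tables.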

Finally, for those who are facile with the matrix math, Bialkowski [8] also shows how the “end effect” can be obviated, as well as allowing the use of even numbers of data points, but the advanced considerations involved are beyond the scope of our chapter.

When this material was first published as an article, our respondents pointed out that the magnitudes of the various derivatives, and especially the relative magnitudes of derivatives of different orders, depend on the units used, particularly the units used to describe the X-axis. Now, while in fact we did not specify any units in our discussion (see, e.g., Figure 54-1 in [1], where the X-axis contains only the label “Wavelength”), given our backgrounds it is true enough that we implicitly had nanometers in mind for our X-units. In the case of real spectra, however, if spectra were measured using, say, microns as the units for the X-axis, the same spectrum would have a calculated value for the first derivative that was 1000 times what would be calculated for an “nm-based” derivative. In that case, the first derivative (for a 10 nm wide band, which would be a
0.01 micron wide band) would be 100 times greater than the maximum spectral value, rather than being 1/10 of it, as the value computed using nanometers for the X-scale came out to. The second derivative would then be $10^6$ times what we calculated and therefore 10,000 times greater than the maximum spectral value, instead of being 1/100 of it, the value we showed. In principle this is all correct. In practice, however, if we ignore FTIR and specialty technologies such as AOTF, then the vast majority of instruments in use today for modern NIR spectroscopy (still primarily diffraction grating based instruments) use nanometers as their wavelength unit, and usually collect data at some small integer number of nanometers. Furthermore, the vast majority of those have a 10-nm bandpass, so that 10 nm is the minimum bandwidth that would be measured. Also, even for instruments with higher resolution, the natural bandwidths of many, or even most, absorbance bands of materials that are commonly measured are greater than 10 nm in the NIR. Given all this, the use of a 10 nm figure to represent a “typical” NIR absorbance band is not unrealistic, and gives the reader a realistic assessment of what a “typical” user can expect from the NIR spectra he measures, and their derivatives. The choice of units, of course, does not affect the instrumental characteristic of signal-to-noise, which is what is important, and which we discuss in Chapter 57 [11].

If we consider FTIR instrumentation then the situation is trickier, since the equivalent resolution in nm varies across the spectrum. But even keeping the spectrum in its “natural” wavenumber units, we again find that, except for rotational fine structure of gases, the natural bandwidth of many (most) absorbance bands is greater than 10 wavenumbers. So again, using that figure shows the “typical” user how he can expect his own measured spectra to behave.
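The dependence of the derivative magnitudes on the X-axis units is easy to confirm numerically. The sketch below differentiates the same hypothetical band once against a nanometer axis and once against a micron axis; the band position, width and sampling interval are arbitrary illustrative choices, and `np.gradient` stands in for whichever derivative estimate is actually used.

```python
import numpy as np

x_nm = np.arange(1000.0, 1100.5, 0.5)     # hypothetical wavelength axis in nm
x_um = x_nm / 1000.0                      # the same axis expressed in microns
band = np.exp(-0.5 * ((x_nm - 1050.0) / 10.0) ** 2)   # unit-height Normal band

d1_nm = np.gradient(band, x_nm)           # first derivative, per nm
d1_um = np.gradient(band, x_um)           # first derivative, per micron
d2_nm = np.gradient(d1_nm, x_nm)          # second derivative, per nm squared
d2_um = np.gradient(d1_um, x_um)          # second derivative, per micron squared

ratio_first = np.max(np.abs(d1_um)) / np.max(np.abs(d1_nm))
ratio_second = np.max(np.abs(d2_um)) / np.max(np.abs(d2_nm))
```

Apart from floating-point rounding, `ratio_first` comes out as 1000 and `ratio_second` as one million. The spectrum is identical in both cases; only the X-axis units differ.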
We thank Todd Sauke, Peter Watson, and (again) Colin Christy for pointing out the errors and for general comments and discussion.

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 18(4), 32–37 (2003).
2. Mark, H. and Workman, J., Spectroscopy 18(9), 25–28 (2003).
3. Mark, H. and Workman, J., Spectroscopy 3(8), 13–15 (1988).
4. Mark, H. and Workman, J., Spectroscopy 18(12), 106–111 (2003).
5. Savitzky, A. and Golay, M.J.E., Analytical Chemistry 36(8), 1627–1639 (1964).
6. Steinier, J., Termonia, Y. and Deltour, J., Analytical Chemistry 44(11), 1906–1909 (1972).
7. Madden, H.H., Analytical Chemistry 50(9), 1383–1386 (1978).
8. Bialkowski, S.E., Analytical Chemistry 61(11), 1308–1310 (1989).
9. Wentzell, P.D. and Brown, C.D., “Signal Processing in Analytical Chemistry”; in Encyclopedia of Analytical Chemistry, Meyers, R. A. (Ed.) (John Wiley & Sons, Chichester, 2000), pp. 9764–9800.
10. Mark, H. and Workman, J., Statistics in Spectroscopy, 1st ed. (Academic Press, New York, 1991).
11. Mark, H. and Workman, J., Spectroscopy 19(1), 44–51 (2004).


57

Derivatives in Spectroscopy: Part 4 – Calibrating with Derivatives

Chapters 54–56 [1–3] contained discussion of the theoretical aspects of using derivatives in the analysis of spectroscopic data, followed by a discussion of the development of the Savitzky–Golay method of using convolution functions to compute derivatives, concluding with the presentation of a general method to create the set of convolution coefficients for any desired order of derivative, using any degree of polynomial fitting function and number of data points.

When performing quantitative calibrations using a derivative transform, several possible problems can arise. We have already noted that one of these is the possibility that the data used to compute the derivative will be affected by interfering materials. There is little we can do in a chapter such as this to deal with such arbitrary and sample-dependent issues. Therefore we will concentrate on those issues which are amenable to mathematical analysis; this consists mostly of the behavior of the computed derivative when there is noise on the data.

Most of our discussion so far has centered on the use of the two-point-difference method of computing an approximation to the true derivative, but since we have already brought up the Savitzky–Golay method, it is appropriate here to consider both ways of computing derivatives, when considering how they behave when used for quantitative calibration purposes. In fact, the two-point method can be considered a special case of the more general S–G concept, since it can be considered the application of the set of convolution coefficients −1, 0, 1 to the data. Of course, these convolution coefficients were created ad hoc, and not according to the general scheme that produces the S–G set. Nevertheless, it is convenient to group them together for the purpose of further examination. We are also indebted to David Hopkins for invaluable discussions concerning the properties of the S–G convolution coefficients (D. Hopkins, 2002, personal communication).
In our previous chapter we derived the expressions for the first and second derivatives of both the Normal and Lorentzian band shapes [1]. For the following discussion, however, we will address only the Normal case, as we will see, the Lorentzian case will parallel it closely. In that previous chapter, we used the standard generic formula for the Normal distri bution, ignoring the aspect of using it to describe the situation for quantitative analysis. The quantity of concern now is the S/N of the data that we will use to perform the calibration calculations. In order to deal with this systematically, the S/N must now be divided into two parts: the magnitude of the signal, and the magnitude of the noise. Then different situations can be compared by independently computing the signal and noise contributions to the final S/N that is operative on the calibration. We start with the simpler case, the signal. By investigating the behavior of the theoretical, ideal derivative, we avoid issues having to do with the different ways of an


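approximation to the derivative can be obtained. The various approximations that can be obtained through the use of constructs such as the Savitzky–Golay convolutions allow us to make tradeoffs between maximizing the signal, faithfully reconstructing the true derivative, and creating artifacts, but these issues are all obviated by considering the behavior of the theoretically ideal case. When we come to consider the noise, then as we shall see, the nature of the approximating method becomes very important, but for now we will ignore that. If the concentration of a material can vary, however, then according to Beer’s law, the absorbance at any given wavelength will also be proportional to C, the concentration. Therefore to take the concentration into account we must modify (including changing the generic Y variable to A, to indicate absorbance) equations 54-1a, 54-6a and 54-9a (found in Chapter 54) to

$$A = C\,\frac{1}{\sigma(2\pi)^{1/2}}\,e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}} \tag{57-29}$$

Whereupon the first derivative becomes

$$\frac{dA}{dX} = C\,\frac{-(X-\mu)}{\sigma^{3}(2\pi)^{1/2}}\,e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}} \tag{57-30}$$

And the second derivative is

$$\frac{d^{2}A}{dX^{2}} = C\left[\frac{(X-\mu)^{2}}{\sigma^{5}(2\pi)^{1/2}} - \frac{1}{\sigma^{3}(2\pi)^{1/2}}\right]e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}} \tag{57-31}$$

The “signal” part of the S/N ratio that concerns us is the way these expressions vary with the concentration of the analyte. Therefore, from equation 57-29 we obtain, for the absorbance signal:

$$\frac{dA}{dC} = \frac{d}{dC}\left[C\,\frac{1}{\sigma(2\pi)^{1/2}}\,e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}}\right] = \frac{1}{\sigma(2\pi)^{1/2}}\,e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}} \tag{57-32}$$

For the first derivative we obtain

$$\frac{d}{dC}\left(\frac{dA}{dX}\right) = \frac{d}{dC}\left[C\,\frac{-(X-\mu)}{\sigma^{3}(2\pi)^{1/2}}\,e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}}\right] = \frac{-(X-\mu)}{\sigma^{3}(2\pi)^{1/2}}\,e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}} \tag{57-33}$$

And for the second derivative we obtain

$$\frac{d}{dC}\left(\frac{d^{2}A}{dX^{2}}\right) = \frac{d}{dC}\left\{C\left[\frac{(X-\mu)^{2}}{\sigma^{5}(2\pi)^{1/2}} - \frac{1}{\sigma^{3}(2\pi)^{1/2}}\right]e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}}\right\} = \left[\frac{(X-\mu)^{2}}{\sigma^{5}(2\pi)^{1/2}} - \frac{1}{\sigma^{3}(2\pi)^{1/2}}\right]e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}} \tag{57-34}$$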
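The concentration sensitivities in equations 57-32 through 57-34 can be checked numerically. The following is a minimal sketch, assuming the constant-area Normal band of equation 57-29; the function names, the default band parameters, and the finite-difference step `dc` are illustrative choices of ours, not part of the original derivation.

```python
import math

SQRT_2PI = math.sqrt(2.0 * math.pi)

def absorbance(x, c, mu=0.0, sigma=1.0):
    """Equation 57-29: constant-area Normal band scaled by concentration c."""
    z = (x - mu) / sigma
    return c / (sigma * SQRT_2PI) * math.exp(-0.5 * z * z)

def first_derivative(x, c, mu=0.0, sigma=1.0):
    """Equation 57-30: dA/dX."""
    z = (x - mu) / sigma
    return -c * (x - mu) / (sigma ** 3 * SQRT_2PI) * math.exp(-0.5 * z * z)

def second_derivative(x, c, mu=0.0, sigma=1.0):
    """Equation 57-31: d2A/dX2."""
    z = (x - mu) / sigma
    bracket = (x - mu) ** 2 / (sigma ** 5 * SQRT_2PI) - 1.0 / (sigma ** 3 * SQRT_2PI)
    return c * bracket * math.exp(-0.5 * z * z)

def sensitivity(func, x, c, dc=1e-6, **band):
    """Central-difference estimate of d(func)/dC at concentration c."""
    return (func(x, c + dc, **band) - func(x, c - dc, **band)) / (2.0 * dc)

if __name__ == "__main__":
    # Each sensitivity should equal the corresponding expression with C = 1.
    for f in (absorbance, first_derivative, second_derivative):
        print(f.__name__, sensitivity(f, 0.7, 2.0, sigma=1.5), f(0.7, 1.0, sigma=1.5))
```

Because every expression is linear in C, the numerical sensitivity at any concentration coincides with the expression evaluated at C = 1, which is the content of equations 57-32 to 57-34.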

As we see from these equations, we have recovered the original expressions for the absorbance and the derivatives with respect to wavelength. The expression we used


for the Normal curve was the constant-area expression, but the continuation of the derivation for the change of the signal with respect to concentration will follow for the constant-height case, and for the Lorentzian curve, also. As we saw in the previous chapter [1], when compared to the rate of change of the absorbance, the maximum value of the first derivative decreases as $\sigma^{2}$ (i.e., $\sigma^{3}$ for the derivative divided by $\sigma$ for the absorbance) and the second derivative similarly decreases as $\sigma^{4}$, and therefore their derivatives with respect to concentration (which is the sensitivity to concentration changes) also decrease that way. Therefore we now turn to the “noise” part of the S/N ratio. As we saw just above, the two-point derivative approximation can be put into the framework of the S–G convolution functions, and we will therefore not treat them as separate methods. We have derived previously [4, 5] that the following expression relates the noise on data to the noise of a constant multiple of that data:

$$\mathrm{Var}(aX) = a^{2}\,\mathrm{Var}(X) \tag{57-35}$$

and, of course, we know that variances add. Therefore, if we have several variables, each of them contaminated with some noise (whose variance is Var(X)), and they are multiplied by some constants, then the variance of the result is

$$\mathrm{Var}(X_{net}) = a_{1}^{2}\,\mathrm{Var}(X) + a_{2}^{2}\,\mathrm{Var}(X) + a_{3}^{2}\,\mathrm{Var}(X) + \cdots \tag{57-36}$$
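Equation 57-36 can be illustrated with a small simulation. The sketch below assumes independent, Normally distributed noise of equal variance on every data point (the constant-noise model); the function names, the trial count, the seed, and the three weights in the usage lines are hypothetical values chosen only for illustration.

```python
import random
import statistics

def variance_multiplier(coeffs):
    """Sum of squared coefficients: the factor multiplying Var(X) in equation 57-36."""
    return sum(a * a for a in coeffs)

def simulated_variance_ratio(coeffs, n_trials=20000, noise_sd=1.0, seed=1):
    """Monte Carlo estimate of Var(sum of a_i * X_i) / Var(X) for independent noise."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_trials):
        noise = [rng.gauss(0.0, noise_sd) for _ in coeffs]
        results.append(sum(a * e for a, e in zip(coeffs, noise)))
    return statistics.pvariance(results) / noise_sd ** 2

if __name__ == "__main__":
    weights = [0.25, 0.5, 0.25]  # hypothetical weights, not taken from the S-G tables
    print("predicted:", variance_multiplier(weights))
    print("simulated:", simulated_variance_ratio(weights))
```

The simulated ratio should agree with the sum of squared coefficients to within sampling error; the agreement would not be expected if the noise on neighboring data points were correlated.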

Therefore, if X represents the spectrum, the various $a_i$ represent convolution coefficients and Var(X) represents a noise source that gives a constant noise level to the spectral values, then equation 57-36 gives the noise variance expected to be found on the computed resultant value, whether that is a smoothed spectral value, or any order derivative computed from a Savitzky–Golay convolution. For a more realistic computation, an interested (and energetic) reader may wish to compute and use the actual noise that will occur on a spectrum, from the information determined in the previous chapters [6, 7], instead of using a constant-noise model. But for our current purposes we will retain the constant-noise model; then equation 57-36 can be simplified slightly:

$$\mathrm{SD}(X_{net}) = \mathrm{SD}(X)\,\sqrt{a_{1}^{2} + a_{2}^{2} + a_{3}^{2} + \cdots} \tag{57-37}$$

The radical gives the multiplying factor for the noise standard deviation for the computed derivative (or smoothed spectrum, but that is not our topic here; we will address only the question of the effect on derivatives), and can be computed solely from the convolution coefficients themselves, independently of the effect of the convolution on the “signal” part of the S/N ratio. The nature of the convolution function matters, however, and so do the details of the way it is computed. To see this, let us begin by considering the two-point derivative we have been dealing with in most of this sub-series of chapters. For our first examination of the effect, let us consider that we are computing the derivative from adjacent data points spaced 1 nm apart (such as in our initial discussion of derivatives [1]). As we mentioned, the two-point first derivative is equivalent to using the convolution function {−1, 1}. We also treated this in our previous chapter, but it is worth repeating here. Therefore the multiplying factor of the spectral noise variance is $(-1)^{2} + 1^{2} = 2$,


and the multiplying factor for the noise standard deviation is $\sqrt{2}$. Similarly, the second derivative based on adjacent data points is equivalent to a convolution function of {1, −2, 1}, making the multiplying factor for the standard deviation of the derivative calculated this way equal to $\sqrt{6}$. Since we have noted above that the magnitude of the “signal” part of the S/N ratio, the concentration sensitivity of the wavelength derivative $d(dA/dX)/dC$, decreases with increasing derivative order, at this point it would appear that since the signal decreases and the noise increases when you take a derivative, you wind up losing from both parts of the S/N ratio. But things are not so simple. In this examination we have so far looked only at a derivative calculated from adjacent data points. What happens when we calculate a two-point derivative based on non-adjacent data points? In fact we have already considered this question qualitatively in our previous chapter [3], when we noted that using the optimum spacing will result in an improved S/N ratio for the derivative. Of course, “improved” in this case is in comparison to the derivative computed using adjacent data points; it must be determined on a case-by-case basis whether the improvement is sufficient to exceed that of the actual direct absorbance signal. The improvement can also be expressed semi-quantitatively in a graph, as we do in Figure 57-12. Here we show the true spectrum as the straight line representing the true derivative, and the measured absorbance data as the large Xs. Since the measured data are contaminated with random noise, they do not fall on the line representing the true spectrum. The diagram is set up in such a way, however, that the “noise” on the data from the two wavelengths representing spacing = 1 and spacing = 2 is the same. It is clear from this diagram that the computed approximation to the true derivative is better for the case of spacing = 2, even though the noise is the same. There are several ways to express this in words.
One way is to note that the error is “spread” over a larger X distance, and therefore has less effect at any one point. Another way is to note that for a derivative computation, the effective “signal” is the value of

Figure 57-12 This diagram shows how, as the spacing at which the derivative is computed increases, the error in the approximation to the true derivative decreases, even for the same error in the data. (Labels in the diagram: True derivative; Deriv error 1; Deriv error 2; ΔX = 1; ΔX = 2.)


$\Delta Y$, and when $\Delta X = 2$, $\Delta Y$ is double the value of $\Delta Y$ when $\Delta X = 1$. Since the noise is the same, the S/N therefore improves with an increase in the spacing. We learned in our prior chapter [3], however, that the improvement is linear with spacing only at very small values of $\Delta X$; at large values it decreases, levels off, and then eventually starts to get worse again. From a mathematical point of view, we can let $\delta X$ be the increment between adjacent measurement wavelengths. Then $\Delta X = n \times \delta X$, where n is the number of wavelength increments over which the derivative is calculated. Then,

$$\text{computed derivative} = \frac{\Delta Y}{\Delta X} = \frac{\Delta Y}{n\,\delta X} \tag{57-38}$$

And applying equation 57-35 to find the variance of the computed derivative we obtain

$$\mathrm{Var}(\text{derivative}) = \frac{2}{n^{2}}\,\mathrm{Var}\!\left(\frac{Y}{\delta X}\right) \tag{57-39}$$

where the multiplier of two comes from the fact that a derivative is calculated from two data points, as we just showed in the above discussion. Since $\delta X$ is a constant (with an assumed value of unity), and therefore its variance is zero, equation 57-39 becomes:

$$\mathrm{Var}(\text{derivative}) = \frac{2}{n^{2}}\,\mathrm{Var}(Y) \tag{57-40}$$

Converting to standard deviations:

$$\mathrm{SD}(\text{derivative}) = \frac{\sqrt{2}}{n}\,\mathrm{SD}(Y) \tag{57-41}$$
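The dependence of the derivative noise on the spacing n in equation 57-41 can be checked by simulation. This is a rough sketch under stated assumptions: the underlying spectrum is taken to be a straight line (so the true derivative is constant), the wavelength increment is unity, and the noise is independent and Normally distributed; the slope, seed, and trial count are arbitrary illustrative values.

```python
import math
import random
import statistics

def two_point_derivative(y, i, n, dx=1.0):
    """Difference quotient (y[i+n] - y[i]) / (n * dx), as in equation 57-38."""
    return (y[i + n] - y[i]) / (n * dx)

def predicted_sd(noise_sd, n):
    """Equation 57-41: SD of the computed derivative for spacing n (unit increment)."""
    return math.sqrt(2.0) * noise_sd / n

def simulated_sd(noise_sd, n, slope=0.5, n_trials=20000, seed=7):
    """Monte Carlo SD of the two-point derivative of a noisy straight line."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_trials):
        y = [slope * k + rng.gauss(0.0, noise_sd) for k in range(n + 1)]
        estimates.append(two_point_derivative(y, 0, n))
    return statistics.pstdev(estimates)

if __name__ == "__main__":
    for spacing in (1, 2, 4, 8):
        print(spacing, predicted_sd(0.1, spacing), simulated_sd(0.1, spacing))
```

On a straight line the signal part is unaffected by the spacing, so this sketch isolates the noise term only; for a real band the signal eventually falls off at large spacings, as noted in the text.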

A similar expression can be developed for the second derivative, but we leave that as an exercise for the reader. We turn now to the effect of using the Savitzky–Golay convolution functions. Table 57-1 presents a small subset of the convolutions from the tables. Since the tables were fairly extensive, the entries were scaled so that all of the coefficients could be presented as integers; we have previously seen this. The nature of the values involved caused the entries to be difficult to compare directly; therefore we recomputed them to eliminate the normalization factors, using the actual direct coefficients and so making the coefficients more easily comparable; we present these in Table 57-2. For Table 57-2 we also computed the sums of the squares of the coefficients and present them in the last row. One trend is obvious: the more data included in the computation, the smaller the variance multiplying factor. This is expected for the case of smoothing; we know that the more data included in even an ordinary running smooth (i.e., a running arithmetic average), the smaller the variance of the smoothed (averaged) result (the standard deviation reducing as the square root of the number of data points included in the average). Therefore it is not surprising to find it also happening with a weighted average, such as we find with a Savitzky–Golay smooth. We see a similar effect from the first derivative; this can also be considered to be extended from the case of the two-point derivative, where we showed above that the


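Table 57-1 Some of the Savitzky–Golay convolution coefficients using a quadratic fitting function

| Index | 5-point smooth | 7-point smooth | 9-point smooth | 5-point first deriv | 7-point first deriv | 9-point first deriv |
|---|---|---|---|---|---|---|
| −4 | | | −21 | | | −4 |
| −3 | | −2 | 14 | | −3 | −3 |
| −2 | −3 | 3 | 39 | −2 | −2 | −2 |
| −1 | 12 | 6 | 54 | −1 | −1 | −1 |
| 0 | 17 | 7 | 59 | 0 | 0 | 0 |
| 1 | 12 | 6 | 54 | 1 | 1 | 1 |
| 2 | −3 | 3 | 39 | 2 | 2 | 2 |
| 3 | | −2 | 14 | | 3 | 3 |
| 4 | | | −21 | | | 4 |
| Normal. factor | 35 | 21 | 231 | 10 | 28 | 60 |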

Table 57-2 The Savitzky–Golay convolution coefficients multiplied out. All coefficients are for a quadratic fitting function. See text for meaning of SSK

| Index | 5-point smooth | 7-point smooth | 9-point smooth | 5-point first deriv | 7-point first deriv | 9-point first deriv |
|---|---|---|---|---|---|---|
| −4 | | | −0.0909 | | | −0.06666 |
| −3 | | −0.0952 | 0.06060 | | −0.10714 | −0.05 |
| −2 | −0.0857 | 0.14285 | 0.16883 | −0.2 | −0.07142 | −0.03333 |
| −1 | 0.34285 | 0.28571 | 0.23376 | −0.1 | −0.03571 | −0.01667 |
| 0 | 0.48571 | 0.33333 | 0.25541 | 0 | 0 | 0 |
| 1 | 0.34285 | 0.28571 | 0.23376 | 0.1 | 0.03571 | 0.01667 |
| 2 | −0.0857 | 0.14285 | 0.16883 | 0.2 | 0.07142 | 0.03333 |
| 3 | | −0.0952 | 0.06060 | | 0.10714 | 0.05 |
| 4 | | | −0.0909 | | | 0.06666 |
| SSK | 0.48571 | 0.333333 | 0.25541 | 0.10 | 0.03571 | 0.016667 |

farther apart the points used are, the smaller the variance of the resulting derivative value. In the case of the Savitzky–Golay convolution functions, however, the mechanism leading to the reduced variance is slightly different from that of the two-point derivative. In the S–G case, the reduced variance is caused by both the use of a wider wavelength range for the derivative computation and the implicit smoothing effect of computing the function over multiple data points, just as it is in the case of explicit smoothing. There are several directions in which the convolutions can be varied; one is to increase the amount of data used, by using longer convolution functions as we demonstrated above. Another is to increase the degree of the fitting polynomial, and the third is to compute higher-order derivatives. In Table 57-3, we present a very small selection of the effect of potential variations.
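The sums of squared coefficients in the last row of Table 57-2 follow directly from the integer entries and normalization factors of Table 57-1. The sketch below uses exact rational arithmetic from the Python standard library; the dictionary layout and function names are our own, and the 9-point smooth is entered with 54 at both index −1 and index +1, the symmetric values consistent with the normalization factor of 231.

```python
from fractions import Fraction

# Integer Savitzky-Golay coefficients and normalization factors (Table 57-1).
TABLE_57_1 = {
    "5-point smooth": ([-3, 12, 17, 12, -3], 35),
    "7-point smooth": ([-2, 3, 6, 7, 6, 3, -2], 21),
    "9-point smooth": ([-21, 14, 39, 54, 59, 54, 39, 14, -21], 231),
    "5-point first deriv": ([-2, -1, 0, 1, 2], 10),
    "7-point first deriv": ([-3, -2, -1, 0, 1, 2, 3], 28),
    "9-point first deriv": ([-4, -3, -2, -1, 0, 1, 2, 3, 4], 60),
}

def ssk(int_coeffs, norm):
    """Sum of squared coefficients after dividing each by the normalization factor."""
    return sum(Fraction(c, norm) ** 2 for c in int_coeffs)

def ssk_table(table=TABLE_57_1):
    """Map each convolution name to its SSK, converted to a float."""
    return {name: float(ssk(coeffs, norm)) for name, (coeffs, norm) in table.items()}

if __name__ == "__main__":
    for name, value in ssk_table().items():
        print(f"{name:22s} SSK = {value:.5f}")
```

For the smoothing functions the SSK coincides with the central coefficient (17/35, 7/21, 59/231), and for the 5-point first derivative the sum is exactly 1/10; the square root of each SSK is the noise multiplier of equation 57-37.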


Table 57-3 More Savitzky–Golay convolution coefficients. See text for meaning of SSK

| Index | 7-point smoothing with quartic fitting function | 5-point first derivative with cubic fitting function | 5-point second derivative, quadratic fitting function |
|---|---|---|---|
| −3 | 0.02164 | | |
| −2 | −0.12987 | 0.083333 | 0.2857 |
| −1 | 0.32467 | −0.66667 | −0.14285 |
| 0 | 0.56709 | 0 | −0.2857 |
| 1 | 0.32467 | 0.666667 | −0.14285 |
| 2 | −0.12987 | −0.083333 | 0.2857 |
| 3 | 0.02164 | | |
| SSK | 0.5670 | 0.9027 | 0.2857 |

What can we learn from Table 57-3? We can compare those sums of squared coefficients with the corresponding ones in Table 57-2 using the same number of data points, and either: 1) the same order derivative with a lower-degree fitting polynomial, or 2) the same degree polynomial, for a lower-order derivative. For comparison 1, we find two cases: 7-point smooth with quadratic versus quartic fitting function, and 5-point first derivative with quadratic versus cubic fitting function. From these two comparisons we find that the noise multiplier of the derivative (of the same order and number of data points) increases as the degree of the fitting function increases. For comparison 2, we find one case: five-point first derivative versus five-point second derivative, both using a quadratic fitting function. Here again, the noise multiplier increased with increasing derivative order. In fact, we see that the five-point first derivative using a cubic fitting function will have almost as high a noise level as the original data. Couple this with the fact we saw above, that the sensitivity to concentration of the first derivative is reduced compared to the sensitivity of the absorbance data itself, and we see that in this particular case, depending on the value of $\sigma$ for the absorbance band, use of this form of computing the derivative may be worse than using the absorbance data, while using a different computation, such as a quadratic fitting function, may be better than the absorbance data. Therefore, whether a particular derivative computation will be beneficial or detrimental must be determined on a case-by-case basis.
For this reason, the reader will find it another very interesting exercise to compute the sums of the squares of the coefficients for several of the sets of coefficients, to extend these results to both higher-order derivatives and higher-degree polynomials, and to ascertain their effect on the variance of the computed derivative for extended versions of these tables. Hopkins [8] has performed some of these computations, and has also coined the term “RSSK/Norm” for the $\sqrt{\sum(\text{coeff}/\text{Normalization factor})^{2}}$ in the S–G tables. Since here we pre-divide the coefficients by the normalization factors, and we are not taking the square roots, we use the simpler term SSK (sum squared coefficients) for our equivalent quantity. Hopkins in the same paper has also demonstrated how the two-point


computation of derivatives can also have an equivalent value of the RSSK/Norm, with results essentially equivalent to the ones we present above. Table 57-3 in [8], particularly, shows how differences in the application of the derivative computation can cause the noise level of the computed derivative to be either greater or less than the noise of the absorbance spectrum from which they are computed.
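For readers who wish to try the exercise suggested above, the following is one possible sketch of how Savitzky–Golay weights of any length, polynomial degree, and derivative order might be generated and their SSK computed. It assumes unit point spacing and an odd number of points, obtains the weights from the least-squares normal equations in exact rational arithmetic, and uses only the Python standard library; the function names `solve`, `sg_weights`, and `sum_squared` are hypothetical helpers of ours, not the procedure of Chapter 56 or of reference [8].

```python
from fractions import Fraction
from math import factorial

def solve(matrix, rhs):
    """Solve a small square linear system exactly by Gauss-Jordan elimination."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

def sg_weights(n_points, degree, deriv=0):
    """Weights giving the deriv-th derivative of the fitted polynomial at the center."""
    m = n_points // 2
    xs = [Fraction(x) for x in range(-m, m + 1)]
    # Normal-equation matrix J^T J, where J[i][j] = xs[i] ** j.
    jtj = [[sum(x ** (j + k) for x in xs) for k in range(degree + 1)]
           for j in range(degree + 1)]
    unit = [Fraction(int(j == deriv)) for j in range(degree + 1)]
    w = solve(jtj, unit)  # the row of (J^T J)^-1 that selects the x**deriv term
    return [factorial(deriv) * sum(wj * x ** j for j, wj in enumerate(w)) for x in xs]

def sum_squared(coeffs):
    """SSK: sum of the squared (already normalized) coefficients."""
    return sum(c * c for c in coeffs)

if __name__ == "__main__":
    for n_points, degree, deriv in [(7, 4, 0), (5, 3, 1), (5, 2, 2), (9, 4, 2)]:
        value = float(sum_squared(sg_weights(n_points, degree, deriv)))
        print(f"{n_points}-point, degree {degree}, derivative {deriv}: SSK = {value:.4f}")
```

The first three cases printed correspond to the columns of Table 57-3; the last is an arbitrary extension of the kind the exercise calls for.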

ACKNOWLEDGEMENT

The authors thank David Hopkins for valuable discussions regarding several aspects of the behavior of Savitzky–Golay derivatives, and also for making sure we spelled “Savitzky” and “Steinier” correctly!

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 18(4), 32–37 (2003).
2. Mark, H. and Workman, J., Spectroscopy 18(9), 25–28 (2003).
3. Mark, H. and Workman, J., Spectroscopy 18(12), 106–111 (2003).
4. Mark, H. and Workman, J., Spectroscopy 3(8), 13–15 (1988).
5. Mark, H. and Workman, J., Statistics in Spectroscopy, 1st ed. (Academic Press, New York, 1991).
6. Mark, H. and Workman, J., Spectroscopy 15(10), 24–25 (2000).
7. Mark, H. and Workman, J., Spectroscopy 18(1), 38–43 (2003).
8. Hopkins, D., Near Infrared Analysis 2(1–13) (2001).

58

Comparison of Goodness of Fit Statistics for Linear Regression: Part 1 – Introduction

The scope of this chapter-formatted mini-series is to provide statistical tools for comparing two columns of data, X and Y. With respect to analytical applications such data may be represented for simple linear regression as the concentration of a sample (X) versus an instrument response when measuring the sample (Y). X and Y may also denote a comparison of the reference analytical results (X) versus predicted results (Y) from a calibrated instrument. At other times one may use X and Y to represent the instrument response (X) to a reference value (Y). Whatever data pairs one is comparing as X and Y, there are several statistical tools that are useful to assess the meaning of a change in Y as a function of a change in X. These include, but are not limited to: correlation (r), the coefficient of determination $R^{2}$, the slope $K_1$, the intercept $K_0$, the z-statistic, and of course the respective confidence limits for these statistical parameters. The use of graphical representation is also a powerful tool for discerning the relationships between X and Y paired data sets. The specific software used for this pedagogical exercise is MathCad 2001i (© MathSoft Engineering & Education, Inc., 101 Main Street, Cambridge, MA 02142-1521), which we find particularly useful for describing the precise mathematics employed behind each set of examples. The mathematical tools used here may be employed whenever the assumptions of linear correlation are suspected or assumed for a set of X and Y data. The data set used for this example is from Miller and Miller ([1], p. 106) as shown in Table 58-1. This data set is used so that the reader may compare the statistics calculated and displayed using the formulas and figures described in this reference with those shown in this series of chapters. The correlation coefficient and other goodness of fit parameters can be properly evaluated using standard statistical tests.
The Worksheets provided in this chapter series can be customized for specific applications providing the optimum information for particular method comparisons and validation studies. When performing X and Y linear regression computations there are several general assumptions. One is assuming that if the correlation between X and Y is significantly large then some cause-and-effect relationship could possibly exist between changes in X, and changes in Y . However, it is important to remember that probability alone tells us only if X and Y “appear” to be related. If no cause-effect relationship exists between X and Y , the regression model will have no true predictive importance. Thus knowledge of cause-and-effect creates a basis for decision making when using regression models. Limitations of inferences derived from probability and statistics arise from limited knowledge of the characteristics and stability of: the nature and origins of the set of samples used for X and Y comparison; the characteristics of the measuring instrument(s) used for collecting both X and Y data; the set of operators performing the measurements; and the precise set of measurement or experimental conditions.

380

Chemometrics in Spectroscopy

Table 58-1 Data used for this study of regression and correlation

| Index | X | Y |
|---|---|---|
| 0 | 0 | 2.1 |
| 1 | 2 | 5 |
| 2 | 4 | 9 |
| 3 | 6 | 12.6 |
| 4 | 8 | 17.3 |
| 5 | 10 | 21 |
| 6 | 12 | 24.7 |

Source: Miller & Miller data (p. 106).

One must note that probability alone can only detect “alikeness” in special cases, thus cause-effect cannot be directly determined, only estimated. If linear regression is to be used for comparison of X and Y, one must assess whether the five assumptions for use of regression apply. As a refresher, recall that the assumptions required for the application of linear regression for comparisons of X and Y include the following: (1) the errors (variations) are independent of the magnitudes of X or Y, (2) the error distributions for both X and Y are known to be normally distributed (Gaussian), (3) the mean and variance of Y depend solely upon the absolute value of X, (4) the mean of each Y distribution is a straight-line function of X, and (5) the variance of X is zero, while the variance of Y is exactly the same for all values of X. The requirement for a priori knowledge useful for providing a scientific basis for comparison of X and Y data poses several questions for the statistician or analyst when using regression as a comparative tool:
1) Is X a true predictor of Y; does cause-effect exist?
2) If X is a true predictor of Y, what is the optimum mathematical relationship to describe a measurement device response with respect to the reference data? (Such information defines the optimum mathematical tools to use for comparison.)
3) What are the effects of operator and measurement or experimental conditions on the change in X relative to Y?
4) What are the effects on X and Y of making measurements on multiple instruments with multiple operators?
5) What is the theoretical response for the X with respect to the Y?
6) What is the Limit of Detection (LOD) relative to changes in X and Y? Is this limit acceptable for the intended application?
In routine comparisons of X and Y data for spectroscopic analysis, when X and Y denote a comparison of the reference analytical results (X) versus instrument response (Y ), at least three main categories of modeling problems are found:


1) The technique is not optimal: the instrument response (Y) is a predictor of analyte values (X). The limitation for modeling is in the representation of calibration set chemistry, sample presentation, and unknown variations of instrument and operator during measurement.
2) There is no clear, specific analyte signal: the instrument response (Y) does not change adequately with a variation in the analyte value (X). This phenomenon indicates that small changes in analyte concentration are not detected by the measurement instrument. Different or additional instrument response information is required to describe the analyte (the problem is underdetermined).
3) The instrument response (Y) changes dramatically with little or no change in analyte value (X). In this case additional clarification is required to define the relationship between the analyte value and the spectroscopic/chemical data for the sample, as interfering factors other than analyte concentration are affecting the instrument response.

Factors affecting the integrity of spectroscopic data include the variations in sample chemistry, the variations in the physical condition of samples, and the variation in measurement conditions. Calibration data sets must represent several sample “spaces” to include compositional space, instrument space, and measurement or experimental condition space (e.g., sample handling and presentation spaces). Interpretive spectroscopy, where spectra-structure correlations are understood, is a key intellectual process in approaching spectroscopic measurements if one is to achieve an understanding of the X and Y relationships of these measurements. The main concept addressed in this new multi-part series is the idea of correlation. Correlation may be referred to as the apparent degree of relationship between variables. The term apparent is used because there is no true inference of cause-and-effect when two variables are highly correlated.
One may assume that cause-and-effect exists, but this assumption cannot be validated using correlation alone as the test criterion. Correlation has often been referred to as a statistical parameter seeking to define how well a linear or other fitting function describes the relationship between variables; however, two variables may be highly correlated under a specific set of test conditions, and not correlated under a different set of experimental conditions. In this case the correlation is conditional and so also is the cause-and-effect phenomenon. If two variables are always perfectly correlated under a variety of conditions, one may have a basis for cause-and-effect, and such a basic relationship permits a well-defined mathematical description. For example, the volume of a cube is perfectly correlated to the length of each side as $V = s^{3}$. Likewise the volume of a sphere is perfectly correlated to its radius as $V = \tfrac{4}{3}\pi r^{3}$. However, the mass of such objects will be highly correlated to s or r only when the densities (d) of the materials used to form the shapes are identical, since d = mass/volume. There is no correlation of mass to s or r when vastly different densities of material are used for comparison. Thus a first-order approximation for s and r vs. mass for widely different materials would lead one to believe that there is not a relationship between volume and mass. Conversely, when working with the same material one would find that volume and mass are perfectly correlated and that there is a direct relationship between volume and mass irrespective of shape. This simple example points to the requirements for a deeper understanding of the underlying phenomena in order to draw conclusions regarding cause and effect based on correlation.
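The cube example can be made concrete with a few numbers. In the sketch below the edge lengths and densities are hypothetical values invented for illustration (the mixed densities are deliberately chosen so that the larger cubes are made of lighter material), and `pearson_r` is our own helper for the ordinary product-moment correlation coefficient.

```python
import math

def pearson_r(xs, ys):
    """Product-moment correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

sides = [1.0, 2.0, 3.0, 4.0, 5.0]        # hypothetical cube edge lengths
volumes = [s ** 3 for s in sides]         # V = s^3

# One material throughout: mass = d * V with a single (hypothetical) density.
same_material_mass = [2.7 * v for v in volumes]

# A different (hypothetical) density for each cube.
mixed_densities = [19.3, 11.3, 2.7, 1.0, 0.5]
mixed_material_mass = [d * v for d, v in zip(mixed_densities, volumes)]

r_same = pearson_r(volumes, same_material_mass)
r_mixed = pearson_r(volumes, mixed_material_mass)

if __name__ == "__main__":
    print("same material:   r =", round(r_same, 4))
    print("mixed materials: r =", round(r_mixed, 4))
```

With a single material the correlation between volume and mass is perfect, while with the mixed densities chosen here it is weak, even though mass = d × volume holds exactly for every cube.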

382

Chemometrics in Spectroscopy

In spectroscopic problems one may observe a high correlation with several data sets, whereas there is poor correlation with other data sets. The underlying cause can often be rich in information content and will lead to a deeper understanding of the problem and underlying phenomena involved. Simply using correlation will not produce this learning if one looks no deeper. However, there are statistical tests which may be applied when using correlation that will help one assess the significance and meaning of correlation for specific test cases. It should be pointed out that when only two variables are compared for correlation, this is referred to as simple correlation. However, when more than two variables are compared for correlation this is termed multiple correlation. In spectroscopy correlation is used in two main ways: (1) for calibration of the instrument response (Y) at one or more channels as absorbance or reflectance of the sample at some wavelength or series of wavelengths to the known analyte property (X) for that sample; and (2) following calibration the predicted analyte concentration (Y) is compared (using correlation) to the known analyte concentration (X). Although correlation contains information regarding the relationship between two or more variables, a powerful visual tool indicating the relationship between variables is given in the use of scatter diagrams. Scatter diagrams indicate correlation, bias, nonlinearity, outliers, and subclasses. With practice one may train the eye to identify these potential effects quite easily. For example, observe the four figures (58-1a through 58-1d) below. The scatter plot illustration in Figure 58-1 demonstrates the power of visual aid to qualitatively assess the potential relationship between two or more variables. Figure 58-1a illustrates a positive, high correlation between X and Y. Figure 58-1b indicates no real correlation between the variables.
Figure 58-1c demonstrates a high, negative correlation between the variables. Figure 58-1d shows several phenomena in the relationship between X and Y . An initial observation indicates that there are three potential outlier samples, one above the line in the upper left hand corner, and two beneath the line in the lower

Figure 58-1 (panels a–d) An illustration of the use of scatter plots for gleaning visual information with respect to the correlation between variables X (abscissa) and Y (ordinate).


right hand corner. These three data points possibly represent two types of samples that are unlike the majority of the samples near the line. If the reference data are accurate these three samples may be outliers and represent some unexplained phenomena. The majority of the samples are plotted near the regression line and potentially represent a nonlinear relationship between X and Y. Thus a scatter plot of X versus Y with a linear regression line overlay is useful as a powerful data analysis tool. The quantitative description of the relationship between two or more variables is often addressed using a least squares regression line referred to as linear regression. Linear regression, as an example of Y on X linear regression, between two data sets involves the relationship

$$Y = K_1 X + K_0 \tag{58-1}$$

where Y is the dependent variable as the estimated or predicted value, X is the independent variable or often the measured value, $K_1$ is the slope or linear regression coefficient, and $K_0$ is the intercept for the regression line. The statistical tools used here are provided as a MathCad 2001i Professional Worksheet, which can be further customized for specific applications. The Worksheet includes graphical comparisons of the correlation coefficient (r), the coefficient of determination $R^{2}$, standard deviation of the calibration samples (Sr), and the standard error of estimate (SEE). Also included is a method for computing the confidence limits for the correlation coefficient; a method for comparing correlation coefficients for different size populations; and a method for computing the confidence limits for the slope and intercept of a data set. All these statistical parameters are computed for user-selected confidence levels. The program provides the required tools for goodness of fit confidence testing when developing validated methods for X and Y comparisons. The use of linear regression as a statistical tool is a standard technique for comparison of two sets of data X and Y where a linear relationship between a change in X ($\Delta X$) and a change in Y ($\Delta Y$) is suspected. Calibration problems associated with instrumental methods often use this technique over a linear dynamic range. This set of chapters and the accompanying MathCad program (shown later) provide the required tools for goodness of fit confidence testing when working with regression for multiple purposes, including developing validation of analytical methods. The use of statistics to calculate the coefficient of determination (R-squared, $R^{2}$), the correlation coefficient (r), slope, and intercept is routine and uncomplicated, yet for some reason equally elementary statistics such as significance testing for these statistical parameters are not often demonstrated in analytical papers or reports.
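As an illustration of equation 58-1, the sketch below fits the Table 58-1 data by ordinary least squares in plain Python rather than in the MathCad Worksheet. The function name and the returned dictionary keys are our own, and the standard error of estimate is computed here with n − 2 degrees of freedom, which is a common convention but an assumption on our part about how the Worksheet defines SEE.

```python
import math

# Data of Table 58-1 (Miller and Miller, p. 106).
X = [0, 2, 4, 6, 8, 10, 12]
Y = [2.1, 5.0, 9.0, 12.6, 17.3, 21.0, 24.7]

def linear_regression(xs, ys):
    """Least-squares fit of Y = K1 * X + K0 (equation 58-1) with basic fit statistics."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    k1 = sxy / sxx                      # slope
    k0 = my - k1 * mx                   # intercept
    r = sxy / math.sqrt(sxx * syy)      # correlation coefficient
    residual_ss = sum((y - (k1 * x + k0)) ** 2 for x, y in zip(xs, ys))
    see = math.sqrt(residual_ss / (n - 2))  # assumed n - 2 degrees of freedom
    return {"K1": k1, "K0": k0, "r": r, "R2": r * r, "SEE": see}

if __name__ == "__main__":
    for name, value in linear_regression(X, Y).items():
        print(f"{name} = {value:.4f}")
```

For these data the fit gives a slope near 1.93, an intercept near 1.52, and a correlation coefficient near 0.9989.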
Varying parameters such as the level of confidence, the number of samples (n) in the calibration set, the standard error of estimate (SEE), and the standard deviation of the range of data (Sr) will have dramatic effects on the meaning or interpretation of “goodness of fit” statistics such as the coefficient of determination and correlation. This series of articles provides several sets of tools useful for evaluating all of the aforementioned statistics at user-selected confidence levels. The general statistical tools to be described are
1) A graphical comparison of the correlation coefficient (r), the coefficient of determination $R^{2}$, with the standard deviation of the calibration sample analyte values (Sr) as compared to the standard error of estimate (SEE). Note: Sr is a MathCad program symbol.
2) A graphical comparison of the correlation coefficient (r) and the standard error of estimate (SEE) for a calibration model.
3) A Worksheet for computing the confidence limits for the correlation coefficient at user-selected confidence levels.
4) A method and Worksheet for comparing correlation coefficients for different size populations at user-selected confidence levels.
5) A method and Worksheet for computing the confidence limits for the slope and intercept of a data set at user-selected confidence levels.

REFERENCE 1. Miller, J.C. and Miller, J.N., Statistics for Analytical Chemistry, 2nd ed. (Ellis Horwood, New York, 1992).

59 Comparison of Goodness of Fit Statistics for Linear Regression: Part 2 – The Correlation Coefficient

This chapter is a continuation of Chapter 58 describing the use of goodness of fit statistical parameters [1]. When developing a calibration for quantitative analysis one must select the analyte range over which the calibration is performed. For a given standard error of analysis the size of the range will have a direct effect on the magnitude of the correlation coefficient. The standard deviation of Y also has a direct effect. This is evident from the computation of the correlation between X and Y, denoted in matrix notation as

$$r = \frac{\mathrm{covar}(X, Y)}{\mathrm{stdev}(X) \cdot \mathrm{stdev}(Y)} \tag{59-2}$$

Note for this example that covar(X, Y) represents the covariance of (X, Y), stdev(X) is the standard deviation of the X data, and stdev(Y) is the standard deviation of the Y data. For the MathCad program (© 1986-2001 MathSoft Engineering & Education, Inc., 101 Main Street, Cambridge, MA 02142-1521), stdev(X) is represented by the variable symbol Sr, which can be thought of as the set of many possible standard deviations for a set of data X. Thus a comparison of the correlation coefficient between two or more sets of X, Y data pairs cannot be adequately performed unless the standard deviations of the two data sets are nearly identical or unless the correlation coefficient confidence limits for the data sets are compared. In summary, if one Set A of X, Y paired data has a correlation of 0.95, this does not necessarily indicate that it is more highly correlated than a second Set B of X, Y paired data with a correlation of, say, 0.90. The meaning of this will be described in greater detail later.

Let us look at seven slightly different equations (r1 through r7, or Equations 59-7 through 59-13) for calculating correlation between X (known concentration or analyte data for a set of standards) and Y (instrument measured data for those standards) using MathCad function or summation notation nomenclature. First we must define the calculation of the standard error of performance, also termed the standard error of prediction (SEP), and the calculations for the slope (K1) and the intercept (K0) for the linear regression line between X and Y. The regression line for estimating the concentration, denoted by PredX or $\hat{X}$, is given as:

$$\mathrm{PredX} = \hat{X} = K_1 Y + K_0 \tag{59-3}$$


The standard error of performance, also termed the “standard error of prediction” (SEP), which represents an estimate of the prediction error (1 sigma) for a regression line, is given as:

$$\mathrm{SEP} = \sqrt{\frac{\sum (\hat{X} - X)^2}{n}} \tag{59-4}$$

The slope (K1) and the intercept (K0) for this regression line are given as:

$$K_1 = \frac{n \sum (Y \cdot X) - \sum Y \cdot \sum X}{n \sum Y^2 - \left(\sum Y\right)^2} \tag{59-5}$$

$$K_0 = \frac{\sum Y^2 \cdot \sum X - \sum Y \cdot \sum (Y \cdot X)}{n \sum Y^2 - \left(\sum Y\right)^2} \tag{59-6}$$

The seven ways (r1 through r7) for calculating correlation as the square root of the ratio of the explained variation over the total variation between X (concentration of analyte data) and Y (measured data) are described using many notational forms. For example, many software packages provide built-in functions capable of calculating the coefficient of correlation directly from a pair of X and Y vectors, as given by r1 (Equation 59-7).

$$r_1 = \mathrm{corr}(X, Y) \tag{59-7}$$

[This is the built-in MathCad correlation function.]

Several software packages contain simple command lines for performing matrix computations directly and thus are conveniently capable of computing the correlation coefficient, for example as in r2 (Equation 59-8).

$$r_2 = \frac{\mathrm{covar}(X, Y)}{\mathrm{stdev}(X) \cdot \mathrm{stdev}(Y)} \tag{59-8}$$

[Equation 59-8 denotes the ratio of the covariance of X on Y to the standard deviation of X times the standard deviation of Y.]

If the software is capable of using summation notation, such as in the standard capabilities of MathCad, then one may use this algebraic form for calculating the correlation as in r3 and r4 (Equations 59-9 and 59-10, respectively).

$$r_3 = \sqrt{\frac{\sum (\hat{X} - \bar{X})^2}{\sum (X - \bar{X})^2}} \tag{59-9}$$


[Equation 59-9 is the square root of the ratio of the sum of the squared differences between each predicted X and the mean of all X, to the sum of the squared differences between all individual X values and the mean of all X.]

$$r_4 = \sqrt{1 - \frac{\sum (\hat{X} - X)^2}{\sum (X - \bar{X})^2}} \tag{59-10}$$

[Equation 59-10 denotes the square root of one minus the ratio of the sum of the squared differences between each predicted X and its corresponding X, to the sum of the squared differences between all individual X values and the mean of all X.]

And if the software allows you to assign variable names as needed for specific computations, such as SEP or standard deviations, then you may proceed to use computational descriptions such as r5 and r6 (Equations 59-11 and 59-12, respectively) to compute the correlation.

$$r_5 = \sqrt{1 - \frac{\mathrm{SEP}^2}{\mathrm{stdev}(X)^2}} \tag{59-11}$$

[Equation 59-11 indicates that the correlation coefficient is represented by the square root of one minus the ratio of the square of the standard error of performance to the square of the standard deviation of all X.]

$$r_6 = \sqrt{1 - \left(\frac{\mathrm{SEP}}{\mathrm{stdev}(X)}\right)^2} \tag{59-12}$$

[Equation 59-12, of course, is simply the algebraic equivalent of Equation 59-11.]

Another computational method for correlation is given in Miller and Miller (reference [2], p. 105) as r7, shown in Equation 59-13.

$$r_7 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\left\{\left[\sum_i (x_i - \bar{x})^2\right]\left[\sum_i (y_i - \bar{y})^2\right]\right\}^{1/2}} \tag{59-13}$$
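As a hedged illustration of Equations 59-4 through 59-13, the following Python sketch evaluates r2 through r7 for the Miller and Miller example data as listed in the worksheet supplement. It is not the authors' MathCad Worksheet: the helper names (`mean`, `pstdev`, `pcovar`) are introduced here for illustration, the standard deviation and covariance are assumed to be the population (divide by n) forms so that Equations 59-11 and 59-12 agree with SEP as defined in Equation 59-4, and r1 is omitted because it is a built-in function of the software in use.

```python
import math

# Example data as listed in the worksheet supplement (Miller and Miller, p. 106):
# X = known concentration, Y = instrument response.
X = [0, 2, 4, 6, 8, 10, 12]
Y = [2.1, 5.0, 9.0, 12.6, 17.3, 21.0, 24.7]
n = len(X)

def mean(v):
    return sum(v) / len(v)

def pstdev(v):
    # Population standard deviation (divide by n); an assumption of this sketch.
    m = mean(v)
    return math.sqrt(sum((a - m) ** 2 for a in v) / len(v))

def pcovar(a, b):
    # Population covariance (divide by n); an assumption of this sketch.
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)

# Regression of X on Y (Equations 59-5 and 59-6), predictions (59-3), SEP (59-4).
sum_x, sum_y = sum(X), sum(Y)
sum_yy = sum(y * y for y in Y)
sum_yx = sum(y * x for x, y in zip(X, Y))
denom = n * sum_yy - sum_y ** 2
k1 = (n * sum_yx - sum_y * sum_x) / denom
k0 = (sum_yy * sum_x - sum_y * sum_yx) / denom
pred_x = [k1 * y + k0 for y in Y]
ss_res = sum((p - x) ** 2 for p, x in zip(pred_x, X))
sep = math.sqrt(ss_res / n)

mx, my = mean(X), mean(Y)
ss_x = sum((x - mx) ** 2 for x in X)
ss_y = sum((y - my) ** 2 for y in Y)

results = {
    "r2": pcovar(X, Y) / (pstdev(X) * pstdev(Y)),                    # Eq. 59-8
    "r3": math.sqrt(sum((p - mx) ** 2 for p in pred_x) / ss_x),      # Eq. 59-9
    "r4": math.sqrt(1 - ss_res / ss_x),                              # Eq. 59-10
    "r5": math.sqrt(1 - sep ** 2 / pstdev(X) ** 2),                  # Eq. 59-11
    "r6": math.sqrt(1 - (sep / pstdev(X)) ** 2),                     # Eq. 59-12
    "r7": sum((x - mx) * (y - my) for x, y in zip(X, Y))
          / math.sqrt(ss_x * ss_y),                                  # Eq. 59-13
}

for name, value in results.items():
    print(name, round(value, 10))
```

All six forms should agree to within floating-point round-off, which is the point the text makes about comparing computational recipes rather than names.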

You may be surprised that for our example data from Miller and Miller ([2], p. 106), the correlation coefficient calculated using any of these methods of computation for the r-value is 0.99887956534852. When we evaluate the correlation computation we see that, given a relatively equivalent prediction error, represented as $(\hat{X} - X)$, $\sum(\hat{X} - X)$, or SEP, the standard deviation of the data set (X) determines the magnitude of the correlation coefficient. This is illustrated using Graphics 59-1a and 59-1b. These graphics allow the correlation coefficient to be displayed for any specified standard error of prediction, also occasionally denoted as the standard error of estimate (SEE). It should be obvious that for any statistical study one must compare the actual computational recipes used to make a calculation, rather than rely on the more or less non-standard terminology and assume that the computations are what one expected.

Graphic 59-1a r versus Sr of data range. [Plot: correlation coefficient r(Sr), 0 to 1 (ordinate), versus Sr, standard deviation of range, 0 to 4 (abscissa).]

For a graphical comparison of the correlation [r(Sr)] and the standard deviation of the samples used for calibration (Sr), a value is entered for the SEP (or SEE) for a specified analyte range as indicated through the standard deviation of that range (Sr). The resultant graphic displays the Sr (as the abscissa) versus the r (as the ordinate). From this graphic it can be seen how the correlation coefficient increases with a constant SEP as the standard deviation of the data increases. Thus when comparing correlation results for analytical methods, one must consider carefully the standard deviation of the analyte values for the samples used in order to make a fair comparison. For the example shown, the SEE is set to 0.10, while the correlation is scaled from 0.0 to 1.0 for Sr values from 0.10 to 4.0.

Graphic 59-1b r versus Sr of data range. [Plot: correlation coefficient r(Sr), 0.99 to 1 (ordinate), versus Sr, standard deviation of range, 0 to 4 (abscissa).]

This figure demonstrates the correlation range above 0.99 for the figure in Graphic 59-1a. Note that the correlation begins to flatten when the Sr is over an order of magnitude times the SEE.

Graphic 59-1c r versus Sr of data range. [Plot: correlation coefficient r(Sr), 0.86 to 1 (ordinate), versus Sr, standard deviation of range, 0.2 to 0.6 (abscissa).]

Note from this figure (Graphic 59-1c) that at a certain value for the standard deviation of X (denoted as Sr), a small change in the Sr results in a large apparent change in the correlation. For example, in this case where the SEE is set to 0.10, the correlation changes from 0.86 to 0.95 when the Sr is changed only from 0.20 to 0.32. As is the general case, using correlation to compare analytical methods requires identical sample analyte standard deviations, or comparison of the confidence limits for the correlation coefficients, in order to interpret the significance of the different correlation values.
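The dependence of r on Sr at a fixed SEE can be checked numerically. The sketch below assumes the relationship r(Sr) = sqrt(1 − SEE²/Sr²), which is the form of Equation 59-12 with SEE in place of SEP; the function name `r_of_sr` is introduced here purely for illustration.

```python
import math

def r_of_sr(see, sr):
    """Correlation implied by a given SEE and standard deviation of range Sr.

    Assumes r = sqrt(1 - (SEE / Sr)^2), the form of Equation 59-12,
    and requires Sr >= SEE so the argument of the square root is non-negative.
    """
    if sr < see:
        raise ValueError("Sr must be at least as large as SEE")
    return math.sqrt(1.0 - (see / sr) ** 2)

SEE = 0.10
for sr in (0.20, 0.32, 1.0, 4.0):
    print(f"Sr = {sr:4.2f}  r = {r_of_sr(SEE, sr):.4f}")
```

With SEE = 0.10, the first two rows correspond to the 0.86 to 0.95 change described above, while the last two show how little r changes once Sr is many times the SEE.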

Graphic 59-2 R² versus Sr of data range. [Plot: coefficient of determination R²(Sr), 0 to 1 (ordinate), versus Sr, standard deviation of range, 0 to 4 (abscissa).]

For a graphical comparison of the coefficient of determination (R²) and the standard deviation of the calibration samples (Sr), a value is entered for the SEE for a specified range of Sr. The resultant graphic displays the Sr (abscissa) versus R² (ordinate). From this graph it can be seen how the coefficient of determination increases as the standard deviation of the data increases. The SEE is set at 0.10 as in the examples shown in Graphics 59-1a and 59-1b. Note that the same recommendation holds whether using r or R²: relative comparisons for this statistic should not be used unless the standard deviations of the comparative data sets are identical.

Graphic 59-3 r versus Sr/SEE. [Plot: correlation coefficient r(Sr), 0.96 to 1 (ordinate), versus R(Sr), ratio of Sr/SEE, 0 to 40 (abscissa).]

Graphic 59-3 shows the ratio of the range (Sr) to the SEE (abscissa) as compared to the correlation coefficient r (ordinate). From this graph it can be seen that the correlation coefficient continues to increase with the ratio Sr/SEE, even when the ratio approaches more than 60. Note, however, that when the ratio is greater than 10 there is not much improvement in the correlation.

Graphic 59-4 r versus SEE. [Plot: correlation coefficient r(SEE), 0 to 1 (ordinate), versus SEE, standard error of estimate, 0 to 4 (abscissa).]

A graphical comparison of the correlation coefficient (r) versus the standard error of estimate (SEE) is shown in Graphic 59-4. This graphic clearly shows that when the Sr is held constant (Sr = 4) the correlation decreases as the SEE increases.

Graphic 59-5 r versus SEE/Sr. [Plot: correlation coefficient r(SEE), 0 to 1 (ordinate), versus R(SEE), ratio of SEE/Sr, 0 to 1 (abscissa).]

This graphic shows the relationship between correlation and the ratio of SEE/Sr: as the SEE increases relative to the Sr, the correlation decreases rapidly.

REFERENCES

1. Workman, J. and Mark, H., “Chemometrics in Spectroscopy: Comparison of Goodness of Fit Statistics for Linear Regression – Part 1, Introduction,” Spectroscopy 19(4), 32–35 (2004).
2. Miller, J.C. and Miller, J.N., Statistics for Analytical Chemistry, 2nd ed. (Ellis Horwood, New York, 1992).


60 Comparison of Goodness of Fit Statistics for Linear Regression: Part 3 – Computing Confidence Limits for the Correlation Coefficient

In this chapter, a continuation of Chapters 58 and 59 [1, 2], the confidence limits for the correlation coefficient are calculated for a user-selected confidence level. The user selects the test correlation coefficient, the number of samples in the calibration set, and the confidence level. A MathCad Worksheet (© MathSoft Engineering & Education, Inc., 101 Main Street, Cambridge, MA 02142-1521) is used to calculate the z-statistic for the lower and upper limits and computes the appropriate correlation for the z-statistic. The upper and lower confidence limits are displayed. The Worksheet also contains the tabular calculations for any set of correlation coefficients (given as ρ). A graphic showing the general case entered for the table is also displayed.

For n pairs of values (X, Y) the set of pairs may be interpreted as a subset of the entire population of X and Y values throughout some larger population of samples. For example, X and Y may constitute all possible combinations of an instrument response (Y) and an analyte concentration (X) in a specific solvent matrix. The population correlation coefficient may be referred to by the Greek letter rho (ρ), which may be estimated using the correlation coefficient computed for a specific subset of values, designated as r. It is known that tests of significance can be performed on a measured r to determine if it is significantly different from another r calculated from a different subset of X, Y values. Any specific r calculated from a subset of X, Y values may also be compared to the estimated population correlation for all such possible samples, ρ. When a hypothesis test is used to determine whether ρ is statistically equal to zero, the distribution is approximated using the Student’s t distribution. When ρ is tested to be not equal to zero, the use of the Fisher transformation produces a statistic which is normally distributed.

This transformation is referred to as Fisher’s Z transformation (i.e., the Z-statistic). The z-statistic for testing a non-zero population correlation is given by Equation 60-14 as Z1, where e = 2.71828. A good discussion of this is found in reference [3].

$$Z_1 = 0.5 \cdot \log_e\left(\frac{1+r}{1-r}\right) \tag{60-14}$$

A more standard form (Equation 60-15) used for computational purposes is

$$Z_1 = 1.1513 \cdot \log_{10}\left(\frac{1+r}{1-r}\right) \tag{60-15}$$


The confidence limits for a correlation coefficient for a given number of X, Y pairs (n) at a specified confidence level are calculated as Z2 (Equation 60-16).

$$Z_2 = 1.1513 \cdot \log_{10}\left(\frac{1+r}{1-r}\right) \pm z \cdot \left(\frac{1}{\sqrt{n-3}}\right) \tag{60-16}$$

Note that the z-statistic is computed as z or is available from standard statistical tables as the Student’s t distribution, such that confidence levels of 0.90, 0.95, 0.98, and 0.99 correspond to t0.050, t0.025, t0.010, and t0.005, respectively. At infinite n of X, Y pairs the corresponding z-values are 1.645, 1.960, 2.326, and 2.576. For a specific example problem, we may calculate the confidence limits for r = 0.80 and n = 21 at a 95% confidence level [3]. Then Z2 for this problem is given by Equation 60-17.

$$Z_2 = 1.1513 \cdot \log_{10}\left(\frac{1+0.80}{1-0.80}\right) \pm 1.96 \cdot \left(\frac{1}{\sqrt{21-3}}\right) = 0.6366 \text{ to } 1.5606 \tag{60-17}$$

Then it follows that solving for ρ using 0.6366 and 1.5606 substituted individually into the ZLL and ZUL equations below (Equations 60-18 and 60-19), we calculate approximately 0.56 and 0.92 as the lower and the upper confidence limits, respectively, for the correlation coefficient of 0.80 and n = 21 (see also Graphics 60-6a and 60-6b).

Lower Limit:

$$Z_{LL} = 0.6366 = 1.1513 \cdot \log_{10}\left(\frac{1+\rho_{LL}}{1-\rho_{LL}}\right) \Rightarrow \rho_{LL} = 0.5626 \tag{60-18}$$

Upper Limit:

$$Z_{UL} = 1.5606 = 1.1513 \cdot \log_{10}\left(\frac{1+\rho_{UL}}{1-\rho_{UL}}\right) \Rightarrow \rho_{UL} = 0.9155 \tag{60-19}$$

A graphic or tabular data display can be generated for any z-statistic value given a population correlation coefficient, ρ. This is accomplished by using the Fisher’s Z transformation (i.e., the Z-statistic) computation as Equation 60-20.

$$Z(\rho) = 1.1513 \cdot \log_{10}\left(\frac{1+\rho}{1-\rho}\right) \tag{60-20}$$

In summary, for any stated value of the population correlation (ρ), the z statistic is denoted as Z(ρ), and the corresponding correlation confidence limits can be determined. For our example, the Z statistic of 0.6366 corresponding to the lower correlation coefficient confidence limit is shown in the graphic below (Graphic 60-6a) as having a value of 0.562575; this represents the lower confidence limit for the correlation coefficient for this example.

Graphic 60-6a The z statistic is denoted as Z(ρ), and the corresponding correlation confidence lower limit (ρ) can be graphically displayed for our example. [Plot: z-statistic Z(ρ), 0.63656 to 0.63663 (ordinate), versus ρ, correlation coefficient, 0.56255 to 0.5626 (abscissa).]

Likewise for this example, the Z statistic of 1.5606 corresponding to the upper correlation coefficient confidence limit is shown in the graphic below (Graphic 60-6b) as having a value of 0.91551; this represents the upper confidence limit for the 0.80 correlation example problem. Finally then, for the example problem the correlation confidence limits are from 0.562575 to 0.91551 (i.e., 0.56 to 0.92).

Graphic 60-6b The z statistic is denoted as Z(ρ), and the corresponding correlation confidence upper limit (ρ) can be graphically displayed for our example. [Plot: z-statistic Z(ρ), 1.5601 to 1.5611 (ordinate), versus ρ, correlation coefficient, 0.91543 to 0.9156 (abscissa).]
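The confidence-limit calculation of Equations 60-16 through 60-19 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' Worksheet: the function name `correlation_confidence_limits` is hypothetical, the critical z-value is taken from the standard normal distribution (the infinite-n case described in the text), and the inverse of the Fisher transformation is evaluated in closed form as tanh(Z) instead of being read from a graph or table.

```python
import math
from statistics import NormalDist

def fisher_z(r):
    # Equation 60-15; algebraically the same as 0.5 * ln((1 + r) / (1 - r)).
    return 1.1513 * math.log10((1 + r) / (1 - r))

def correlation_confidence_limits(r, n, confidence=0.95):
    """Return (lower, upper) confidence limits for a correlation coefficient.

    Assumes the normal-curve critical value (about 1.96 at 0.95) and n > 3.
    """
    z_crit = NormalDist().inv_cdf((1 + confidence) / 2)
    half_width = z_crit / math.sqrt(n - 3)
    z_r = fisher_z(r)
    z_lower, z_upper = z_r - half_width, z_r + half_width
    # tanh is the exact inverse of Z = 0.5 * ln((1 + rho) / (1 - rho)).
    return math.tanh(z_lower), math.tanh(z_upper)

lower, upper = correlation_confidence_limits(0.80, 21, 0.95)
print(f"lower = {lower:.4f}, upper = {upper:.4f}")
```

For r = 0.80 and n = 21 the two limits should round to 0.56 and 0.92, in line with the worked example.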


TESTING CORRELATION FOR DIFFERENT SIZE POPULATIONS

The following description and corresponding MathCad Worksheet allow the user to test whether two correlation coefficients are significantly different based on the number of sample pairs (N) used to compute each correlation. For the Worksheet, the user enters the confidence level for the test (e.g., 0.95), two comparative correlation coefficients, r1 and r2, and the respective numbers of paired (X, Y) samples, N1 and N2. The z statistic corresponding to the entered confidence level is computed and the hypothesis test is performed. A Test result of 0 indicates a significant difference between the correlation coefficients; a Test result of 1 indicates no significant difference in the correlation coefficients at the selected confidence level.

Again we will use a standard example [3] where r1 is 0.5 with n1 of 28, and r2 is 0.3 with n2 of 35. The typical confidence level is 0.95 and the z-value statistic for this level is 1.96. Note here again that the z-statistic is computed as z or is available from standard statistical tables as the Student’s t distribution, such that confidence levels of 0.90, 0.95, 0.98, and 0.99 correspond to t0.050, t0.025, t0.010, and t0.005, respectively. At infinite n (i.e., greater than 120) of X, Y pairs the corresponding z-values are 1.645, 1.960, 2.326, and 2.576. The test statistic for this problem is given as Equation 60-21.

$$Z_n = \frac{1.1513 \cdot \log_{10}\left(\frac{1+r_1}{1-r_1}\right) - 1.1513 \cdot \log_{10}\left(\frac{1+r_2}{1-r_2}\right)}{\sqrt{\frac{1}{n_1-3} + \frac{1}{n_2-3}}} \tag{60-21}$$

The null hypothesis for this problem is that the two correlation coefficients r1 and r2 are statistically the same (i.e., r1 = r2). The alternative hypothesis is then r1 ≠ r2. If the absolute value of the test statistic Zn is greater than the absolute value of the z-statistic, then the null hypothesis is rejected and the alternative hypothesis accepted: there is a significant difference between r1 and r2. If the absolute value of Zn is less than the z-statistic, then the null hypothesis is accepted and the alternative hypothesis is rejected; thus there is not a significant difference between r1 and r2. Let us look at a standard example again (Equation 60-22).

$$Z_n = \frac{1.1513 \cdot \log_{10}\left(\frac{1+0.5}{1-0.5}\right) - 1.1513 \cdot \log_{10}\left(\frac{1+0.3}{1-0.3}\right)}{\sqrt{\frac{1}{28-3} + \frac{1}{35-3}}} \tag{60-22}$$

Here Zn = 0.89833; the test statistic is therefore less than 1.96, the z-statistic, and the null hypothesis is accepted: there is not a significant difference between the correlation coefficients. In a second example, which may be more typical, let us see what happens when r1 is 0.87 and r2 is 0.96, with n1 as 20 and n2 as 25. At a confidence level of 0.95, we use the above equations for Zn and find that there is not a significant difference (|Zn| = 1.8978, which is less than 1.96). The use of this statistical test emphasizes the point that comparison of correlation coefficients for small numbers of sample pairs is definitely “risky” business when confidence limits and statistical hypothesis testing are not used. In our experience we have seen analytical techniques and methods accepted or rejected by large research organizations using the “correlation eye-balling” test, where the method is accepted or rejected solely on a relative comparison of correlation coefficients, without the benefit of computing the confidence limits! This is a somewhat common, but easily preventable, mistake.
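The two worked examples can be reproduced with a short Python sketch of Equation 60-21. The function name `compare_correlations` and its return convention are assumptions made for illustration; the Test flag follows the convention stated in the text (0 = significant difference, 1 = no significant difference), and the critical value is the normal-curve z for the chosen confidence level.

```python
import math
from statistics import NormalDist

def fisher_z(r):
    # Fisher's Z transformation in the computational form of Equation 60-15.
    return 1.1513 * math.log10((1 + r) / (1 - r))

def compare_correlations(r1, n1, r2, n2, confidence=0.95):
    """Return (Zn, z_critical, test) for two correlation coefficients.

    test = 1 means no significant difference; test = 0 means the null
    hypothesis r1 = r2 is rejected at the selected confidence level.
    """
    z_n = (fisher_z(r1) - fisher_z(r2)) / math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z_crit = NormalDist().inv_cdf((1 + confidence) / 2)
    test = 1 if abs(z_n) < abs(z_crit) else 0
    return z_n, z_crit, test

for args in ((0.5, 28, 0.3, 35), (0.87, 20, 0.96, 25)):
    z_n, z_crit, test = compare_correlations(*args)
    print(args, f"Zn = {z_n:.4f}, z = {z_crit:.3f}, Test = {test}")
```

Note that the sign of Zn depends only on the order in which the two coefficients are entered, which is why the comparison is made on absolute values.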

REFERENCES

1. Workman, J. and Mark, H., “Chemometrics in Spectroscopy: Comparison of Goodness of Fit Statistics for Linear Regression – Part 1, Introduction,” Spectroscopy 19(4), 32–35 (2004).
2. Workman, J. and Mark, H., “Chemometrics in Spectroscopy: Comparison of Goodness of Fit Statistics for Linear Regression – Part 2, The Correlation Coefficient,” Spectroscopy 19(6), 29–33 (2004).
3. Spiegel, M.R., Statistics (McGraw-Hill Book Company, New York, 1961).


61 Comparison of Goodness of Fit Statistics for Linear Regression: Part 4 – Confidence Limits for Slope and Intercept

For this chapter we continue to describe the use of confidence limits for comparison of X, Y data pairs. This subject has been addressed in Chapters 58–60, first published as a set of articles in Spectroscopy [1–3]. A MathCad Worksheet (© 1986-2001 MathSoft Engineering & Education, Inc., 101 Main Street, Cambridge, MA 02142-1521) provides the computations for interested readers. This will be covered in a subsequent chapter or can be obtained in MathCad format by contacting the authors with your e-mail address. The Worksheet allows the direct calculation of the t-statistic by entering the desired confidence levels. In addition the confidence limits for the calculated slope and intercept are computed from the original data table. The lower limits for the slope and the intercept are displayed using two different sets of equations (and are identical). The intercept confidence limits are also calculated and displayed.

For calculations of slope and intercept two sets of equations will be shown, one as a summation notation set useful for application in MathCad software, and a second set as shown in reference [4], pp. 100–111. For these formulas, X represents the concentration and Y represents the instrument response. This is to demonstrate that the two computational formula sets yield precisely the same answer. To begin, the following summation notation may be used to calculate the slope (k1) of a linear regression line given a set of X, Y paired data (Equation 61-23).

$$k_1 = \frac{n \sum (X \cdot Y) - \sum X \cdot \sum Y}{n \sum X^2 - \left(\sum X\right)^2} \tag{61-23}$$

The summation notation formula for calculating the intercept (k0) of a linear regression line given a set of X, Y paired data is given as Equation 61-24.

$$k_0 = \frac{\sum X^2 \cdot \sum Y - \sum X \cdot \sum (X \cdot Y)}{n \sum X^2 - \left(\sum X\right)^2} \tag{61-24}$$

In reference [4], p. 109, Miller and Miller use the following for the slope (b) calculation (Equation 61-25).

$$b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} \tag{61-25}$$

The intercept (a) is given by the same authors [4] as Equation 61-26.

$$a = \bar{y} - b\bar{x} \tag{61-26}$$
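As a quick numerical check that the two formula sets agree, the sketch below evaluates Equations 61-23 through 61-26 in Python for the example data listed in the worksheet supplement. The variable names are chosen here for illustration and are not taken from the MathCad Worksheet.

```python
# Example data as listed in the worksheet supplement (Miller and Miller, p. 106):
# X = concentration, Y = instrument response.
X = [0, 2, 4, 6, 8, 10, 12]
Y = [2.1, 5.0, 9.0, 12.6, 17.3, 21.0, 24.7]
n = len(X)

# Summation-notation form (Equations 61-23 and 61-24).
sum_x, sum_y = sum(X), sum(Y)
sum_xx = sum(x * x for x in X)
sum_xy = sum(x * y for x, y in zip(X, Y))
denom = n * sum_xx - sum_x ** 2
k1 = (n * sum_xy - sum_x * sum_y) / denom
k0 = (sum_xx * sum_y - sum_x * sum_xy) / denom

# Mean-centered form (Equations 61-25 and 61-26).
mean_x, mean_y = sum_x / n, sum_y / n
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
     / sum((x - mean_x) ** 2 for x in X))
a = mean_y - b * mean_x

print(f"k1 = {k1:.14f}, b = {b:.14f}")
print(f"k0 = {k0:.14f}, a = {a:.14f}")
```

Any difference between the two sets of results should appear only in the last one or two digits, reflecting floating-point round-off rather than a difference in the formulas.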

The reader may be surprised to learn that for the selected data the slope using either method computes to a value of 1.93035714285714, while the intercept computes to 1.51785714285715 by the summation notation method versus 1.51785714285714 by the Miller and Miller cited method (this difference, however, is the probable result of computational round-off error). The confidence limits for the slope and intercept may be calculated using the Student’s t statistic, noting Equations 61-27 through 61-30 below. The slope (k1) confidence limits are computed in summation notation as shown in Equation 61-27.

$$\mathrm{Limits} = k_1 \pm \left(\frac{t}{\sqrt{n-2}} \cdot \sqrt{\frac{\sum (Y - \hat{Y})^2}{\sum (X - \bar{X})^2}}\right) \tag{61-27}$$

Miller and Miller, pp. 110 and 111 in reference [4], cite the following equations for calculation of the slope (b) confidence limits.

$$s_{y/x} = \left\{\frac{\sum_i (y_i - \hat{y}_i)^2}{n-2}\right\}^{1/2} \tag{61-28}$$

$$s_b = \frac{s_{y/x}}{\left\{\sum_i (x_i - \bar{x})^2\right\}^{1/2}} \tag{61-29}$$

$$\mathrm{Limits} = b \pm t \cdot s_b \tag{61-30}$$

As the reader may suspect by now, these methods of computation yield precisely the same answer: LL = 1.82521966597124 and UL = 2.03549461974305. The intercept (k0) confidence limits are computed as Equation 61-31.

$$\mathrm{Limits} = k_0 \pm t \cdot \sqrt{\frac{\sum (Y - \hat{Y})^2 \cdot \sum X^2}{(n-2) \cdot n \cdot \sum (X - \bar{X})^2}} \tag{61-31}$$

Miller and Miller, pp. 111 and 112 in reference [4], cite the following equations for calculation of the intercept (a) confidence limits.

$$s_{y/x} = \left\{\frac{\sum_i (y_i - \hat{y}_i)^2}{n-2}\right\}^{1/2} \tag{61-32}$$

$$s_a = s_{y/x} \left\{\frac{\sum_i x_i^2}{n \sum_i (x_i - \bar{x})^2}\right\}^{1/2} \tag{61-33}$$

$$\mathrm{Limits} = a \pm t \cdot s_a \tag{61-34}$$

Again the methods of computation shown yield precisely the same values: LL = 0.759700015087087 and UL = 2.27601427062721. We will be discussing a more detailed interpretation of the slope and intercept confidence limits in later chapters. However, the reader will note that the regression line for any X, Y paired data rotates about the point defined by the mean of X and the mean of Y. Thus the farther a data point along the line lies from the mean of X and Y, the less the overall confidence in the relative position of the line. A more detailed description of the confidence limits surrounding any regression line using the F-distribution will be discussed later.
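The slope and intercept confidence limits can be reproduced with the following Python sketch of Equations 61-27 through 61-34. It is an illustration under stated assumptions: the t value of 2.5706 is the value reported in the worksheet for a 0.95 confidence level (it corresponds to n − 2 = 5 degrees of freedom) and is entered as a constant rather than computed, and all variable names are introduced here rather than taken from the Worksheet.

```python
import math

# Example data as listed in the worksheet supplement (Miller and Miller, p. 106).
X = [0, 2, 4, 6, 8, 10, 12]
Y = [2.1, 5.0, 9.0, 12.6, 17.3, 21.0, 24.7]
n = len(X)
T_VALUE = 2.5706  # two-sided 95% Student's t, n - 2 = 5 df (value from the worksheet)

mean_x, mean_y = sum(X) / n, sum(Y) / n
ss_x = sum((x - mean_x) ** 2 for x in X)
k1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / ss_x
k0 = mean_y - k1 * mean_x

# Residual sum of squares about the fitted line Y_hat = k1 * X + k0.
ss_res = sum((y - (k1 * x + k0)) ** 2 for x, y in zip(X, Y))

# Miller and Miller forms (Equations 61-28, 61-29, 61-33).
s_yx = math.sqrt(ss_res / (n - 2))
s_b = s_yx / math.sqrt(ss_x)
s_a = s_yx * math.sqrt(sum(x * x for x in X) / (n * ss_x))

slope_limits = (k1 - T_VALUE * s_b, k1 + T_VALUE * s_b)        # Eq. 61-30
intercept_limits = (k0 - T_VALUE * s_a, k0 + T_VALUE * s_a)    # Eq. 61-34

# Summation-notation half-widths (Equations 61-27 and 61-31) for comparison.
slope_half_alt = T_VALUE / math.sqrt(n - 2) * math.sqrt(ss_res / ss_x)
intercept_half_alt = T_VALUE * math.sqrt(
    ss_res * sum(x * x for x in X) / ((n - 2) * n * ss_x))

print("slope limits:    ", slope_limits)
print("intercept limits:", intercept_limits)
```

Because the t value is rounded to five significant figures, the printed limits should match the values quoted in the text to about five decimal places rather than to all fifteen digits.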

REFERENCES

1. Workman, J. and Mark, H., “Chemometrics in Spectroscopy: Comparison of Goodness of Fit Statistics for Linear Regression – Part 1,” Spectroscopy 19(4), 32–35 (2004).
2. Workman, J. and Mark, H., “Chemometrics in Spectroscopy: Comparison of Goodness of Fit Statistics for Linear Regression – Part 2, The Correlation Coefficient,” Spectroscopy 19(6), 29–33 (2004).
3. Workman, J. and Mark, H., “Chemometrics in Spectroscopy: Comparison of Goodness of Fit Statistics for Linear Regression – Part 3, Computing Confidence Limits for the Correlation Coefficient,” Spectroscopy 19(7), 31–33 (2004).
4. Miller, J.C. and Miller, J.N., Statistics for Analytical Chemistry, 2nd ed. (Ellis Horwood, New York, 1992).

Supplement

MathCad Worksheets for Correlation, Slope and Intercept

The attached worksheet from MathCad (© 1986–2001 MathSoft Engineering & Education, Inc., 101 Main Street, Cambridge, MA 02142–1521) is used for computing the statistical parameters and graphics discussed in Chapters 58 through 61, in references [b-1–b-4]. It is recommended that the statistics incorporated into this series of Worksheets be used for evaluations of goodness of fit statistics such as the correlation coefficient, the coefficient of determination, the standard error of estimate, and the useful range of calibration standards used in method development. If you would like this Worksheet sent to you, please request this by e-mail from the authors.

R-Squared Study (Y on X)

An Example of Y on X Regression: Y = k1·X + k0

Miller & Miller Data (page 106)

  Index    Y       X
  0        2.1     0
  1        5       2
  2        9       4
  3        12.6    6
  4        17.3    8
  5        21      10
  6        24.7    12

n := rows(X)

Correlation: cvar(X, Y) / (stdev(X) · stdev(Y)) = 0.99888

Methods for computing the Correlation Coefficient (r): n = rows(X)

Slope:

$$k1x = \frac{n \sum (Y \cdot X) - \sum Y \cdot \sum X}{n \sum Y^2 - \left(\sum Y\right)^2}$$

Intercept:

$$k0x = \frac{\sum Y^2 \cdot \sum X - \sum Y \cdot \sum (Y \cdot X)}{n \sum Y^2 - \left(\sum Y\right)^2}$$

Predicted Values for X: PredX = k1x · Y + k0x

SEP:

$$\mathrm{SEP} = \sqrt{\frac{\sum (\mathrm{PredX} - X)^2}{n}}$$

Correlation v1: r1X = corr(X, Y); r1X = 0.99887956534852

Correlation v2: r2X = cvar(X, Y) / (stdev(X) · stdev(Y)); r2X = 0.99887956534852

Correlation v3:

$$r3X = \sqrt{\frac{\sum (\mathrm{PredX} - \mathrm{mean}(X))^2}{\sum (X - \mathrm{mean}(X))^2}}$$

r3X = 0.99887956534852

Correlation v4:

$$r4X = \sqrt{1 - \frac{\sum (\mathrm{PredX} - X)^2}{\sum (X - \mathrm{mean}(X))^2}}$$

r4X = 0.99887956534852

Correlation v5:

$$r5X = \sqrt{1 - \frac{\mathrm{SEP}^2}{\mathrm{stdev}(X)^2}}$$

r5X = 0.99887956534852

Correlation v6:

$$r6X = \sqrt{1 - \left(\frac{\mathrm{SEP}}{\mathrm{stdev}(X)}\right)^2}$$

r6X = 0.99887956534852

Correlation v7:

$$r7X = \frac{\sum (X - \mathrm{mean}(X)) \cdot (Y - \mathrm{mean}(Y))}{\sqrt{\sum (X - \mathrm{mean}(X))^2 \cdot \sum (Y - \mathrm{mean}(Y))^2}}$$

r7X = 0.99887956534852

Comparison of Correlation Coefficient (r) and the Standard Deviation of Calibration Data:

Enter Data: SEE = 0.1; Sr = 0.1, 0.2 .. stdev(X) (Data Manually Entered)

CALCULATIONS: stdev(X) = 4; SEE² = 0.01

$$r(Sr) = \sqrt{1 - \frac{\mathrm{SEE}^2}{Sr^2}}$$

Graphic 1A: r versus Sr of data range. [Plot: correlation coefficient r(Sr), 0 to 1, versus Sr, standard deviation of range, 0 to 4.]

Graphic 1B: r versus Sr of data range. [Plot: correlation coefficient r(Sr), 0.99 to 1, versus Sr, standard deviation of range, 0 to 4.]

Graphic 1C: r versus Sr of data range. [Plot: correlation coefficient r(Sr), 0.86 to 1, versus Sr, standard deviation of range, 0.2 to 0.6.]

$$R2(Sr) = 1 - \frac{\mathrm{SEE}^2}{Sr^2}$$

Graphic 2: R² versus Sr of data range. [Plot: coefficient of determination R2(Sr), 0 to 1, versus Sr, standard deviation of range, 0 to 4.]

$$R(Sr) = \frac{Sr}{\mathrm{SEE}}$$

Graphic 3: r versus Sr/SEE. [Plot: correlation coefficient r(Sr), 0.96 to 1, versus R(Sr), ratio of Sr/SEE, 0 to 40.]

Comparison of Correlation Coefficient (r) and SEE:

Enter Data: Sr = stdev(X); SEE = 0.1, 0.2 .. Sr (Data Manually Entered)

CALCULATIONS: Sr = 4

$$r(\mathrm{SEE}) = \sqrt{1 - \frac{\mathrm{SEE}^2}{Sr^2}}$$

Graphic 4: r versus SEE. [Plot: correlation coefficient r(SEE), 0 to 1, versus SEE, standard error of estimate, 0 to 4.]

$$R(Sr) = \frac{Sr}{\mathrm{SEE}}$$

Graphic 5: r versus SEE/Sr. [Plot: correlation coefficient r(SEE), 0 to 1, versus R(SEE), ratio of SEE/Sr, 0 to 1.]

Computing Confidence Limits for Correlation Coefficient (at selected confidence limits)

Enter Data: ρ = 0.80; n = 21 (Minimum n = 5)

Enter Confidence level as α2: α2 = 0.95

CALCULATIONS:

Calculate z-table value: α1 = (α2 + 1)/2; z = qt(α1, 100000)

z-value: z = 1.96

$$Zn = 1.1513 \cdot \log\left(\frac{1+\rho}{1-\rho}\right) - z \cdot \frac{1}{\sqrt{n-3}} \qquad Zp = 1.1513 \cdot \log\left(\frac{1+\rho}{1-\rho}\right) + z \cdot \frac{1}{\sqrt{n-3}}$$

Zn = 0.6366; Zp = 1.5606

Table of Exact Values for ρ given Z(ρ), as Zp and Zn, at Specified Confidence Limit:

$$Z(\rho) = 1.1513 \cdot \log\left(\frac{1+\rho}{1-\rho}\right)$$

ρ = 0.00001, 0.00002 .. 2.50000

Graphic 6a. [Plot: z-statistic Z(ρ), 0.63656 to 0.63663, versus ρ, correlation coefficient, 0.56255 to 0.5626.]

Graphic 6b. [Plot: z-statistic Z(ρ), 1.5601 to 1.5611, versus ρ, correlation coefficient, 0.91543 to 0.9156.]

Correlation coefficient confidence limits estimates for selected confidence level are:

a = 0.77261189 · $Zn^{0.710540889}$
b = 0.76468768 · $Zn^{0.441013741}$
c = 0.864765533 · $Zn^{0.137899811}$
d = 0.772611892 · $Zp^{0.710540889}$
e = 0.76468768 · $Zp^{0.441013741}$
f = 0.86476533 · $Zp^{0.137899811}$

LL = a if 0.50 ≤ |Zn| < 1; b if 1 ≤ |Zn| < 1.5; c if 1.5 ≤ |Zn| ≤ 2.9; 1.000 if |c| ≥ 1

UL = d if 0.50 ≤ |Zp| < 1; e if 1 ≤ |Zp| < 1.5; f if 1.5 ≤ |Zp| ≤ 2.9; 1.000 if |f| ≥ 1

Correlation coefficient confidence limits estimated for selected confidence level are:

Lower Limit: LL = 0.56

Upper Limit: UL = 0.92

Testing Correlation for Different Size Populations

Are two correlation coefficients (r1 and r2) different based on a difference in the number of observations for each (N)?

Enter Data: r1 = 0.97, N1 = 28; r2 = 0.99, N2 = 28

Enter Confidence level as α2: α2 = 0.95

CALCULATIONS:

Calculate Test Statistic:

$$ZN = \frac{1.1513 \cdot \log\left(\frac{1+r1}{1-r1}\right) - 1.1513 \cdot \log\left(\frac{1+r2}{1-r2}\right)}{\sqrt{\frac{1}{N1-3} + \frac{1}{N2-3}}}$$

ZN = −1.95996

NOTE: If the absolute value of Z(N) is greater than the absolute value of the z-statistic (Normal Curve one-tailed), we reject the null hypothesis and state that there is a significant difference in r1 and r2 at the selected significance level.

Calculate the Z-statistic at selected confidence limit:

Calculate z-table value: α1 = (α2 + 1)/2; z = qt(α1, 100000)

z-Value statistic: z = 1.96

The hypothesis test conclusion at the specified level of significance:

Test = 1 if |ZN| < |z|; 0 otherwise

Test = 1

0 = reject hypothesis – there IS a significant difference
1 = accept hypothesis – there is NOT a significant difference

Confidence Limits for Slope and Intercept:

Enter Confidence level as α2: α2 = 0.95

n = rowsX

CALCULATIONS:

− −−−−−−−−−−−−−→ X − meanX2 Sx = n−2

Slope and Intercept Calculations: X = 42

Y = 917

− −−−→ X2 = 364

n = rowsX −−−−→ X · Y = 7664

Slope −−−−→ n · X · Y − X · Y k1 = − − −−→ n· X2 − X2 k1 = 1.93035714285714

Miller and Miller, p. 109 −−−−−−−−−−−−−−−−−−−−−−−→ X − meanX · Y − meanY

bX = X − meanX2 bX = 193035714285714


Intercept:

$$k_0 = \frac{\sum X^2 \sum Y - \sum X \sum (X \cdot Y)}{n\sum X^2 - \left(\sum X\right)^2}$$

k0 = 1.51785714285715

Miller and Miller, p. 109:

aX = mean(Y) − bX · mean(X)

aX = 1.51785714285714

mean(X) = 6    mean(Y) = 13.1    bX = 1.93035714285714
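As a quick check of the slope and intercept formulas, the sums printed in the worksheet can be plugged in directly. The sketch below assumes n = 7, which is inferred from ΣX = 42 and mean(X) = 6 rather than stated explicitly; the function name is hypothetical.

```python
def slope_intercept_from_sums(n, sum_x, sum_y, sum_x2, sum_xy):
    """Least-squares slope k1 and intercept k0 from the raw sums."""
    denom = n * sum_x2 - sum_x ** 2
    k1 = (n * sum_xy - sum_x * sum_y) / denom
    k0 = (sum_x2 * sum_y - sum_x * sum_xy) / denom
    return k1, k0

# n = 7 is an inferred value (42 / 6); the sums are those printed above
k1, k0 = slope_intercept_from_sums(7, 42.0, 91.7, 364.0, 766.4)

# The Miller and Miller form of the intercept, mean(Y) - slope * mean(X),
# should agree with k0
a_x = 13.1 - k1 * 6.0
```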

Calculate t-table value:

$$\alpha_1 = \frac{\alpha_2 + 1}{2}$$

t = qt(α1, n − 2)

t-value statistic: t = 2.5706 (the value for n − 2 = 5 degrees of freedom)

Ye = k1 · X + k0

Standard Error of Estimate:

$$S_{yx} = \sqrt{\frac{\sum (Y - Y_e)^2}{n-2}}$$

Syx = 0.4328

Slope Confidence Limits:

Method 1

$$LL_{k1} = k_1 - \frac{t}{\sqrt{n-2}} \cdot \frac{\sqrt{\sum (Y - Y_e)^2}}{\sqrt{\sum (X - \text{mean}(X))^2}}$$

LLk1 = 1.82521966597124

$$UL_{k1} = k_1 + \frac{t}{\sqrt{n-2}} \cdot \frac{\sqrt{\sum (Y - Y_e)^2}}{\sqrt{\sum (X - \text{mean}(X))^2}}$$

ULk1 = 2.03549461974305


Slope Confidence Limits:

Method 2

$$LL = k_1 - \frac{t}{\sqrt{n-2}} \cdot \frac{S_{yx}}{S_x} \qquad UL = k_1 + \frac{t}{\sqrt{n-2}} \cdot \frac{S_{yx}}{S_x}$$

Slope confidence limits at selected confidence level are:

Lower Limit: LL = 1.82521966597124

Upper Limit: UL = 2.03549461974305

Using Miller and Miller Formulas (pp. 100–111)

$$s_{yx} = \sqrt{\frac{\sum (Y - Y_e)^2}{n-2}}$$

syx = 0.433

$$s_b = \frac{s_{yx}}{\sqrt{\sum (X - \text{mean}(X))^2}}$$

sb = 0.041

Csb = t · sb

Lower Limit: k1 − Csb = 1.82521966597124

Upper Limit: k1 + Csb = 2.03549461974305

Intercept confidence limits at selected confidence level are:

Method 1

$$LL_{k0} = k_0 - t \cdot \sqrt{\frac{\sum (Y - Y_e)^2 \cdot \sum X^2}{(n-2) \cdot n \cdot \sum (X - \text{mean}(X))^2}}$$

LLk0 = 0.759700015087087

$$UL_{k0} = k_0 + t \cdot \sqrt{\frac{\sum (Y - Y_e)^2 \cdot \sum X^2}{(n-2) \cdot n \cdot \sum (X - \text{mean}(X))^2}}$$

ULk0 = 2.27601427062721


Using Miller and Miller Formulas (pp. 100–111)

$$s_a = s_{yx} \cdot \sqrt{\frac{\sum X^2}{n \cdot \sum (X - \text{mean}(X))^2}}$$

sa = 0.2949

Csa = t · sa

Lower Limit: k0 − Csa = 0.759700015087087

Upper Limit: k0 + Csa = 2.27601427062721
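The Miller and Miller confidence limits can be reproduced approximately from the summary values printed in the worksheet. This is a sketch under stated assumptions: n = 7 is inferred from ΣX = 42 and mean(X) = 6, the centered sum of squares is computed as ΣX² − n · mean(X)², the rounded values Syx = 0.4328 and t = 2.5706 are used as printed, and the function name is hypothetical.

```python
import math

def confidence_limits(k1, k0, syx, sxx, sum_x2, n, t):
    """Return ((LL_k1, UL_k1), (LL_k0, UL_k0)) from summary statistics."""
    sb = syx / math.sqrt(sxx)                  # standard deviation of the slope
    sa = syx * math.sqrt(sum_x2 / (n * sxx))   # standard deviation of the intercept
    return (k1 - t * sb, k1 + t * sb), (k0 - t * sa, k0 + t * sa)

n = 7                              # inferred, not printed in the worksheet
sum_x2 = 364.0
sxx = sum_x2 - n * 6.0 ** 2        # sum of (X - mean(X))^2
slope_limits, intercept_limits = confidence_limits(
    k1=1.93035714285714, k0=1.51785714285714,
    syx=0.4328, sxx=sxx, sum_x2=sum_x2, n=n, t=2.5706)
```

Because the inputs are rounded, the results agree with the printed limits only to about three or four decimal places.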

REFERENCES

b-1. Workman, J. and Mark, H., “Chemometrics in Spectroscopy: Comparison of Goodness of Fit Statistics for Linear Regression – Part 1, Introduction,” Spectroscopy 19(4), 32–35 (2004).

b-2. Workman, J. and Mark, H., “Chemometrics in Spectroscopy: Comparison of Goodness of Fit Statistics for Linear Regression – Part 2, The Correlation Coefficient,” Spectroscopy 19(6), 29–33 (2004).

b-3. Workman, J. and Mark, H., “Chemometrics in Spectroscopy: Comparison of Goodness of Fit Statistics for Linear Regression – Part 3, Computing Confidence Limits for the Correlation Coefficient,” Spectroscopy 19(7), 31–33 (2004).

b-4. Workman, J. and Mark, H., “Chemometrics in Spectroscopy: Comparison of Goodness of Fit Statistics for Linear Regression – Part 4, Confidence Limits for Slope and Intercept,” Spectroscopy 19(10), 30–31 (2004).

62

Correction and Discussion Regarding Derivatives

The previous Chapters 54 through 57 dealing with the analysis of derivatives of spectra were first published as [1–4]. It seems that, unfortunately, those columns contained some errors. Although those errors were corrected in Chapter 54, we wanted to include the thought process and comments that went into those corrections. One of the errors was caught early, and we were able to get the correction into the subsequent column [2]. Some of the others were not detected until some time had passed and various people had the opportunity (and time, and inclination) to check the equations in detail. Some of the errors were relatively minor (typographical errors in tables, for example), but some were substantive (and substantial). However, to get a complete set of corrections in one place, we here list all the errors found (and the corrections). Equation numbering follows that of the original chapter numbers and corresponding equations.

First, in going from equation 54-3 to equation 54-4 [1], when we factored the constants from the derivative we should have taken out 1/σ², whereas we factored out 1/σ. Therefore several equations from equation 54-4 on are off by a factor of σ. The correct equations are

$$\frac{dY}{dX} = \frac{1}{\sigma(2\pi)^{1/2}}\, e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^2} \left(-\frac{1}{2\sigma^2}\right) \frac{d}{dX}(X-\mu)^2 \tag{54-4}$$

$$\frac{dY}{dX} = \frac{1}{\sigma(2\pi)^{1/2}}\, e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^2} \left(-\frac{1}{2\sigma^2}\right) 2(X-\mu) \tag{54-5}$$

$$\frac{dY}{dX} = \frac{-(X-\mu)}{\sigma^3(2\pi)^{1/2}}\, e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^2} \tag{54-6a}$$

$$\frac{dY}{dX} = \frac{-(X-\mu)}{\sigma^2}\, e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^2} \tag{54-6b}$$

Similarly, the correct equations for the second derivative of the Normal distribution are

$$\frac{d^2Y}{dX^2} = \frac{-(X-\mu)}{\sigma^3(2\pi)^{1/2}}\, e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^2} \frac{d}{dX}\left[-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^2\right] + e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^2} \frac{d}{dX}\left[\frac{-(X-\mu)}{\sigma^3(2\pi)^{1/2}}\right] \tag{54-7}$$

$$\frac{d^2Y}{dX^2} = \frac{-(X-\mu)}{\sigma^3(2\pi)^{1/2}}\, e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^2} \left(-\frac{X-\mu}{\sigma^2}\right) + e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^2} \left(\frac{-1}{\sigma^3(2\pi)^{1/2}}\right) \tag{54-8}$$


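$$\frac{d^2Y}{dX^2} = \left[\frac{(X-\mu)^2}{\sigma^5(2\pi)^{1/2}} - \frac{1}{\sigma^3(2\pi)^{1/2}}\right] e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^2} \tag{54-9a}$$

$$\frac{d^2Y}{dX^2} = \left[\frac{(X-\mu)^2}{\sigma^4} - \frac{1}{\sigma^2}\right] e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^2} \tag{54-9b}$$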
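One way to catch errors of this kind is to compare an analytic derivative with a finite-difference estimate. The sketch below does this for the corrected equations 54-6a and 54-9a, assuming the area-normalized Normal curve as the starting function; the function names, step size and test point are arbitrary illustrative choices.

```python
import math

SQRT_2PI = math.sqrt(2.0 * math.pi)

def normal(x, mu, sigma):
    """Area-normalized Normal curve (assumed starting point of the derivation)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * SQRT_2PI)

def d1_normal(x, mu, sigma):
    """First derivative, equation 54-6a."""
    return -(x - mu) / (sigma ** 3 * SQRT_2PI) * math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def d2_normal(x, mu, sigma):
    """Second derivative, equation 54-9a."""
    bracket = (x - mu) ** 2 / (sigma ** 5 * SQRT_2PI) - 1.0 / (sigma ** 3 * SQRT_2PI)
    return bracket * math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def central_difference(f, x, h=1e-4):
    """Symmetric finite-difference estimate of df/dx."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

mu, sigma, x0 = 0.0, 2.0, 1.3
numeric_d1 = central_difference(lambda x: normal(x, mu, sigma), x0)
numeric_d2 = central_difference(lambda x: d1_normal(x, mu, sigma), x0)
```

If a factor of σ were missing from either formula, the analytic and numeric values would disagree by that factor for any σ other than 1.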

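Next, going from equation 54-10 to equation 54-11 for the Lorentzian distribution (in the same Chapter 54) there were a couple of errors, including a missed sign change and not correctly bringing σ² inside the brackets containing an expression that was itself squared. Again, all the subsequent equations derived from equation 54-11 were themselves then also in error. The corrected derivation follows. This time, we present the derivation in much smaller and more detailed steps than initially. In doing this, we give intermediate equations letters, so that the equations labeled with pure numbers correspond to the original equation with the same number, and can be compared with it:

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{-1}{\left[1+\left(\frac{2(\mu-X)}{\sigma}\right)^2\right]^2} \times \frac{d}{dX}\left[1+\left(\frac{2(\mu-X)}{\sigma}\right)^2\right] \tag{54-10}$$

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{-1}{\left[1+\left(\frac{2(\mu-X)}{\sigma}\right)^2\right]^2} \times \left[0 + \frac{d}{dX}\left(\frac{2(\mu-X)}{\sigma}\right)^2\right] \tag{54-10a}$$

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{-1}{\left[1+\left(\frac{2(\mu-X)}{\sigma}\right)^2\right]^2} \times 2 \times \frac{2(\mu-X)}{\sigma} \times \frac{d}{dX}\left(\frac{2(\mu-X)}{\sigma}\right) \tag{54-10b}$$

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{-1}{\left[1+\left(\frac{2(\mu-X)}{\sigma}\right)^2\right]^2} \times \frac{4(\mu-X)}{\sigma} \times \frac{2}{\sigma} \times \frac{d}{dX}(\mu-X) \tag{54-10c}$$

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{-1}{\left[1+\left(\frac{2(\mu-X)}{\sigma}\right)^2\right]^2} \times \frac{8(\mu-X)}{\sigma^2} \times (0-1) \tag{54-10d}$$

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{-1}{\left[1+\left(\frac{2(\mu-X)}{\sigma}\right)^2\right]^2} \times \frac{-8(\mu-X)}{\sigma^2} \tag{54-10e}$$

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{8(\mu-X)}{\sigma^2\left[1+\left(\frac{2(\mu-X)}{\sigma}\right)^2\right]^2} \tag{54-10f}$$

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{8(\mu-X)}{\left\{\sigma\left[1+\left(\frac{2(\mu-X)}{\sigma}\right)^2\right]\right\}^2} \tag{54-10g}$$

(54-10g is the step where the error crept in previously – you can’t be too careful.)

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{8(\mu-X)}{\left[\sigma+\sigma\left(\frac{2(\mu-X)}{\sigma}\right)^2\right]^2} \tag{54-10h}$$

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{8(\mu-X)}{\left[\sigma+\frac{\sigma\,(2(\mu-X))^2}{\sigma^2}\right]^2} \tag{54-10i}$$

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{8(\mu-X)}{\left[\sigma+\frac{4(\mu-X)^2}{\sigma}\right]^2} \tag{54-10j}$$

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{\sigma^2}{\sigma^2} \times \frac{8(\mu-X)}{\left[\sigma+\frac{4(\mu-X)^2}{\sigma}\right]^2} \tag{54-10k}$$

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{8\sigma^2(\mu-X)}{\left[\sigma^2+4(\mu-X)^2\right]^2} \tag{54-11}$$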

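The error in equation 54-11 then propagated through to the rest of the equations for the Lorentzian distribution. The correct formulas are as follows:

$$\frac{d^2Y}{dX^2} = \frac{2}{\pi\sigma} \times \left\{ \frac{[\sigma^2+4(\mu-X)^2]^2\, \frac{d}{dX}[8\sigma^2(\mu-X)]}{[\sigma^2+4(\mu-X)^2]^4} - \frac{8\sigma^2(\mu-X)\, \frac{d}{dX}[\sigma^2+4(\mu-X)^2]^2}{[\sigma^2+4(\mu-X)^2]^4} \right\} \tag{54-12}$$

$$\frac{d^2Y}{dX^2} = \frac{2}{\pi\sigma} \times \left\{ \frac{[\sigma^2+4(\mu-X)^2]^2 \times 8\sigma^2 \times \frac{d}{dX}(\mu-X)}{[\sigma^2+4(\mu-X)^2]^4} - \frac{8\sigma^2(\mu-X) \times 2[\sigma^2+4(\mu-X)^2]\, \frac{d}{dX}[\sigma^2+4(\mu-X)^2]}{[\sigma^2+4(\mu-X)^2]^4} \right\} \tag{54-12a}$$

$$\frac{d^2Y}{dX^2} = \frac{2}{\pi\sigma} \times \left\{ \frac{[\sigma^2+4(\mu-X)^2]^2 \times 8\sigma^2 \times (0-1)}{[\sigma^2+4(\mu-X)^2]^4} - \frac{8\sigma^2(\mu-X) \times 2[\sigma^2+4(\mu-X)^2]\, \frac{d}{dX}[\sigma^2+4(\mu-X)^2]}{[\sigma^2+4(\mu-X)^2]^4} \right\} \tag{54-12b}$$

$$\frac{d^2Y}{dX^2} = \frac{2}{\pi\sigma} \times \left\{ \frac{-8\sigma^2[\sigma^2+4(\mu-X)^2]^2}{[\sigma^2+4(\mu-X)^2]^4} - \frac{8\sigma^2(\mu-X) \times 2[\sigma^2+4(\mu-X)^2]\, \frac{d}{dX}[\sigma^2+4(\mu-X)^2]}{[\sigma^2+4(\mu-X)^2]^4} \right\} \tag{54-13}$$

$$\frac{d^2Y}{dX^2} = \frac{2}{\pi\sigma} \times \left\{ \frac{-8\sigma^2[\sigma^2+4(\mu-X)^2]^2}{[\sigma^2+4(\mu-X)^2]^4} - \frac{16\sigma^2(\mu-X)[\sigma^2+4(\mu-X)^2]\left[\frac{d}{dX}\sigma^2 + \frac{d}{dX}4(\mu-X)^2\right]}{[\sigma^2+4(\mu-X)^2]^4} \right\} \tag{54-14}$$

$$\frac{d^2Y}{dX^2} = \frac{2}{\pi\sigma} \times \left\{ \frac{-8\sigma^2[\sigma^2+4(\mu-X)^2]^2}{[\sigma^2+4(\mu-X)^2]^4} - \frac{16\sigma^2(\mu-X)[\sigma^2+4(\mu-X)^2]\left[0 + 4 \times 2(\mu-X)\frac{d}{dX}(\mu-X)\right]}{[\sigma^2+4(\mu-X)^2]^4} \right\} \tag{54-14a}$$

$$\frac{d^2Y}{dX^2} = \frac{2}{\pi\sigma} \times \left\{ \frac{-8\sigma^2[\sigma^2+4(\mu-X)^2]^2}{[\sigma^2+4(\mu-X)^2]^4} - \frac{16\sigma^2(\mu-X)[\sigma^2+4(\mu-X)^2]\left[8(\mu-X)\frac{d}{dX}(\mu-X)\right]}{[\sigma^2+4(\mu-X)^2]^4} \right\} \tag{54-14b}$$

$$\frac{d^2Y}{dX^2} = \frac{2}{\pi\sigma} \times \left\{ \frac{-8\sigma^2[\sigma^2+4(\mu-X)^2]^2}{[\sigma^2+4(\mu-X)^2]^4} - \frac{16\sigma^2(\mu-X)[\sigma^2+4(\mu-X)^2]\left[8(\mu-X)(0-1)\right]}{[\sigma^2+4(\mu-X)^2]^4} \right\} \tag{54-14c}$$

$$\frac{d^2Y}{dX^2} = \frac{2}{\pi\sigma} \times \left\{ \frac{-8\sigma^2[\sigma^2+4(\mu-X)^2]^2}{[\sigma^2+4(\mu-X)^2]^4} - \frac{16\sigma^2(\mu-X)[\sigma^2+4(\mu-X)^2]\left[-8(\mu-X)\right]}{[\sigma^2+4(\mu-X)^2]^4} \right\} \tag{54-15}$$

$$\frac{d^2Y}{dX^2} = \frac{2}{\pi\sigma} \times \left\{ \frac{-8\sigma^2[\sigma^2+4(\mu-X)^2]^2}{[\sigma^2+4(\mu-X)^2]^4} - \frac{-128\sigma^2(\mu-X)^2[\sigma^2+4(\mu-X)^2]}{[\sigma^2+4(\mu-X)^2]^4} \right\} \tag{54-15a}$$

$$\frac{d^2Y}{dX^2} = \frac{2}{\pi\sigma} \times \left\{ \frac{-8\sigma^2[\sigma^2+4(\mu-X)^2]^2}{[\sigma^2+4(\mu-X)^2]^4} + \frac{128\sigma^2(\mu-X)^2[\sigma^2+4(\mu-X)^2]}{[\sigma^2+4(\mu-X)^2]^4} \right\} \tag{54-15b}$$

$$\frac{d^2Y}{dX^2} = \frac{2}{\pi\sigma} \times \left\{ \frac{-8\sigma^2[\sigma^2+4(\mu-X)^2]}{[\sigma^2+4(\mu-X)^2]^3} + \frac{128\sigma^2(\mu-X)^2}{[\sigma^2+4(\mu-X)^2]^3} \right\} \tag{54-15c}$$

$$\frac{d^2Y}{dX^2} = \frac{16\sigma}{\pi} \times \left\{ \frac{-[\sigma^2+4(\mu-X)^2]}{[\sigma^2+4(\mu-X)^2]^3} + \frac{16(\mu-X)^2}{[\sigma^2+4(\mu-X)^2]^3} \right\} \tag{54-15d}$$

$$\frac{d^2Y}{dX^2} = \frac{16\sigma}{\pi} \times \left\{ \frac{-[\sigma^2+4(\mu-X)^2] + 16(\mu-X)^2}{[\sigma^2+4(\mu-X)^2]^3} \right\} \tag{54-15e}$$

$$\frac{d^2Y}{dX^2} = \frac{16\sigma}{\pi} \times \left\{ \frac{-\sigma^2 - 4(\mu-X)^2 + 16(\mu-X)^2}{[\sigma^2+4(\mu-X)^2]^3} \right\} \tag{54-15f}$$

$$\frac{d^2Y}{dX^2} = \frac{16\sigma}{\pi} \times \left\{ \frac{12(\mu-X)^2 - \sigma^2}{[\sigma^2+4(\mu-X)^2]^3} \right\} \tag{54-16}$$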

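This correction also propagates to equation 54-18 when we set (X − μ) equal to σ (the result is shown here in magnitude):

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{8\sigma^2 \cdot \sigma}{\left[\sigma^2+4\sigma^2\right]^2} = \frac{2}{\pi\sigma} \times \frac{8\sigma^3}{25\sigma^4} = \frac{16}{25\pi\sigma^2} \tag{54-18}$$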
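The corrected Lorentzian results can be checked numerically in the same way as the Normal ones. In the sketch below the starting function is assumed to be Y = (2/(πσ)) / (1 + (2(μ − X)/σ)²), which is the form implied by equation 54-10; the function names and the test values of μ, σ and X are illustrative only.

```python
import math

def lorentzian(x, mu, sigma):
    """Lorentzian band in the form implied by equation 54-10 (an assumption)."""
    return (2.0 / (math.pi * sigma)) / (1.0 + (2.0 * (mu - x) / sigma) ** 2)

def d1_lorentzian(x, mu, sigma):
    """First derivative, equation 54-11."""
    return (2.0 / (math.pi * sigma)) * 8.0 * sigma ** 2 * (mu - x) / (
        sigma ** 2 + 4.0 * (mu - x) ** 2) ** 2

def d2_lorentzian(x, mu, sigma):
    """Second derivative, equation 54-16."""
    return (16.0 * sigma / math.pi) * (12.0 * (mu - x) ** 2 - sigma ** 2) / (
        sigma ** 2 + 4.0 * (mu - x) ** 2) ** 3

def central_difference(f, x, h=1e-4):
    """Symmetric finite-difference estimate of df/dx."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

mu, sigma, x0 = 0.0, 2.0, 0.7
numeric_d1 = central_difference(lambda x: lorentzian(x, mu, sigma), x0)
numeric_d2 = central_difference(lambda x: d1_lorentzian(x, mu, sigma), x0)

# Magnitude of the first derivative one sigma away from the band center (equation 54-18)
slope_at_sigma = abs(d1_lorentzian(mu + sigma, mu, sigma))
```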

Third, an error in evaluating the exponential in equation 54-19 led to the incorrect constant multiplier. The corrected expression is

$$\frac{d^2Y}{dX^2_{MAX}} = \left[\frac{(\mu-\mu)^2}{\sigma^5(2\pi)^{1/2}} - \frac{1}{\sigma^3(2\pi)^{1/2}}\right] e^{-\frac{1}{2}\left(\frac{\mu-\mu}{\sigma}\right)^2} = \left(0 - \frac{1}{\sigma^3(2\pi)^{1/2}}\right) e^0 = \frac{-1}{\sigma^3(2\pi)^{1/2}} \tag{54-19}$$

We see, therefore, that the derivative decreases with the third power of σ, the same rate as the derivative of the Normal distribution.

Next, the matrices in Chapter 56 [3] contain several erroneous entries. There are a number of sign errors, and some errors in values, mostly resulting from formatting problems in the manuscript. Here we present the corrected matrices for those. For equation 56-25, the fourth entry on the fourth line had a formatting problem; the correct value is 1588.

56-26

$$(M^T M)^{-1} = \begin{bmatrix}
0.333333 & 0 & -0.0476190 & 0 \\
0 & +0.262566 & 0 & -0.0324074 \\
-0.0476190 & 0 & 0.01190476 & 0 \\
0 & -0.0324074 & 0 & 0.00462962
\end{bmatrix}$$

56-27

$$(M^T M)^{-1} M^T = \begin{bmatrix}
-0.095238 & 0.14285714 & 0.28571428 & 0.333333 & 0.28571428 & 0.14285714 & -0.0952381 \\
0.087301 & -0.2658730 & -0.2301587 & 0 & 0.23015873 & 0.2658730 & -0.0873015 \\
0.059523 & 0 & -0.0357143 & -0.047619 & -0.0357143 & 0 & 0.05952381 \\
-0.027777 & 0.0277777 & 0.02777777 & 0 & -0.0277777 & -0.0277777 & 0.02777777
\end{bmatrix}$$

56-28

$$(M^T M)^{-1} M^T \text{ (corrected for scaling)} = \begin{bmatrix}
-0.095238 & 0.1428571 & 0.2857143 & 0.333333 & 0.285714 & 0.142857 & -0.095238 \\
0.0873016 & -0.265873 & -0.230158 & 0 & 0.230158 & 0.265873 & -0.087301 \\
0.1190476 & 0 & -0.071428 & -0.09523 & -0.071428 & 0 & 0.1190476 \\
-0.166666 & 0.166666 & 0.1666666 & 0 & -0.166666 & -0.166666 & 0.1666666
\end{bmatrix}$$
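Matrix entries like these are easy to mistype, so it can help to regenerate them. The sketch below is built on two assumptions that are not stated in this chapter: that M is the 7-point design matrix with columns 1, x, x², x³ for x = −3 … 3, and that "corrected for scaling" means multiplying row k by k!. Both assumptions reproduce the printed values, including the 1588 entry of equation 56-25. Exact fractions are used so that no rounding enters; the helper names are hypothetical.

```python
from fractions import Fraction
from math import factorial

def transpose(a):
    return [list(row) for row in zip(*a)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def inverse(a):
    """Gauss-Jordan inverse of a square matrix of Fractions."""
    n = len(a)
    aug = [list(row) + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(a)]
    for c in range(n):
        pivot_row = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[pivot_row] = aug[pivot_row], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c and aug[r][c] != 0:
                factor = aug[r][c]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

# Assumed design matrix: 7 points, cubic polynomial
M = [[Fraction(x) ** k for k in range(4)] for x in range(-3, 4)]
MtM = matmul(transpose(M), M)                       # equation 56-25
MtM_inv = inverse(MtM)                              # equation 56-26
coeffs = matmul(MtM_inv, transpose(M))              # equation 56-27
scaled = [[factorial(k) * v for v in row]           # equation 56-28 (assumed k! scaling)
          for k, row in enumerate(coeffs)]
```

Converting any entry with `float(...)` gives the decimal form used in the printed matrices.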

The next (and final) item is, perhaps, not so much an error as a question of possible differences in interpretation of the results and meanings of some of the derivative


computations presented. One of our respondents pointed out that the magnitudes of the various derivatives, and especially the relative magnitudes of derivatives of different orders, depend on the units used, particularly the units used to describe the X-axis. Now, while in fact we did not specify any units in our discussion (see, e.g., Figure 54-1 in Chapter 54 [1], where the X-axis contains only the label “Wavelength”), given our backgrounds, it is true enough that we implicitly had nanometers in mind for our X-units. In the case of real spectra, however, if spectra were measured using, say, microns as the units for the X-axis, the same spectrum would have a calculated value for the first derivative that was 1000 times what would be calculated for a “nm-based” derivative. In that case, the first derivative (for a 10 nm wide band, which would be a 0.01 micron wide band) would be 100 times greater than the maximum spectral value, rather than being 1/10 of it, as the value computed using nanometers for the X-scale came out to. The second derivative would then be 10^6 times what we calculated and therefore 10,000 times greater than the maximum spectral value, instead of being 1/100 of it, the value we showed. In principle this is all correct. In practice, however, if we ignore FTIR and specialty technologies such as AOTF, then the vast majority of instruments in use today for modern NIR spectroscopy (still primarily diffraction grating based instruments) use nanometers as their wavelength unit, and usually collect data at some small integer number of nanometers. Furthermore, the vast majority of those have a 10-nm bandpass, so that 10 nm is the minimum bandwidth that would be measured. Also, even for instruments with higher resolution, the natural bandwidths of many, or even most, absorbance bands of materials that are commonly measured are greater than 10 nm in the NIR. 
Given all this, the use of a 10 nm figure to represent a “typical” NIR absorbance band is not unrealistic, and gives the reader a realistic assessment of what a “typical” user can expect from the NIR spectra he measures, and their derivatives. The choice of units, of course, does not affect the instrumental characteristic of signal-to-noise, which is what is important, and which we discuss in part IV of the sub-series [4]. If we consider FTIR instrumentation, then the situation is trickier, since the equivalent resolution in nm varies across the spectrum. But even keeping the spectrum in its “natural” wavenumber units, we again find that except for rotational fine structure of gases, the natural bandwidth of many (most) absorbance bands is greater than 10 wavenumbers. So again, using that figure shows the “typical” user how he can expect his own measured spectra to behave. We thank Todd Sauke, Peter Watson and (again) Colin Christy for pointing out the errors and for general comments and discussion.
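The dependence of derivative magnitudes on the X-axis units can be illustrated with a short Python sketch. It assumes a unit-height Gaussian band whose σ is taken as the "10 nm" width (the text does not say which width measure was meant), uses the unit-height derivative formulas 54-6b and 54-9b, and places the band at an arbitrary 1500 nm; all names are hypothetical.

```python
import math

def band(x, mu, sigma):
    """Unit-height Gaussian band (assumed band shape)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def d1(x, mu, sigma):
    """First derivative of the unit-height band, equation 54-6b."""
    return -(x - mu) / sigma ** 2 * band(x, mu, sigma)

def d2(x, mu, sigma):
    """Second derivative of the unit-height band, equation 54-9b."""
    return ((x - mu) ** 2 / sigma ** 4 - 1.0 / sigma ** 2) * band(x, mu, sigma)

def peak_magnitudes(mu, sigma, points=2001):
    """Largest |first| and |second| derivative on a grid spanning mu +/- 5 sigma."""
    xs = [mu + sigma * (-5.0 + 10.0 * i / (points - 1)) for i in range(points)]
    return (max(abs(d1(x, mu, sigma)) for x in xs),
            max(abs(d2(x, mu, sigma)) for x in xs))

in_nm = peak_magnitudes(mu=1500.0, sigma=10.0)       # X-axis in nanometers
in_um = peak_magnitudes(mu=1.5, sigma=0.010)         # same band, X-axis in microns

first_ratio = in_um[0] / in_nm[0]     # expected to be close to 1000
second_ratio = in_um[1] / in_nm[1]    # expected to be close to 10^6
```

The spectrum itself is unchanged by the choice of units; only the numerical size of dY/dX and d²Y/dX² changes, by the unit conversion factor and its square.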

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 18(4), 32–37 (2003).
2. Mark, H. and Workman, J., Spectroscopy 18(9), 25–28 (2003).
3. Mark, H. and Workman, J., Spectroscopy 18(12), 106–111 (2003).
4. Mark, H. and Workman, J., Spectroscopy 19(1), 44–51 (2004).


63

Linearity in Calibration: Act III Scene I – Importance of Nonlinearity

Here we go again. We seem to come up with the same themes. There are two reasons for that: first, there is so much to say, and second, the format of these chapters, which is an open-ended discussion of all manner of things chemometric, gives us the opportunity to expand on a topic to any extent we consider necessary and desirable, sometimes after having discussed it in lesser detail previously, or not having discussed a particular aspect. Since we have previously discussed linearity in Chapters 27 and 29–33 to a considerable extent [1–6], you might think that there was little more to say. Hardly! In this chapter we will discuss where linearity considerations fit into the larger scheme of calibration theory, then we will discuss methods of testing data for linearity (or, more accurately, nonlinearity) and what can be done about it. This is not the first time we have addressed nonlinearity. In fact, the first time either of us addressed the issue was quite a long time ago, although from a purely qualitative point of view [7]. More recently others, particularly in the NIR community, have been starting to take an interest as well, mainly from the point of view of detecting nonlinearity in the data. Chuck Miller described some of the sources of nonlinearity in an article in NIR News [8]. Our good colleague and friend Tom Fearn, who writes a column in the British journal NIR News, has recently tackled this somewhat thorny topic [9]. A bit farther back, Tony Davies also addressed this topic, although in a more general context [10].

WHY IS NONLINEARITY IMPORTANT?

Discussions dealing with quantitative spectroscopic analysis often list many sources of error. This is particularly true in the case of NIR analysis, where the error sources are often categorized into subheadings, such as errors due to the instrument (e.g., noise, drift, etc.), errors due to the sample (inhomogeneity, etc.), errors due to chemistry/physics (interactions, etc.) and data handling (outliers, intercorrelation, etc.). Indeed, we have often done this ourselves. Breaking down the error sources into the smallest pieces that contribute to the total error of the analysis and categorizing them is an exercise of great importance, since it is only by identifying and classifying errors this way that we can devise methods to reduce them and so improve our analyses. However, for our current purposes we want to approach the situation somewhat differently. What we want to do here is to consider that, after all the samples are prepared, after all the experiments are performed, after all the data is collected, what we end up with is a table (or maybe more than one table) of numbers – even if that table exists only in a computer memory somewhere. Everything that affects the data, for good or bad, is reflected one way or another in that table. All the dozens of individual effects that are described in the more detailed tables of error sources are, in the end, only effective by the way they are manifested in the spectrum and therefore in the spectral data. Therefore, everything that affects the performance of our spectroscopic analyses can be distilled into the effect that it has on the data, and the effects that are manifested in the calibration results. There are surprisingly few of these, if considered generally enough. This is essentially the opposite of the detailed breakdowns described above; it is the lumping together of effects into a very few categories. While some may disagree, all the effects described in the detailed listings can be classified into one of the following categories, and shown to be manifested in the data through one (or more) of these characteristics (or at least, this is one way to categorize them):

(1) Characteristics that act on the X data or the Y data alone:
    a. Random error
    b. Drift & other systematic error
(2) Characteristics that affect the relationship between X and Y:
    a. Poor choice of algorithm and/or data transformation
    b. Incorrect choice of factors/wavelengths
    c. Nonlinearity.

As indicated, the first two items on this condensed listing include those aspects of measurement that contribute error to measurements of the spectral data or of the constituent information, while the last group includes all those aspects that affect the relationships between them. From this list we see that nonlinearity is one of the fundamental limiting characteristics that makes it through this (rather brutal) screening process. For a long time, however, the contribution of nonlinearity to the error of spectroscopic calibrations was not generally recognized by the spectroscopic (or the chemometric) community. 
Much attention was given to issues of random noise, choice of factors (for PCR and PLS calibration) and wavelengths (for MLR calibrations) and investigations into the “best” data transform. Innumerable papers were written, and presentations were made concerning empirical methods of trying to improve calibration performance, but to a large extent they only addressed these three characteristics. These efforts could largely be summarized by the following template that can be applied to specific cases by replacing the terms in angle brackets with the specific term used in a given paper: “Calibration for in <product> by using ” to come up with the title for the paper. But with the exception of Tom Fearn’s column, very little theoretical analysis of the behavior of NIR calibrations appears in the current NIR literature. Even the empirical work that has appeared almost completely ignores the issue of linearity in favor of concentrating on the more glamorous issues of the “best” data transform or the number of factors to include in the latest whiz-bang algorithm. Lately there is another player starting to rear its head. Let us start with a little background. Regulatory agencies, especially the FDA, for many years have relied on wet chemical analysis, and more recently chromatography, to perform the required analyses. That is not right; let us reword it: for many years, companies that had to meet the requirements of regulatory agencies, particularly those that have to meet the requirements of the FDA (i.e., pharmaceutical companies), have relied on wet chemical analysis, and more recently chromatography, to perform the required chemical analyses. The analytical


methods used (i.e., those that have obtained the approval of the regulatory agency, a term which essentially means the FDA, in this writing) are mostly inherently univariate. There are publications available that provide the official specifications for characteristics that an analytical method must meet in order to be accepted by the regulatory agency; these specifications are all designed to accommodate the characteristics of the univariate methods. The US Pharmacopoeia provides the official specifications for the United States, and the FDA requires that all analytical methods used for products under their supervision meet those specifications. Other countries have equivalent agencies. In order to reduce the burden on the many pharmaceutical companies that are international in scope, there exists an organization called the International Committee on Harmonization (ICH) that advises individual countries’ agencies with a view toward having uniform requirements. (We are grateful to Gary Ritchie for verifying the accuracy of statements regarding the structure, mechanisms and meaning of the regulatory processes (G. Ritchie, 2002, personal communication).) The FDA is very conservative, and for good reason. And we, at least, are very glad of that whenever we go to the drug store to buy some antibiotics, painkillers, anticholesterol drugs or any other medicine. Reading the required specifications for analytical methods makes it abundantly clear that they were written with univariate analytical methods in mind. The conservatism of the regulatory agencies means that it will be difficult to make the sweeping changes that we might like to see happen, that will permit NIR and other analytical methods based on multivariate methods of analysis to be used. Nevertheless, by the time you read this chapter, the FDA will have convened several meetings of interested scientists, to advise them on whether, and how, these methods can become approved. 
But in order to understand what needs to be changed, we first need to understand the current situation. In order for a pharmaceutical company to use any analytical method for certifying the properties (efficacy, potency, etc.) of their products, the analytical method has to be validated. “Validation”, in the parlance of the FDA, is a far cry from what we usually call “validation” when developing a multivariate spectroscopic method. In fact, what we call “validation” in spectroscopic calibration (which usually means calculating an SEP, or an SECV) is a far cry from the dictionary definition of “validate”, which is “to make legally valid”, where “valid” is defined as “having legal efficacy or force” [11]. The meaning of “validation” as used by the FDA is much closer to the dictionary definition (not surprising, since the FDA is an entity very much concerned with the legal as well as the technical issues concerning validation of analytical methods) than it is to the spectroscopic concept of validation, but still differs considerably even from that. While still very general, the FDA’s definition of “validation” is much more specific than the dictionary definition. The bottom line of the FDA meaning of “validation” is essentially to thoroughly demonstrate scientifically (meaning: to “prove” in a manner that is both scientifically and legally defensible) that the method is “suitable for its intended purpose”. In the world of the FDA, anything having to do with the manufacture of pharmaceutical products (equipment, chemicals, processes, etc.) must be validated in the described sense, including the analytical methods used for testing them. When developing an analytical method to meet the requirements of being validatable, the burden is on the developer of the method to show that it is, in fact, “suitable for its intended purpose”. The Pharmacopoeia and ICH specifications include a “laundry


list” of characteristics or “validation parameters” that must be tested. In this chapter we are not going to discuss the general topic of validating an analytical method for FDA approval; among other reasons is that they do not all fall under the umbrella of “chemometrics in spectroscopy”. We are only interested in the more limited topic of nonlinearity; therefore it suffices for us to simply point out that one of the parameters that must be tested and demonstrated for an analytical method is its linearity. The burden is on the developer of a method to demonstrate linearity between the response of the method and the concentration of the analyte that the method purports to measure. What does that mean? Any analytical method, whether based on wet chemistry, chromatography, or spectroscopy (or other technology: electrochemistry, for example) provides, as its final, ultimate output, a number. This number, which we claim represents the amount of the analyte in the sample (whether that is a concentration, total amount, or some other characteristic), we can call the response of the method to the analyte. The guidelines provide variant descriptions of the meaning of the term “linearity”. One definition is, “ability (within a given range) to obtain test results which are directly proportional to the concentration (amount) of analyte in the sample” [12]. This is an extremely strict definition, one which in practice would be unattainable when noise and error are taken into account. Figure 63-1a schematically illustrates the problem. While there is a line that meets the criterion that “test results are directly proportional to the concentration of analyte in the sample”, none of the data points fall on that line; therefore, in the strictest sense of the phrase, none of the data representing the test results can be said to be proportional to the analyte concentration. 
In the face of nonlinearity of response, there are systematic departures from the line as well as random departures, but in neither case is any data point strictly proportional to the concentration. Less strict descriptions of linearity are also provided. One recommendation is visual examination of a plot (unspecified, but presumably also of the method response versus the analyte concentration). Another recommendation is to use “statistical methods”; calculation of a regression line is advised. If regression is performed, the correlation coefficient, slope, y-intercept and residual sum of squares are to be reported. These requirements are all in keeping with their background of being applied to univariate methods of analysis. There is no indication given as to how these quantities are to be related to linearity, only that they be reported.

[Figure 63-1, panels (a) and (b): test results plotted against analyte concentration.]

Figure 63-1 Linear and nonlinear data. Figure 63-1a: Even when the overall trend of the data is to follow a straight line none of the data points meet the strict criterion of having the test results strictly proportional to the analyte concentration. Figure 63-1b shows that for nonlinear data there are systematic departures from the straight line as well as random departures.

The recommendations all have difficulties. In the first place, there is a specification that a minimum of five concentrations are to be used. However, reflecting the background of the guidelines in a world of univariate analyses, the different concentrations are to be created using dilution techniques. This method of creating samples is generally unsuitable for spectroscopic (especially NIR) analysis. Visual examination of the plot is fraught with possible errors of interpretation. Since visual examination of a plot is inherently subjective, different analysts might come to different conclusions from the same data plot. The recommended statistical quantities to be reported from the regression analysis have nothing to do with linearity (or much of anything else, for that matter). R² is rather strongly recommended, but the problem with using R² to assess linearity was nicely illustrated by Tom Fearn [13], who showed that random error can cause linear data to have a lower value of R² than nonlinear data with less random error, making the test actively misleading. Furthermore, there is a problem with all the statistics mentioned; this problem is demonstrated by the work of Anscombe [14] in a fascinating paper that everyone doing any sort of statistical calibration work should read. Anscombe’s work was also the basis of a more recent paper dealing with how misunderstanding the statistics can cause someone to become misled [15]. We will not repeat Anscombe’s presentation, but we will describe what he did, and strongly recommend that the original paper be obtained and perused (or alternatively, the paper by Fearn [15]). 
In his classic paper, Anscombe provides four sets of (synthetic, to be sure) univariate data, with obviously different characteristics. The data are arranged so as to permit univariate regression to be applied to each set. The defining characteristic of one of the sets is severe nonlinearity. But when you do the regression calculations, all four sets of data are found to have identical calibration statistics: the slope, y-intercept, SEE, R², F-test and residual sum of squares are the same for all four sets of data. Since the numeric values that are calculated are the same for all data sets, it is clearly impossible to use these numeric values to identify any of the characteristics that make each set unique. In the case that is of interest to us, those statistics provide no clue as to the presence or absence of nonlinearity. So the fact of the matter is that the reason the recommended statistics do not tell us about linearity is that, as Anscombe shows, they cannot tell us about linearity. In fact, the recommendations in the official guidelines, while well-intended, are themselves not suitable for their intended purpose in this regard, not even for univariate methods of analysis. For starters, they do not provide a good definition of linearity that can be used as the basis for deciding whether a given set conforms to the desired criterion of being linear. Therefore, let us start by proposing a definition, one that can at least serve as a basis for our own discussions. Let us define linearity as “The property of data comparing test results to actual concentrations, such that a straight line provides as good a fit (using the least-squares criterion) as any other mathematical function.” We continue in our next chapter with a discussion of using the Durbin-Watson Statistic for testing for nonlinearity.
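Anscombe's point can be reproduced in a few lines of Python. The sketch below uses two of the four data sets as they are commonly tabulated from reference [14] (the first, roughly linear set and the second, strongly curved set) and computes slope, intercept and R² for each; the helper function is hypothetical.

```python
def line_fit(x, y):
    """Return (slope, intercept, r_squared) of the least-squares straight line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, sxy ** 2 / (sxx * syy)

# Anscombe data sets I and II (shared x values), as commonly tabulated
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_linear = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y_curved = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

fit_linear = line_fit(x, y_linear)
fit_curved = line_fit(x, y_curved)
```

Both fits return essentially the same slope, intercept and R², even though a plot of the second set shows an obvious curve; the summary statistics alone cannot distinguish the two cases.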


REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 13(6), 19–21 (1998).
2. Mark, H. and Workman, J., Spectroscopy 13(11), 18–21 (1998).
3. Mark, H. and Workman, J., Spectroscopy 14(1), 16–17 (1998).
4. Mark, H. and Workman, J., Spectroscopy 14(2), 16–27,80 (1999).
5. Mark, H. and Workman, J., Spectroscopy 14(5), 12–14 (1999).
6. Mark, H. and Workman, J., Spectroscopy 14(6), 12–14 (1999).
7. Mark, H., Applied Spectroscopy 42(5), 832–844 (1988).
8. Miller, C.E., NIR News 4(6), 3–5 (1999).
9. Fearn, T., NIR News 12(6), 14–15 (2001).
10. Davies, T., Spectroscopy Europe 10(4), 28–31 (1998).
11. Webster’s Seventh New Collegiate Dictionary (G. & C. Merriam Co., Springfield, MA, 1970).
12. ICH-Q2A, Food and Drug Administration, March 1, 1995.
13. Fearn, T., NIR News 11(1), 14–15 (2000).
14. Anscombe, F.J., The American Statistician 27, 17–21 (1973).
15. Fearn, T., NIR News 7(1), 3, 5 (1996).

64

Linearity in Calibration: Act III Scene II – A Discussion of the Durbin-Watson Statistic, a Step in the Right Direction

As we left off in Chapter 63, we had proposed a definition of linearity. Now let us start by delving into the ins and outs of the Durbin-Watson statistic [1–6] and looking at how to use it to test for nonlinearity. In fact, we have talked about the Durbin-Watson statistic in previous chapters, although a long time ago and under a different name. Quite a while ago we published a column titled “Alternative Ways to Calculate Standard Deviation” [7]. One of the alternative ways described was the calculation by Successive Differences. As we shall see, that calculation is very closely related indeed to the Durbin-Watson Statistic. More recently we described this statistic (more directly named) in a sidebar to an article in the American Pharmaceutical Review [8]. To relate the Durbin-Watson Statistic to our current concerns, we go back to the basics of statistical analysis and remind ourselves how statisticians think about Statistics. Here we get into the deep thickets of statistical theory and meaning and philosophy. We will try to keep it as simple as possible, though. Let us start with two of the formulas for Standard Deviation presented in earlier chapters and columns [7]. One of the formulas is the “ordinary” formula for standard deviation:

$$SD_1 = \sqrt{\frac{\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2}{n-1}} \tag{64-1}$$

The other formula is the formula for calculating Standard Deviation by Successive Differences:

$$SD_2 = \sqrt{\frac{\sum_{i=1}^{n-1} \left(X_{i+1} - X_i\right)^2}{2(n-1)}} \tag{64-2}$$

Now we ask ourselves the question: “If we calculate the standard deviation for a set of data (or errors) from these two formulas, will they give us the same answer?” And the answer to that question is that they will, IF (that’s a very big “if”, you see) the data and the errors have the characteristics that statisticians consider “good” statistical properties: random, independent (uncorrelated), constant variance, and in this case, a Normal distribution, and for errors, a mean (μ) of zero, as well. 
For a set of data that meets all these criteria, we can expect the two computations to produce the same answer (within the limits of what is sometimes loosely called “Statistical variability”).
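The comparison of the two formulas can be tried numerically. The following is a minimal sketch, not part of the original column: the function names, the seed and the 5000 simulated readings are our own assumptions, chosen only to illustrate equations 64-1 and 64-2.

```python
import math
import random


def sd_ordinary(values):
    """Equation 64-1: the ordinary (n - 1) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))


def sd_successive_differences(values):
    """Equation 64-2: standard deviation from successive differences."""
    n = len(values)
    squared_diffs = sum((values[i + 1] - values[i]) ** 2 for i in range(n - 1))
    return math.sqrt(squared_diffs / (2 * (n - 1)))


# Hypothetical data: 5000 independent Normal readings (mean 12, SD 0.05).
rng = random.Random(0)
random_data = [rng.gauss(12.0, 0.05) for _ in range(5000)]
ratio_random = sd_successive_differences(random_data) / sd_ordinary(random_data)

# Sorting the same numbers makes neighbouring values strongly correlated,
# which violates the independence assumption and pulls the ratio far below 1.
sorted_data = sorted(random_data)
ratio_sorted = sd_successive_differences(sorted_data) / sd_ordinary(sorted_data)

print(f"ratio SD2/SD1, random order: {ratio_random:.3f}")
print(f"ratio SD2/SD1, sorted order: {ratio_sorted:.3f}")
```

For the randomly ordered readings the ratio should come out close to unity; for the sorted copy, which contains exactly the same numbers, it should collapse toward zero, because only the ordering (and hence the independence of successive values) has changed.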

So under conditions where we expect the same answer from both computations, we expect the ratio of the computations to equal 1 (unity). Basically, this is a general description of how statisticians think about problems: compare the results of two computations of what is nominally the same quantity when all conditions meet the specified assumptions. Then if the comparison fails, this constitutes evidence that something about the data is not conforming to the expected characteristics (i.e., is not random, is correlated, is heteroscedastic, is not Normal, etc.).

The Durbin-Watson statistic is that type of computation, stripped to its barest essentials. Dividing equation 64-2 by equation 64-1 above, squaring the result, canceling similar terms, noting that the mean error is zero and ignoring the constant factor of 2 in the denominator of equation 64-2, we arrive at

$$ DW = \frac{\sum_{i=1}^{n-1} \left( e_{i+1} - e_i \right)^2}{\sum_{i=1}^{n} e_i^2} \tag{64-3} $$

Because of the way it is calculated, particularly the way the constant factor is ignored, the expected value of DW is two when the data does in fact meet all the specified criteria: random, independent errors, etc. Nonlinearity will cause the computed value of DW to be statistically significantly less than two. (Homework assignment for the reader: what characteristic will make DW be statistically significantly greater than two?)

Figures 64-1 and 64-2 illustrate graphically what happens when you inspect the residuals from a calibration. When you plot linear data, the data are evenly spread out around the calibration line as shown in Figure 64-1a. When plotting the residuals, the line representing the calibration line is brought into coincidence with the X-axis, so that the residuals are evenly spread out around the X-axis, as shown in Figure 64-1b. For nonlinear data, shown in Figure 64-2a, a plot of the residuals shows that although the calibration line still coincides with the X-axis, the data does not follow that line.
Therefore, although the residuals still have equal positive and negative values, they are no longer spread out evenly around the zero line because the actual function is no longer a straight line. Instead the residuals are evenly spread out around some hypothetical curved line (shown) representing the actual (nonlinear) function describing the data.

In both the linear and the nonlinear cases the total variation of the residuals is the sum of the random error plus the departure from linearity. When the data is linear, the variance due to the departure from linearity is effectively zero. For a nonlinear set of data, since the X-difference between adjacent data points is small, the nonlinearity of the function makes minimal contribution to the total difference between adjacent residuals, and most of the difference contributing to the successive differences in the numerator of the DW calculation is due to the random noise of the data. The denominator term, on the other hand, is dependent almost entirely on the systematic variation due to the curvature, and for nonlinear data this is much larger than the random noise contribution. Therefore the denominator variance of the residuals is much larger than the numerator variance when nonlinearity is present, and the Durbin-Watson statistic reflects this by assuming a value less than 2.

The problem we all have is that we all want answers to be in clear, unambiguous terms (yes/no, black/white, is/isn’t linear, and so on), while Statistics deals in probabilities. It is certainly true that there is no single statistic (not SEE, not R², not DW, nor any other) that is going to answer the question of whether a given set of data, or residuals, has a linear relation. If we wanted to be REALLY ornery, we could even argue that “linearity” is, as with most mathematical concepts, an idealization of a property that

Figure 64-1 A graphic illustration of the behavior of linear data. (a) Test value plotted against concentration: linear data are spread out around a straight line. (b) Residual plotted against concentration: the residuals are spread evenly around zero.

NEVER exists in real data. But that is not productive, and does not address the real-world issues that confront us.

What are some of these real-world issues? Well, you might want to check out the following paper: Anscombe, F.J., “Graphs in Statistical Analysis” [9]. I will describe his results again, but it really is worth getting hold of and reading the original paper anyway; it is quite an eye-opener. What Anscombe presents are four sets of synthetic data, representing four simple (single X-variable) regression situations. One of the data sets represents a reasonably well-behaved set of data: the data are uniformly distributed along the X-axis, the errors are random, independent and Normally distributed, and in all respects the set has the properties that statisticians consider “good”. The other three sets show very gross departures, of varying kinds (including one that is severely nonlinear),

Figure 64-2 A graphic illustration of the behavior of nonlinear data. (a) Test value plotted against concentration: nonlinear data does not surround a straight line evenly. (b) Residual plot (horizontal axis labeled “Wavelength” in the original figure), annotated with the “Operative difference for numerator” and the “Operative difference for denominator”: the residuals from nonlinear data are not spread out around zero.

from this well-behaved data set. So what is the big deal about that? The big deal is that, by design, all four sets of data have identical values of all the common regression statistics: coefficients, SEE, R², and so on. The intent is, of course, to show that no set of statistics can unambiguously diagnose all possible problems in all situations. When you look at the graphs of the four data sets, on the other hand, it is immediately clear which is the “good” one, which ones have the problems, and what the problems are. Any statistician worth his salt will tell you that if you are doing calibration work, you should examine the residual plots, and any others that might be informative.
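Anscombe's point can be checked with a few lines of code. This is a sketch under stated assumptions: the numbers are the commonly reproduced listing of Anscombe's four data sets (they should be verified against the original paper [9] before being relied upon), and the helper `fit_line` is our own hypothetical name.

```python
def fit_line(x, y):
    """Least-squares slope, intercept and R^2 for a single X-variable."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Data as commonly tabulated from Anscombe (1973); sets I-III share X-values.
x_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
anscombe = {
    "I": (x_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                  7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                   6.13, 3.10, 9.13, 7.26, 4.74]),   # the curved set
    "III": (x_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                    6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {name: fit_line(x, y) for name, (x, y) in anscombe.items()}
for name, (slope, intercept, r_squared) in summary.items():
    print(f"set {name:>3}: slope={slope:.2f}  intercept={intercept:.2f}  "
          f"R^2={r_squared:.2f}")
```

To two decimal places the four lines of output should be indistinguishable, which is exactly why the numerical summary alone cannot tell the well-behaved set from the three defective ones.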

But the FDA/ICH guidelines do not promote that approach, even though they mention such plots. To the contrary, they emphasize calculating and submitting the numerical results from the line fitting process. Under ordinary circumstances, that is really not too bad, as long as you understand what it is you are doing, which usually means going back to basic statistical theory. This theory says that IF data meets certain criteria, criteria that (always) include the fact that the errors are random and independent, and (usually) Normally distributed, then certain calculations can be done and PROBABILISTIC statements made about the results of those calculations. If you make the calculation and the value turns out to be one of low probability, then that is taken as evidence that your data fail to meet one or more of the criteria that they are assumed to meet. Note that the calculation alone does not tell you which criterion is not met; the criterion that it does not meet may or may not be the one you are concerned with.

The converse, however, is, strictly speaking, not true. If your calculated result turns out to be a high-probability value, that does NOT “prove” that the data meet the criteria. That is what Anscombe’s paper is demonstrating, because there is a (natural) tendency to forget that point, and assume that a “good” statistic means “good” data.

So where does that leave us? Does it mean that statistics are useless, or that the FDA is clueless? No, but it means that all these things have to be done with an eye to knowing what can go wrong. I strongly suspect that the FDA has taken the position it does because it has found that, even though numerical statistics are not perfect, they provide an objective measure of calibration performance, and it has found through hard experience that the subjective interpretation of graphs is even more fraught with problems than the use of admittedly imperfect statistics.
For similar reasons, the statement “If the Durbin-Watson test demonstrates a correlation, then the relationship between the two assays is not linear” is not exactly correct, either. Under some circumstances, a linear correlation can also give rise to a statistically significant value of DW. In fact, for any statistic, it is always possible to construct a data set that gives a high-probability value for the statistic, yet the data clearly and obviously fail to meet the pertinent criteria (again, Anscombe is a good example of this for a few common statistics).

So what should we do? Well, different statistics show different sensitivities to particular departures from the ideal, and this is where DW comes in. The key to calculating the Durbin-Watson statistic is that prior to performing the calculation, the data must be put into a suitable order. The Durbin-Watson statistic is then sensitive to serial correlations of the ordered data. While serial correlation is often thought of in connection with time series, that is only one of its applications.

Draper and Smith [1] discuss the application of DW to the analysis of residuals from a calibration; their discussion is based on the fundamental work of Durbin, et al., in the references listed at the beginning of this chapter. While we cannot reproduce their entire discussion here, at the heart of it is the fact that there are many kinds of serial correlation, including linear, quadratic and higher order. As Draper and Smith show (on p. 64), the linear correlation between the residuals from the calibration data and the predicted values from that calibration model is zero. Therefore if the sample data is ordered according to the analyte values predicted from the calibration model, a statistically significant value of the Durbin-Watson statistic for the residuals is indicative of high-order serial correlation, that is, nonlinearity.
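The zero linear correlation between residuals and predicted values is easy to confirm numerically. The sketch below is ours, not Draper and Smith's: the data are a hypothetical noise-free curved response, and the comparison with the square of the centered predictions is only one simple way of making the remaining higher-order structure visible.

```python
import math


def fit_line(x, y):
    """Least-squares slope and intercept for a single X-variable."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def correlation(a, b):
    """Ordinary (linear) correlation coefficient between two lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical, deliberately curved "calibration" data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [0.2 * xi ** 2 for xi in x]

slope, intercept = fit_line(x, y)
predicted = [intercept + slope * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

# Linear correlation with the predictions: zero apart from rounding error.
r_linear = correlation(residuals, predicted)

# Correlation with the squared, centered predictions: the curvature remains.
mean_pred = sum(predicted) / len(predicted)
squared_term = [(pi - mean_pred) ** 2 for pi in predicted]
r_quadratic = correlation(residuals, squared_term)

print(f"residuals vs predicted:           {r_linear:.2e}")
print(f"residuals vs (predicted - mean)^2: {r_quadratic:.3f}")
```

The least-squares fit removes every trace of linear correlation by construction, so whatever serial structure DW finds in residuals ordered by predicted value has to be of higher order.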
Draper and Smith point out that you need a minimum of fifteen samples in order to get meaningful results from the calculation of the Durbin-Watson statistic [1]. Since the

Anscombe data set contains only eleven readings, statistically meaningful statements cannot be made. Nevertheless it is interesting to see the results of the Durbin-Watson statistic applied to the nonlinear set of Anscombe data; the value of the statistic is 1.5073. For comparison, the Durbin-Watson statistic for the data set representing normal “good” data is 2.4816.

Is DW perfect? Not at all. The way it is calculated, the highest-probability value (the “expected” value) for DW is, as we saw above, 2. Yet it is possible to construct a data set that has a DW value of 2, and is clearly and obviously not linear, as well as being non-random. That data set is

0 1 0 −1 0 1 0 −1 0 1 0 −1 0 1 0 −1 0     (Data set 1)

But for ordinary data, we would not expect such a sequence to happen. This is the reason most statistics work as general indicators of data performance: the special cases that cause them to fail are themselves low-probability occurrences. In this case the problem is not whether or not the data are nonlinear; the problem is that they are nonrandom. This is a perfect example of the data failing to meet a criterion other than the one you are concerned with. Therefore the Durbin-Watson test fails, as would any statistical test for such data; they are simply not amenable to meaningful statistical calculations. Nevertheless, a “blind” computation of the Durbin-Watson statistic would give an apparently satisfactory value. This is a warning that other characteristics of the data can cause it to appear to meet the criteria. You have to know what CAN occur.

The mechanics of calculating DW for testing linearity are relatively simple, once you have gone through all the above: sort the data set according to the values predicted from the calibration model, then do the calculation specified in Equation 64-3. Note that, while the sorting is done using the predicted values from the model, the DW calculations are done using the residuals.
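The mechanics just described can be sketched in a few lines. The function names and the noise-free curved example are our own assumptions for illustration; tables of critical values (see Draper and Smith [1]) are still needed to judge statistical significance, and this sketch does not supply them.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for a single X-variable."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def durbin_watson(residuals):
    """Equation 64-3 applied to residuals that are already in order."""
    numerator = sum((residuals[i + 1] - residuals[i]) ** 2
                    for i in range(len(residuals) - 1))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def dw_for_linearity(x, y):
    """Fit a line, sort by predicted value, then compute DW on the residuals."""
    slope, intercept = fit_line(x, y)
    predicted = [intercept + slope * xi for xi in x]
    residuals = [yi - pi for yi, pi in zip(y, predicted)]
    order = sorted(range(len(x)), key=lambda i: predicted[i])
    return durbin_watson([residuals[i] for i in order])


# "Data set 1" from the text: non-random, yet DW = 16 / 8 = 2.0 exactly.
data_set_1 = [0, 1, 0, -1] * 4 + [0]
dw_data_set_1 = durbin_watson(data_set_1)

# Hypothetical noise-free curved data, deliberately supplied out of order
# to show that the sorting step is what makes the statistic meaningful.
x = [(8 * i) % 21 for i in range(21)]
y_curved = [xi ** 2 for xi in x]
dw_curved = dw_for_linearity(x, y_curved)

print(f"DW for Data set 1:        {dw_data_set_1}")
print(f"DW for curved calibration: {dw_curved:.3f}")
```

The curved example yields a value far below 2, while the alternating sequence of Data set 1 yields exactly 2 despite being anything but random, which is the “blind computation” hazard described above.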
But anyone doing calibration work should read Draper and Smith anyway; it is the “bible” of regression analysis. The full reference is given in the reference list [1]. The discussions of DW are on p. 69 and pp. 181–192 of Draper and Smith (third edition; the second edition contains a similar but somewhat less extensive discussion). They also include an algorithm and tables of critical values for deciding whether the correlation is statistically significant or not. You might also want to check out page 64 for the proof that the linear correlation between residuals and predicted values from the calibration is zero.

So DW and R² test different things. As a specific test for nonlinearity, what is the relative utility of DW versus R²? Basically, the answer is that when DW is computed in the way Draper and Smith (and I) described, it is specifically sensitive to nonlinearity in the predictions. So, for example, in the case of the Anscombe data, all the other statistics (including R²) might be considered satisfactory, and since they are the same for all four sets of data, all four sets would be considered satisfactory. But if you do the DW test on the data showing nonlinearity, it will flag it as having a low value of the statistic. Anscombe did not provide enough samples’ worth of synthetic data in his sets, however, for the calculated statistics to be statistically meaningful.

We also note that as a practical matter, meaningful calculation of the Durbin-Watson Statistic requires many samples’ worth of data. We noted above that for fewer than

fifteen samples, critical values for this statistic are not listed in the tables. The reason for requiring so many samples is that we are essentially comparing two variances (or, at least, two measures of the same variance). Since variances are distributed as χ², for small numbers of samples this statistic has a very wide range of values indeed, so that comparisons become virtually meaningless because almost anything will fall within the confidence interval, giving this test low statistical power.

On the other hand, characterizing R² as a general measure of how good the fit is does not make us flinch, either; it is one of the standard statistics for doing that evaluation. Quite the contrary, when we saw it being specified as a way to test linearity, we wondered why that was chosen by the FDA and ICH, since it is so NON-specific. We still do not know why, except for the obvious guess that they did not know about DW. We are in favor of keeping the other statistics as measures of the general “goodness of fit” of the model to the data, but in the specific context of trying to assess linearity, we still have to promote DW over R² as being more suited for that special purpose, although we will eventually discuss in our next few chapters an even better method for assessing linearity. After all, it was the section on “Linearity” where this all came up.

As for testing other characteristics of a univariate calibration, there are also ways to test for statistical significance of the slope, to see whether unity slope adequately describes the relationship between test results and analyte concentration. These are described in the book Principles and Practice of Spectroscopic Calibration [10]. The Statistics are described there, and are called the “Data Significance t” test and the “Slope Significance t” test (or DST and SST tests!). Unless the DST is statistically significant, the SST is meaningless, though. In principle, there is also a test for the intercept.
But since the expected value for the intercept depends on the slope, it gets a bit hairy. It also makes the confidence interval so large that the test is nigh on useless; few statisticians recommend it.

But let us add this coda to the discussion of DW: the fact that DW is specifically sensitive to nonlinearity does not mean that it is perfect. There may be cases of nonlinearity that will not be detected (especially if it is a marginal amount), linear data will occasionally be flagged as nonlinear (α% of the time, in the long run) and other types of defects in the data may show up by giving a statistically significant value to DW. But all this is true for any and all statistics. The existence of at least one data set that is known to fool the calculation is a warning that the Durbin-Watson statistic, while a (large) step in the right direction, is not the ultimate answer.

Some further comments here: there does seem to be some confusion between the usage of the statistics recommended by the guidelines, which are excellent for their intended purpose of testing the general “goodness of fit” of a model, and the specific testing of a particular model characteristic, such as linearity. A good deal of this confusion is probably due to the fact that the guidelines recommend those general statistics for the specific task of testing linearity. As Anscombe shows, however, and as we referred to previously, those generalized statistics are not up to the task.

In our next chapter we will discuss other methods of testing for linearity that have appeared in the literature. Afterward, we will turn our attention to a new test that has been devised. In fact, it turns out that while DW has much to recommend it, it is not the final or best answer. The new method is much more direct and specific even than DW. It is the correct way to test for linearity. We will discuss it all in due course, in this same place.
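The earlier remark that DW has a very wide range of values for small numbers of samples can be illustrated by simulation. This is a rough sketch under our own assumptions: raw independent Normal random numbers stand in for residuals (real residuals also lose degrees of freedom to the fitted model, which the published tables account for), and the seed, sample sizes and number of trials are arbitrary.

```python
import random
import statistics


def durbin_watson(residuals):
    """Equation 64-3 applied to an ordered list of residuals."""
    numerator = sum((residuals[i + 1] - residuals[i]) ** 2
                    for i in range(len(residuals) - 1))
    return numerator / sum(e ** 2 for e in residuals)


def simulate_dw(n_samples, n_trials, rng):
    """DW values for purely random, independent Normal 'residuals'."""
    return [durbin_watson([rng.gauss(0.0, 1.0) for _ in range(n_samples)])
            for _ in range(n_trials)]


rng = random.Random(64)
central_95 = {}   # sample size -> (2.5th percentile, 97.5th percentile)
mean_dw = {}      # sample size -> average DW over all trials
for n_samples in (10, 15, 100):
    values = sorted(simulate_dw(n_samples, 2000, rng))
    low = values[int(0.025 * len(values))]
    high = values[int(0.975 * len(values))]
    central_95[n_samples] = (low, high)
    mean_dw[n_samples] = statistics.mean(values)
    print(f"n={n_samples:>3}: mean DW={mean_dw[n_samples]:.2f}, "
          f"central 95% of values from {low:.2f} to {high:.2f}")
```

With only ten or fifteen readings the central 95% band of DW for perfectly well-behaved data should be seen to be very broad, so only gross nonlinearity can push the statistic outside it; with a hundred readings the band tightens considerably around 2.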

REFERENCES

1. Draper, N. and Smith, H., Applied Regression Analysis, 3rd ed. (John Wiley & Sons, New York, 1998).
2. Durbin, J. and Watson, G.S., Biometrika 37, 409–428 (1950).
3. Durbin, J. and Watson, G.S., Biometrika 38, 159–178 (1951).
4. Durbin, J., Biometrika 56, 1–15 (1969).
5. Durbin, J., Econometrica 38, 422–429 (1970).
6. Durbin, J. and Watson, G.S., Biometrika 58, 1–19 (1971).
7. Mark, H. and Workman, J., Spectroscopy 2(11), 38–42 (1987).
8. Ritchie, G. and Ciurczak, E., American Pharmaceutical Review 3(3), 34–40 (2000).
9. Anscombe, F.J., The American Statistician 27, 17–21 (1973).
10. Mark, H., Principles and Practice of Spectroscopic Calibration (John Wiley & Sons, New York, 1991).

65

Linearity in Calibration: Act III Scene III – Other Tests

for Nonlinearity

We continue the subject matter of the previous chapter: a discussion of other ways to test data for nonlinearity. Let us begin by reviewing what we want to test. The FDA/ICH guidelines, starting from a univariate perspective, consider the relationship between the actual analyte concentration and what they generically call the “test result”, a term that is independent of the technology used to ascertain the analyte concentration. This term therefore holds good for every analytical methodology from manual wet chemistry to the latest high-tech instrument. In the end, even the latest instrumental methods have to produce a number, representing the final answer for that instrument’s quantitative assessment of the concentration, and that is the test result from that instrument. This is a univariate concept to be sure, but it is the same concept that applies to all other analytical methods. Things may change in the future, but this is currently the way analytical results are reported and evaluated. So the question to be answered, for any given method of analysis, is: is the relationship between the instrument readings (test results) and the actual concentration linear?

Three tests of this characteristic were discussed in the previous chapters: the FDA/ICH recommendation of linear regression with a report of various regression statistics, visual inspection of a plot of test results versus the actual concentrations, and use of the Durbin-Watson Statistic. Since we previously analyzed these tests we will not discuss them further here, but will summarize them in Table 65-1, along with other tests for nonlinearity that we explain and discuss in this chapter. So we now proceed to present various linearity tests that can be found in the statistical literature:

F-TEST

Figure 65-1 shows a schematic representation of the F-test for linearity. Note that there are some similarities to the Durbin-Watson test. The key difference between this test and the Durbin-Watson test is that in order to use the F-test as a test for (non)linearity, you must have measured many repeat samples at each value of the analyte. The variabilities of the readings for each sample are pooled, providing an estimate of the within-sample variance. This is indicated by the label “Operative difference for denominator”. By Analysis of Variance, we know that the total variation of residuals around the calibration line is the sum of the within-sample variance $S^2_{within}$ plus the variance of the means around the calibration line. Now, if the residuals are truly random, unbiased, and in particular the model is linear, then we know that the means for each sample will cluster

Table 65-1 Various tests for (non)linearity that have been proposed and a summary of their characteristics

Test method: Visual inspection of plot
Advantages: Works
Disadvantages: Cannot be automated; cannot be tested statistically; subjective

Test method: Durbin-Watson statistic
Advantages: Works; objective; is statistically testable; can be computerized
Disadvantages: Has “fatal flaw”; requires large number of samples; low statistical power

Test method: FDA/ICH recommendation (linear regression with report of slope, intercept, correlation coefficient, and residual sum of squares)
Advantages: Objective; can be computerized; uses standard statistics
Disadvantages: Doesn’t work as a test of linearity

Test method: F-test
Advantages: Objective; can be computerized; uses standard statistics
Disadvantages: Requires large number of samples; low statistical power; usually not applicable to historical data; not specific for nonlinearity (other defects in the data may be flagged as nonlinearity)

Test method: Normal distribution of residuals
Advantages: Objective; can be computerized; uses standard statistics
Disadvantages: Very insensitive; very low statistical power; not specific for nonlinearity

randomly around the calibration line, and their variance will equal $S^2_{within}/n$, that is, their standard deviation will equal $S_{within}/n^{1/2}$ (indicated by the label “Operative difference for numerator”). The ratio of these two variances, once they are put on the same scale, will be distributed as the F-distribution, with an expected value of unity. If there is nonlinearity, such as is shown in Figure 65-1, then the variance corresponding to the means will be inflated by the systematic offset of each sample, and the computed F-ratio will be statistically significantly larger than unity.

This test thus shares several characteristics with the Durbin-Watson test. It is based on well-known and rigorously sound statistics. It is amenable to automated computerized calculation, and suitable for automatic operation in an automated process situation. It does not have the “fatal flaw” of the Durbin-Watson Statistic. On the other hand, it also shares some of the disadvantages of the Durbin-Watson Statistic. It is also based on a comparison of variances, so that it is of low statistical power. It requires many more samples and readings than the Durbin-Watson statistic does, since each sample must be measured many times. In general, it is not applicable

Figure 65-1 Schematic representation of the residuals for the F-test. Residuals are plotted against predicted values, with the zero line marked. Labels on the plot: “Mean”, “Operative difference for numerator” and “Operative difference for denominator”.

to historical data, since the data must have been collected using the proper protocols, and rarely are so many readings taken for each sample as this test requires. It is also not specific for nonlinearity. Outliers, poorly fitting models, bias or error in the reference values or other defects of the data may appear to be nonlinearity.
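The computation behind this test can be sketched as follows. We assume here the standard textbook lack-of-fit form of the F-ratio (scatter of the sample means about the fitted line, scaled by the number of repeats, divided by the pooled within-sample variance); the function name and the contrived repeat readings are hypothetical, and the critical value of F still has to be looked up for the degrees of freedom shown.

```python
def lack_of_fit_f(groups):
    """F-ratio for linearity from repeat readings.

    groups maps each analyte value to the list of repeat readings there.
    Returns (F, numerator degrees of freedom, denominator degrees of freedom).
    """
    xs = [x for x, readings in groups.items() for _ in readings]
    ys = [r for readings in groups.values() for r in readings]
    n_total, n_levels = len(xs), len(groups)

    # Straight-line least-squares fit to all of the readings.
    mean_x, mean_y = sum(xs) / n_total, sum(ys) / n_total
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    ss_within = 0.0   # pooled scatter of repeats about their own mean
    ss_means = 0.0    # scatter of the sample means about the fitted line
    for x, readings in groups.items():
        mean_r = sum(readings) / len(readings)
        ss_within += sum((r - mean_r) ** 2 for r in readings)
        ss_means += len(readings) * (mean_r - (intercept + slope * x)) ** 2

    df_means = n_levels - 2
    df_within = n_total - n_levels
    return (ss_means / df_means) / (ss_within / df_within), df_means, df_within


# Contrived data: three repeats per sample, offset by -0.1, 0 and +0.1.
offsets = (-0.1, 0.0, 0.1)
linear_groups = {x: [2.0 * x + d for d in offsets] for x in range(1, 7)}
curved_groups = {x: [float(x ** 2) + d for d in offsets] for x in range(1, 7)}

f_linear, df_num, df_den = lack_of_fit_f(linear_groups)
f_curved, _, _ = lack_of_fit_f(curved_groups)
print(f"linear data: F = {f_linear:.3g} on ({df_num}, {df_den}) df")
print(f"curved data: F = {f_curved:.3g} on ({df_num}, {df_den}) df")
```

In the contrived linear case the sample means fall exactly on the line, so the ratio is essentially zero; in the curved case the systematic offsets of the means inflate the numerator enormously. Real data with random means would give a ratio near unity when the relationship is linear.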

NORMALITY OF RESIDUALS

In a well-behaved calibration model, residuals will have a Normal (i.e., Gaussian) distribution. In fact, as we have previously discussed, least-squares regression analysis is also a Maximum Likelihood method, but only when the errors are Normally distributed. If the data does not follow the straight line model, then there will be an excessive number of residuals with too-large values, and the residuals will then not follow the Normal distribution. It follows, then, that a test for Normality of residuals will also detect nonlinearity.

Over time, statisticians have devised many tests for the distributions of data, including one that relies on visual inspection of a particular type of graph. Of course, this is no more than the direct visual inspection of the data or of the calibration residuals themselves. However, a statistical test is also available: the χ² test for distributions, which we have previously described. This test could be applied to the question, but shares many of the disadvantages of the F-test and other tests. The main difficulty is the practical one: this test is very insensitive, and therefore requires both a large number of samples and a large departure from linearity before it can detect the nonlinearity. Also, like the F-test, it is not specific for nonlinearity; a false positive indication can also be triggered by other types of defects in the data.

We continue in our next chapter with an explanation of a new test that has been devised, one that overcomes the limitations of the various tests we have described.
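For readers who want to experiment with the χ² test of Normality mentioned above, the following is a minimal sketch. The equal-probability binning, the choice of eight bins and the two contrived sets of “residuals” are our assumptions; with the mean and standard deviation estimated from the data, the statistic would conventionally be referred to a χ² table with (number of bins − 3) degrees of freedom.

```python
from statistics import NormalDist, mean, stdev


def chi_square_normality(residuals, n_bins=8):
    """Chi-square goodness-of-fit statistic against a Normal distribution."""
    n = len(residuals)
    center, spread = mean(residuals), stdev(residuals)
    standardized = [(r - center) / spread for r in residuals]

    # Bin edges chosen so each bin has the same probability under N(0, 1).
    edges = [NormalDist().inv_cdf(k / n_bins) for k in range(1, n_bins)]
    counts = [0] * n_bins
    for value in standardized:
        counts[sum(value > edge for edge in edges)] += 1

    expected = n / n_bins
    return sum((c - expected) ** 2 / expected for c in counts)


# Contrived, very nearly Normal residuals: evenly spaced Normal quantiles.
near_normal = [NormalDist().inv_cdf((i + 0.5) / 200) for i in range(200)]

# Contrived, grossly non-Normal residuals: two clusters and nothing between.
bimodal = [-1.0] * 100 + [1.0] * 100

chi_normal = chi_square_normality(near_normal)
chi_bimodal = chi_square_normality(bimodal)
print(f"near-Normal residuals: chi-square = {chi_normal:.2f}")
print(f"bimodal residuals:     chi-square = {chi_bimodal:.2f}")
```

The statistic only grows large when the histogram of residuals departs grossly from the Normal shape, which is consistent with the insensitivity noted in the text: a modest curvature in a calibration changes the distribution of residuals far less than the extreme case used here.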

66

Linearity in Calibration: Act III Scene IV – How to Test

for Nonlinearity

In Chapter 65, dealing with linearity [1], we promised we would present a description of what we believe is the best way to test for linearity (or nonlinearity, depending on your point of view). In our Chapters 63 through 65 [1–3], we examined the Durbin-Watson statistic along with other methods of testing for nonlinearity. We found that while the Durbin-Watson statistic is a step in the right direction, it has shortcomings, including the fact that it can be fooled by data that has the right (or wrong!) characteristics. The method we now present is mathematically sound, more subject to statistical validity testing, based on well-known mathematical principles, of much higher statistical power than DW, and able to distinguish different types of nonlinearity from each other. This new method has also been recently described in the literature [4].

But let us begin by discussing what we want to test. The FDA/ICH guidelines, starting from a univariate perspective, consider the relationship between the actual analyte concentration and what they generically call the “test result”, a term that is independent of the technology used to ascertain the analyte concentration. This term therefore holds good for every analytical methodology from manual wet chemistry to the latest high-tech instrument. In the end, even the latest instrumental methods have to produce a number, representing the final answer for that instrument’s quantitative assessment of the concentration, and that is the test result from that instrument. This is a univariate concept to be sure, but it is the same concept that applies to all other analytical methods. Things may change in the future, but this is currently the way analytical results are reported and evaluated. So the question to be answered, for any given method of analysis, is: is the relationship between the instrument readings (test results) and the actual concentration linear?
This method of determining nonlinearity can be viewed from a number of different perspectives, and can be considered as coming from several sources. One way to view it is as having a pedigree as a method of numerical analysis [5]. Our new method of determining nonlinearity (or showing linearity) is also related to our discussion of derivatives, particularly when using the Savitzky-Golay method of convolution functions, as we discussed recently [6]. This last is not very surprising, once you consider that the Savitzky-Golay convolution functions are also (ultimately) derived from considerations of numerical analysis. In some ways it also bears a resemblance to the current method of assessing linearity that the FDA and ICH guidelines recommend, that of fitting a straight line to the data, and assessing the goodness of the fit. As we showed [2, 3], based on the work of Anscombe [7], the currently recommended method for assessing linearity is faulty because it cannot distinguish linear from nonlinear data, nor can it distinguish between nonlinearity and other types of defects in the data. But an extension of that method can.

In our recent chapter we proposed a definition of linearity [2]. We defined linearity as “The property of data comparing test results to actual concentrations, such that a straight line provides as good a fit (using the least-squares criterion) as any other mathematical function.” This almost seems to be the same as the FDA/ICH approach, which we just discredited. But there is a difference. The difference is the question of fitting other possible functions to the data; the FDA/ICH guidelines only specify trying to fit a straight line to the data. Trying other functions is more in line with our own proposed definition of linearity: we can try to fit functions other than a straight line to the data, and if we cannot obtain an improved fit, we can conclude that the data is linear.

It is indeed possible to fit other functions to a set of data, using least-squares mathematics. In fact, this is what the Savitzky-Golay method does. The Savitzky-Golay algorithm, however, does a whole bunch of things, and lumps all those things together in a single set of convolution coefficients: it includes smoothing, differentiation, curve-fitting of polynomials of various degrees and least-squares calculations, does not include interpolation (although it could), and finally combines all those operations into a single set of numbers by which you can multiply your measured data to get the desired final answer directly.

For our purposes, though, we do not want to lump all those operations together. Rather, we want to separate them and retain only those operations that are useful for our own purposes. For starters, we discard the smoothing, the derivatives and the successive (running) fit over different portions of the data set, and keep only the curve-fitting. Texts dealing with numerical analysis tell us what to do and how to do it. Many texts exist dealing with this subject, but we will follow the presentation of Arden [5].
Arden points out, and discusses in detail, many applications of numerical analysis: fitting data, determining derivatives and integrals, interpolation (and extrapolation), solving systems of equations and solving differential equations. These methods are all based on using a Taylor series to form an approximation to a function describing a set of data. The nature of the data and the nature of the approximation considered differ from what we are used to thinking about, however. The data is assumed to be univariate (which is why this is of interest to us here) and to follow the form of some mathematical function, although we may not know what the function is. So all the applications mentioned are based on the concept that since a function exists, our task is to estimate the nature of that function, using a Taylor series, and then evaluate the parameters of the function by imposing the condition that our approximating function must pass through all the data points available, since those data points are all described exactly by that function. Using a Taylor series implies that the approximating function that we wind up with will be a polynomial, and perhaps one of very high degree (the “degree” of a polynomial being the highest power to which the variable is raised in that polynomial). If we have chosen the wrong function, then there may be some error in the estimate of data between the known data points, but at the data points the error must be zero. A good deal of mathematical analysis goes into estimating the error that can occur between the data points.

The concepts of interest to us are contained in Arden’s book in a chapter entitled “Approximation”. This chapter takes a slightly different tack than the rest of the discussion, but one that goes exactly in the direction that we want to go. In this chapter, the scenario described above is changed very slightly.
There is still the assumption that there is a single (univariate) mathematical system (corresponding to “analyte concentration” and “test reading”), and that there is a functional relationship between the two variables of interest, although again the nature of the relationship may be unknown. The difference, however, is the recognition that data may have error; therefore we no longer impose the condition that the function we arrive at must pass through every data point. We replace that criterion with a different one, a criterion that will allow us to say that the function we use to describe the data “follows” the data in some sense. While other criteria can be used, a common criterion used for this purpose is the “least squares” principle: to find parameters for any given function that minimize the sum of the squares of the differences between the data and the corresponding points of the function. Similarly, many different types of functions can be used. Arden discusses, for example, the use of Chebyshev polynomials, which are based on trigonometric functions (sines and cosines). But these polynomials have a major limitation: they require the data to be collected at uniform X-intervals throughout the range of X, and real data will seldom meet that criterion. Therefore, since they are also by far the simplest to deal with, the most widely used approximating functions are simple polynomials; they are also convenient in that they are the direct result of applying Taylor’s theorem, since Taylor’s theorem produces a description of a polynomial that estimates the function being reproduced. Also, as we shall see, they lead to a procedure that can be applied to data having any distribution of the X-values. The polynomial is written as:

$$Y = a_0 + a_1 X + a_2 X^2 + a_3 X^3 + \cdots + a_n X^n \tag{66-4}$$

Note that here again we continue our usual practice of continuing equation and figure numbering through a set of related chapters. While discussing derivatives, we have noted in a previous chapter that for certain data a polynomial can provide a better fit to that data than can a straight line (see Figure 66-6B of [8]). In fact, we reproduce that Figure 66-6B here again as Figure 66-3 in this chapter, for ease of reference. Higher degree polynomials may provide an even better fit, if the data requires it. Arden points this out, and also points out that, for example in the non-approximation case (assuming exact functionality), if the underlying function is in fact itself a polynomial of degree n, then no higher degree polynomial is needed in that case, and in fact, it is impossible to fit a higher polynomial to the data. Even if an attempt is made to do so, the coefficients of any higher-degree terms will be zero. For functions other than polynomials the “best” fit may not be clear, but as we shall see, that will not affect us.

[Figure 66-3: plot of Response versus Wavelength; curves labeled “Second derivative” and “Parabola”.]

Figure 66-3 A quadratic polynomial can provide a better fit to a nonlinear function over a given region than a straight line can; in this case the second derivative of a Normal absorbance band.

The mathematics of fitting a polynomial by least squares are relatively straightforward, and we present a derivation here, one that follows Arden, but is rather generic, as we shall see. Starting from equation 66-4, we want to find coefficients (the $a_i$) that minimize the sum-squared difference between the data and the function’s estimate of that data, given a set of values of X. Therefore we first form the differences:

$$D = a_0 + a_1 X + a_2 X^2 + a_3 X^3 + \cdots + a_n X^n - Y \tag{66-5}$$

Then we square those differences and sum those squares over all the sets of data (corresponding to the samples used to generate the data):

$$\sum_i D^2 = \sum_i \left(a_0 + a_1 X + a_2 X^2 + a_3 X^3 + \cdots + a_n X^n - Y\right)^2 \tag{66-6}$$

The problem now is to find a set of values for the $a_i$ that minimizes $\sum_i D^2$ with respect to each $a_i$. We do this by the usual procedure of taking the derivative of $\sum_i D^2$ with respect to each $a_i$ and setting each of those derivatives equal to zero. Note that since there are n + 1 different $a_i$, we wind up with n + 1 equations, although we only show the first three of the set:

$$\frac{\partial \left(\sum_i D^2\right)}{\partial a_0} = \frac{\partial \sum_i \left(a_0 + a_1 X + a_2 X^2 + a_3 X^3 + \cdots + a_n X^n - Y\right)^2}{\partial a_0} = 0 \tag{66-7a}$$

$$\frac{\partial \left(\sum_i D^2\right)}{\partial a_1} = \frac{\partial \sum_i \left(a_0 + a_1 X + a_2 X^2 + a_3 X^3 + \cdots + a_n X^n - Y\right)^2}{\partial a_1} = 0 \tag{66-7b}$$

$$\frac{\partial \left(\sum_i D^2\right)}{\partial a_2} = \frac{\partial \sum_i \left(a_0 + a_1 X + a_2 X^2 + a_3 X^3 + \cdots + a_n X^n - Y\right)^2}{\partial a_2} = 0 \tag{66-7c}$$

and so on. Now we actually take the indicated derivative of each term and separate the summations, noting that $\partial \left(\sum_i F^2\right) = 2 \sum_i F \, \partial F$ (where F is the inner summation of the $a_i X^i$ terms):

$$2a_0 \sum_i (1) + 2a_1 \sum_i X + 2a_2 \sum_i X^2 + 2a_3 \sum_i X^3 + \cdots + 2a_n \sum_i X^n - 2 \sum_i Y = 0 \tag{66-8a}$$

$$2a_0 \sum_i X + 2a_1 \sum_i X^2 + 2a_2 \sum_i X^3 + 2a_3 \sum_i X^4 + \cdots + 2a_n \sum_i X^{n+1} - 2 \sum_i XY = 0 \tag{66-8b}$$

$$2a_0 \sum_i X^2 + 2a_1 \sum_i X^3 + 2a_2 \sum_i X^4 + 2a_3 \sum_i X^5 + \cdots + 2a_n \sum_i X^{n+2} - 2 \sum_i X^2 Y = 0 \tag{66-8c}$$

and so on.


Dividing both sides of equations 66-8 (a–c) by two eliminates the constant factor, and moving the term involving Y to the right-hand side of each of the resulting equations puts the equations in their final form:

$$a_0 \sum_i (1) + a_1 \sum_i X + a_2 \sum_i X^2 + a_3 \sum_i X^3 + \cdots + a_n \sum_i X^n = \sum_i Y \tag{66-9a}$$

$$a_0 \sum_i X + a_1 \sum_i X^2 + a_2 \sum_i X^3 + a_3 \sum_i X^4 + \cdots + a_n \sum_i X^{n+1} = \sum_i XY \tag{66-9b}$$

$$a_0 \sum_i X^2 + a_1 \sum_i X^3 + a_2 \sum_i X^4 + a_3 \sum_i X^5 + \cdots + a_n \sum_i X^{n+2} = \sum_i X^2 Y \tag{66-9c}$$

and so on.

The values of X and Y are known, since they constitute the data. Therefore equations 66-9 (a–c) comprise a set of n + 1 equations in n + 1 unknowns, the unknowns being the various values of the $a_i$, since the summations, once evaluated, are constants. Therefore, solving equations 66-9 (a–c) as simultaneous equations for the $a_i$ results in the calculation of the coefficients that describe the polynomial (of degree n) that best fits the data. In principle, the relationships described by equations 66-9 (a–c) could be used directly to construct a function that relates test results to sample concentrations. In practice, there are some important considerations that must be taken into account. The major consideration is the possibility of correlation between the various powers of X. We find, for example, that the correlation coefficient of the integers from 1 to 10 with their squares is 0.974, a rather high value. Arden describes this mathematically and shows how the determinant of the matrix formed by equations 66-9 (a–c) becomes smaller and smaller as the number of terms included in equation 66-4 increases, due to correlation between the various powers of X. Arden is concerned with computational issues, and his concern is that the determinant will become so small that operations such as matrix inversion will become impossible to perform because of truncation error in the computer used. Our concerns are not so severe; as we shall see, we are not likely to run into such drastic problems.
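As an illustration of how equations 66-9 can be set up and solved, the following is a minimal sketch in plain Python. It is not Arden's procedure or any particular program's method: the function name `polyfit_normal_equations`, the use of Gaussian elimination with partial pivoting, and the noise-free test data are all assumptions made here for the example.

```python
def polyfit_normal_equations(x, y, degree):
    """Least-squares polynomial fit by solving the normal equations (66-9).

    Returns the coefficients [a0, a1, ..., an] of equation 66-4.
    """
    m = degree + 1
    # Row r of the system reads: sum over c of a_c * sum_i X^(r+c) = sum_i X^r * Y
    a = [[sum(xi ** (r + c) for xi in x) for c in range(m)] for r in range(m)]
    b = [sum((xi ** r) * yi for xi, yi in zip(x, y)) for r in range(m)]

    # Gaussian elimination with partial pivoting on the augmented matrix.
    aug = [row + [rhs] for row, rhs in zip(a, b)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]

    # Back substitution to recover the coefficients.
    coeffs = [0.0] * m
    for r in range(m - 1, -1, -1):
        s = sum(aug[r][c] * coeffs[c] for c in range(r + 1, m))
        coeffs[r] = (aug[r][m] - s) / aug[r][r]
    return coeffs


# Hypothetical, noise-free data generated from Y = 1 + 2X + 3X^2.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0 + 2.0 * xi + 3.0 * xi ** 2 for xi in x]
coeffs = polyfit_normal_equations(x, y, 2)
print(coeffs)
```

With exact quadratic data the recovered coefficients agree with the generating values to within rounding. For high degrees this direct approach is subject to the ill-conditioning described in the text, which is one reason it is shown here only for a low-degree case.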
Nevertheless, correlation effects are still of concern for us, for another reason. Our goal, recall, is to formulate a method of testing linearity in such a way that the results can be justified statistically. Ultimately we will want to perform statistical testing on the coefficients of the fitting function that we use. In fact, we will want to use a t-test to see whether any given coefficient is statistically significant, compared to the standard error of that coefficient. We do not need to solve the general problem, however, just as we do not need to create the general solution implied by equation 66-4. In the broadest sense, equation 66-4 is the basis for computing the best-fitting function to a given set of data, but that is not our goal. Our goal is only to determine whether the data represent a linear function or not. To this end it suffices to ascertain whether the data can be fitted better by any polynomial of degree greater than 1 than it can by a straight line (which is a polynomial of degree 1). For this we need to test a polynomial of some higher degree. While in some cases the use of more terms may be warranted, in the limit we need test only the ability to fit the data using one term of degree greater than one. Hence, while in general we may wish to try fitting equations of degrees 2, 3, ..., m (where m is some upper limit less than n), we can begin by using polynomials of degree 2, that is, quadratic fits.


A complication arises. We learn from considerations of multiple regression analysis that when two (or more) variables are correlated, the standard errors of the coefficients of both variables are increased over what would be obtained if equivalent but uncorrelated variables were used. This is discussed by Daniel and Wood (see p. 55 in [9]), who show that the variance of the estimates of coefficients (the square of their standard errors) is increased by a factor of

$$\mathrm{VIF} = \frac{1}{1 - R^2} \tag{66-10}$$

when there is correlation between the variables, where R represents the correlation coefficient between the variables and we use the term VIF, as is sometimes done, to mean Variance Inflation Factor. Thus we would like to use uncorrelated variables. Arden describes a general method for removing the correlation between the various powers of X in a polynomial, based on the use of orthogonal Chebyshev polynomials, as we briefly mentioned above. But this method is unnecessarily complicated for our current purposes, and in any case has limitations of its own. When applied to actual data, Chebyshev and other types of orthogonal polynomials (Legendre, Jacobi and others) that could be used will be orthogonal only if the data is uniformly, or at least symmetrically, distributed; real data will not always meet that requirement. Since, as we shall see, we do not need to deal with the general case, we can use a simpler method to orthogonalize the variables, based on Daniel and Wood, who showed how a variable can be transformed so that the square of that variable is uncorrelated with the variable. This is a matter of creating a new variable by simply calculating a quantity Z and subtracting that from each of the original values of X. A symmetric distribution of the data is not required since that is taken into account in the formula. Z is calculated using the following expression (see p. 121 in [9]); we present the derivation of this formula in Appendix A:

$$Z = \frac{\sum_{j=1}^{N} X_j^2 \left(X_j - \bar{X}\right)}{2 \sum_{j=1}^{N} \left(X_j - \bar{X}\right)^2} \tag{66-11}$$
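Equations 66-10 and 66-11 can be checked numerically. The sketch below is an illustration only: the helper names `corr` and `z_shift` and the small asymmetric data set `x_skew` are hypothetical choices made for this example, not taken from Daniel and Wood.

```python
import math


def corr(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def z_shift(x):
    """Z from equation 66-11."""
    mean = sum(x) / len(x)
    numerator = sum(xi ** 2 * (xi - mean) for xi in x)
    denominator = 2.0 * sum((xi - mean) ** 2 for xi in x)
    return numerator / denominator


# The integers 1..10 and their squares, as discussed in the text.
x_sym = list(range(1, 11))
r = corr(x_sym, [xi ** 2 for xi in x_sym])
vif = 1.0 / (1.0 - r ** 2)          # equation 66-10

# A deliberately asymmetric (hypothetical) set of X-values.
x_skew = [1.0, 2.0, 3.0, 4.0, 10.0]
z = z_shift(x_skew)
r_after = corr(x_skew, [(xi - z) ** 2 for xi in x_skew])

print(r, vif, z, r_after)
```

For the integers 1 to 10 the correlation is about 0.974 and the corresponding VIF is close to 20. For the symmetric set, Z equals the mean (5.5); for the asymmetric set it does not, yet the correlation between X and the shifted square is zero to within rounding, which is the property the transformation is meant to provide.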

Then the set of values $(X - Z)^2$ will be uncorrelated with X, and estimates of the coefficients will have the minimum possible variance, making them suitable for statistical testing. In Appendix A, we also present formulas for making the cubes, quartics and, by induction, higher powers of X orthogonal to the set of values of the variable itself. In his discussion of using these approximating polynomials, Arden presents a computationally efficient method of setting up and solving the pertinent equations. But we are less concerned with abstract concepts of efficiency than we are with achieving our goal of determining linearity. To this end, we point out that equations 66-9, and indeed the whole derivation of them, are familiar to us, although in a different context. We are all familiar with using a relationship similar to equation 66-4; in using spectroscopy to do quantitative analysis, one of the representations of the equation involved is

$$C = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_n X_n \tag{66-12}$$


which is the form we commonly use to represent the equations needed for doing quantitative spectroscopic analysis using the MLR algorithm. The various $X_i$ in equation 66-12 represent entirely different variables. Nevertheless, starting from equation 66-12, we can derive the set of equations for calculating the MLR calibration coefficients, in exactly the same way we derived equations 66-9 (a–c) from equation 66-4. An example of this derivation is presented in [10]. Because of this parallelism we can set up the equivalencies:

$$\begin{aligned}
a_0 &= b_0 \\
a_1 &= b_1, & X_1 &= X \\
a_2 &= b_2, & X_2 &= X^2 \\
a_3 &= b_3, & X_3 &= X^3
\end{aligned}$$

and so on, and we see that by replacing our usual MLR-oriented variables $X_1$, $X_2$, $X_3$, and so on with $X$, $X^2$, $X^3$, and so on, respectively, we can use our common and well-understood mathematical methods (and computer programs) to perform the necessary calculations. Furthermore, along with the values of the coefficients, we can obtain all the usual statistical estimates of variances, standard errors, goodness of fit, and so on that MLR programs produce for us. Of special interest is the fact that MLR programs compute estimates of the standard errors of the coefficients, as described by Draper and Smith (see, for example, p. 129 in [11]). This allows testing the statistical significance of each of the coefficients, which, as we recall, are now the coefficients of the various powers of X that comprise the polynomial we are fitting to the data. This is the basis of our tests for nonlinearity. We need not use polynomials of high degree since our goal is not necessarily to fit the data as well as possible. Especially since we expect that well-behaved methods of chemical analysis will produce results that are already close to linearly related to the analyte concentrations, we expect nonlinear terms to decrease as the degree of the fitting equation used increases. Thus we need only fit a quadratic, or at most a cubic, equation to our data to test for linearity, although there is nothing to stop us from using equations of higher degree if we choose. Data well-described by a linear equation will produce a set of coefficients with a statistically significant value for the coefficient of the term $X^1$ (which is X, of course) and non-significant values for the coefficients of $X^2$ or higher degree.

CONCLUSION

This is the basis for our new test of linearity. It has all the advantages we described: it gives an unambiguous determination of whether any nonlinearity is affecting the relationship between the test results and the analyte concentration. It provides a means of distinguishing between different types of nonlinearity, if they are present, since only those that have statistically significant coefficients are active. It also is more sensitive than any other statistical linearity test including the Durbin-Watson statistic. The tables in Draper and Smith for the thresholds of the Durbin-Watson statistic only give values for more than ten samples. As we shall shortly see, however, this method of linearity testing is quite satisfactory for much smaller numbers of samples.

As an example, we applied these concepts to the Anscombe data [7]. Table 66-1 shows the results of applying this to both the “normal” data (Anscombe’s X1, Y1 set) and the data showing nonlinearity. We computed the fit using only a straight-line (linear) fit, as was done originally by Anscombe, and also fitted a polynomial that included the quadratic term. It is interesting to compare results both ways. We find that in all four cases, the coefficient of the linear term is 0.5. In Anscombe’s original paper, this is all he did, and obtained the same result, but this was by design: the synthetic data he generated was designed and intended to give this result for all the data sets. The fact that we obtained the same coefficient (for X) using the polynomial demonstrates that the quadratic term was indeed uncorrelated to the linear term. The improvement in the fit from the quadratic polynomial applied to the nonlinear data indicated that the square term was indeed an important factor in fitting that data. In fact, including the quadratic term gives well-nigh a perfect fit to that data set, limited only by the computer truncation precision. The coefficient obtained for the quadratic term is comparable in magnitude to the one for the linear term, as we might expect from the amount of curvature of the line we see in Anscombe’s plot [7]. The coefficient of the quadratic term for the “normal” data, on the other hand, is much smaller than for the linear term.

Table 66-1 The results of applying the new method of detecting nonlinearity to Anscombe’s data sets, both the linear and the nonlinear, as described in the text

Parameter      Coefficient when     t-value when         Coefficient using    t-value using
               using only linear    using only linear    square term          square term
               term                 term

Results for nonlinear data
Constant       3.000                                     4.268
Linear term    0.500                4.24                 0.5000               3135.5
Square term    -----                -----                −0.1267              −2219.2
SEE            1.237                                     0.0017
R              0.816                                     1.0

Results for normal data
Constant       3.000                                     3.316
Linear term    0.500                4.24                 0.500                4.1
Square term    -----                -----                −0.0316              −0.729
SEE            1.237                                     1.27
R              0.816                                     0.8291


As we expected, furthermore, for the “normal”, linear data, the t-value for the quadratic term is not statistically significant. This demonstrates our contention that this method of testing linearity is indeed capable of distinguishing the two cases, in a manner that is statistically justifiable. The performance statistics, the SEE and the correlation coefficient, show that including the square term in the fitting function for Anscombe’s nonlinear data set gives, as we noted above, essentially a perfect fit. It is clear that the values of the coefficients obtained are the ones he used to generate the data in the first place. The very large t-values of the coefficients are indicative of the fact that we are near to having only computer round-off error as operative in the difference between the data he provided and the values calculated from the polynomial that included the second-degree term. Thus this new test also provides all the statistical tests that the current FDA/ICH test procedure recommends, and it also provides information as to whether, and how well, the analytical method gives a good fit of the test results to the actual concentration values. It can distinguish between different types of nonlinearities, if necessary, while simultaneously evaluating the overall goodness of the fitting function. As the results from applying it to the Anscombe data show, it is eminently suited to evaluating the linearity characteristics of small data sets as well as large ones.
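The test applied to the Anscombe data can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the program used to produce Table 66-1: the function name `quadratic_linearity_test` is hypothetical, the data values are those of Anscombe's first and second data sets as published in [7], and the sketch exploits the fact that X and $(X - Z)^2$ are uncorrelated, so that each slope and its standard error can be computed as in a simple regression instead of by a general MLR routine.

```python
import math

# Anscombe's X-values with the Y-values of his first ("normal") and
# second (curved) data sets, from reference [7].
X = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y_NORMAL = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
Y_CURVED = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def quadratic_linearity_test(x, y):
    """Fit Y = b0 + b1*X + b2*(X - Z)^2 and return coefficients and t-values."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Equation 66-11: the shift that makes (X - Z)^2 uncorrelated with X.
    z = (sum(xi ** 2 * (xi - x_mean) for xi in x)
         / (2.0 * sum((xi - x_mean) ** 2 for xi in x)))
    w = [(xi - z) ** 2 for xi in x]
    w_mean = sum(w) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sww = sum((wi - w_mean) ** 2 for wi in w)
    # Because X and W are uncorrelated, each slope is a simple regression slope.
    b1 = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    b2 = sum((wi - w_mean) * (yi - y_mean) for wi, yi in zip(w, y)) / sww
    b0 = y_mean - b1 * x_mean - b2 * w_mean
    sse = sum((yi - (b0 + b1 * xi + b2 * wi)) ** 2
              for xi, wi, yi in zip(x, w, y))
    see = math.sqrt(sse / (n - 3))        # three fitted parameters
    return {"b0": b0, "b1": b1, "b2": b2, "see": see,
            "t1": b1 / (see / math.sqrt(sxx)),
            "t2": b2 / (see / math.sqrt(sww))}


normal = quadratic_linearity_test(X, Y_NORMAL)
curved = quadratic_linearity_test(X, Y_CURVED)
print(normal)
print(curved)
```

The quantity of interest is `t2`, the t-value of the square term: it is small in magnitude for the “normal” data and very large for the curved data, which is the distinction the test relies on. Comparing `t2` against a tabulated critical t-value for n − 3 degrees of freedom is left to the reader.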

APPENDIX A: DERIVATION AND DISCUSSION OF THE FORMULA IN EQUATION 66-11

Starting with a set of data values $X_i$, we want to create a set of other values from these $X_i$ such that the squares of those values are uncorrelated to the $X_i$ themselves. We do this by subtracting a value Z from each of the $X_i$, and finding a suitable value of Z so that the set of values $(X_i - Z)^2$ is uncorrelated with the $X_i$. From the definition of the correlation coefficient, then, this means that the following must hold:

$$\frac{\sum_i \left(X_i - \bar{X}\right) \left(X_i - Z\right)^2}{\sqrt{\sum_i \left(X_i - \bar{X}\right)^2 \sum_i \left[\left(X_i - Z\right)^2 - \overline{\left(X_i - Z\right)^2}\right]^2}} = 0 \tag{66-A1}$$

Multiplying both sides of equation 66-A1 by the denominator of the LHS of equation 66-A1 results in the much-simplified expression:

$$\sum_i \left(X_i - \bar{X}\right) \left(X_i - Z\right)^2 = 0 \tag{66-A2}$$

We now need to solve this expression for Z. We begin by expanding the square term:

$$\sum_i \left(X_i - \bar{X}\right) \left(X_i^2 - 2 X_i Z + Z^2\right) = 0 \tag{66-A3}$$

We then multiply through:

$$\sum_i \left[ X_i^2 \left(X_i - \bar{X}\right) - 2 X_i Z \left(X_i - \bar{X}\right) + Z^2 \left(X_i - \bar{X}\right) \right] = 0 \tag{66-A4a}$$

distributing the summations and bringing constants outside the summations:

$$\sum_i X_i^2 \left(X_i - \bar{X}\right) - 2Z \sum_i X_i \left(X_i - \bar{X}\right) + Z^2 \sum_i \left(X_i - \bar{X}\right) = 0 \tag{66-A4b}$$

Since $\sum_i \left(X_i - \bar{X}\right) = 0$, the last term in equation 66-A4b vanishes, leaving

$$\sum_i X_i^2 \left(X_i - \bar{X}\right) - 2Z \sum_i X_i \left(X_i - \bar{X}\right) = 0 \tag{66-A5}$$

Equation 66-A5 is now readily rearranged to solve for Z:

$$Z = \frac{\sum_i X_i^2 \left(X_i - \bar{X}\right)}{2 \sum_i X_i \left(X_i - \bar{X}\right)} \tag{66-A6}$$

Equation 66-A6 appears to differ from the expression in Daniel and Wood [9], in that the denominator expressions differ. To show that they are equivalent, we start with the denominator term of the expression on p. 121 of [9]:

$$\sum_i \left(X_i - \bar{X}\right)^2 \tag{66-A7}$$

Again, we expand this expression:

$$\sum_i X_i^2 - 2 \sum_i X_i \bar{X} + \sum_i \bar{X}^2 \tag{66-A8}$$

and separating and collecting terms:

$$\sum_i X_i^2 - \sum_i X_i \bar{X} + \sum_i \left(\bar{X}^2 - X_i \bar{X}\right) \tag{66-A9}$$

Rearranging the last term in the expression:

$$\sum_i X_i^2 - \sum_i X_i \bar{X} + \bar{X} \sum_i \left(\bar{X} - X_i\right) \tag{66-A10}$$

And we find that again, the last term in equation 66-A10 vanishes since $\sum_i \left(X_i - \bar{X}\right) = 0$, leaving:

$$\sum_i X_i^2 - \sum_i X_i \bar{X} \tag{66-A11}$$

And upon combining the summations and factoring out $X_i$:

$$\sum_i X_i \left(X_i - \bar{X}\right) \tag{66-A12}$$

which is thus seen to be the same as the denominator term we derived in equation 66-A6: QED
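The equivalence of the two denominators can also be confirmed numerically. The following short check is a sketch only; the X-values are arbitrary, hypothetical numbers chosen to be asymmetric, and the variable names are introduced here for illustration.

```python
# Arbitrary, deliberately asymmetric (hypothetical) X-values.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
x_mean = sum(x) / len(x)

numerator = sum(xi ** 2 * (xi - x_mean) for xi in x)
den_a6 = sum(xi * (xi - x_mean) for xi in x)     # summation in the denominator of 66-A6
den_dw = sum((xi - x_mean) ** 2 for xi in x)     # summation in 66-A7 (Daniel and Wood)

z_a6 = numerator / (2.0 * den_a6)
z_dw = numerator / (2.0 * den_dw)
print(den_a6, den_dw, z_a6, z_dw)
```

Both summations evaluate to the same number, so the two expressions for Z agree.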


By similar means we can derive expressions that will create transformations of other powers of the X-variable that make the corresponding power uncorrelated to the X-variable itself. Thus, analogously to equation 66-A2, if we wish to find a quantity $Z_3$ that will make $(X_i - Z_3)^3$ be uncorrelated with X, we set up the expression:

$$\sum_i \left(X_i - \bar{X}\right) \left(X_i - Z_3\right)^3 = 0 \tag{66-A13}$$

which, after expanding the cube and dropping the term in $Z_3^3$ (which is multiplied by $\sum_i (X_i - \bar{X}) = 0$), provides the following polynomial in $Z_3$:

$$\sum_i X_i^3 \left(X_i - \bar{X}\right) - 3 Z_3 \sum_i X_i^2 \left(X_i - \bar{X}\right) + 3 Z_3^2 \sum_i X_i \left(X_i - \bar{X}\right) = 0 \tag{66-A14}$$

Equation 66-A14 is quadratic in $Z_3$, and thus, after evaluating the summations, is easily solved through use of the Quadratic Formula. Similarly, for fourth powers we set up the expression:

$$\sum_i \left(X_i - \bar{X}\right) \left(X_i - Z_4\right)^4 = 0 \tag{66-A15}$$

which gives

$$\sum_i X_i^4 \left(X_i - \bar{X}\right) - 4 Z_4 \sum_i X_i^3 \left(X_i - \bar{X}\right) + 6 Z_4^2 \sum_i X_i^2 \left(X_i - \bar{X}\right) - 4 Z_4^3 \sum_i X_i \left(X_i - \bar{X}\right) = 0 \tag{66-A16}$$

Again, equation 66-A16 is cubic in $Z_4$ and can be solved by algebraic methods. For higher powers of the variable we can derive similar expressions. Beyond the fifth power the resulting polynomial in Z is of degree five or higher, for which general algebraic methods are no longer available to solve for the $Z_i$, but after evaluating the summations, computerized approximation methods can be used. Thus the contribution of any power of the X-variable to the nonlinearity of the data can be similarly tested by these means.

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 20(4), 38–39 (2005).
2. Mark, H. and Workman, J., Spectroscopy 20(1), 56–59 (2005).
3. Mark, H. and Workman, J., Spectroscopy 20(3), 34–39 (2005).
4. Mark, H., Journal of Pharmaceutical and Biomedical Analysis 33, 7–20 (2003).
5. Arden, B.W., An Introduction to Digital Computing, 1st ed. (Addison-Wesley Publishing Co., Inc., Reading, MA, 1963).
6. Mark, H. and Workman, J., Spectroscopy 18(12), 106–111 (2003).
7. Anscombe, F.J., The American Statistician 27, 17–21 (1973).
8. Mark, H. and Workman, J., Spectroscopy 18(9), 25–28 (2003).
9. Daniel, C. and Wood, F., Fitting Equations to Data – Computer Analysis of Multifactor Data for Scientists and Engineers, 1st ed. (John Wiley & Sons, 1971).
10. Mark, H., Principles and Practice of Spectroscopic Calibration (John Wiley & Sons, New York, 1991).
11. Draper, N. and Smith, H., Applied Regression Analysis, 3rd ed. (John Wiley & Sons, New York, 1998).


67

Linearity in Calibration: Act III Scene V – Quantifying Nonlinearity

In Chapters 63–66 [1–4], we discussed shortcomings of current methods used to assess the presence of nonlinearity in data, and presented a new method that addresses those shortcomings. This new method is statistically sound, provides an objective means to determine if nonlinearity is present in the relationship between two sets of data, and is inherently suitable for implementation as a computer program. A shortcoming of the method presented is one that it has in common with virtually all statistical tests: while it provides a means of unambiguously and objectively determining the presence of nonlinearity, if we find that nonlinearity is present, it does not address the question of how much nonlinearity is present. This chapter therefore presents results from some computer experiments designed to assess a method of quantifying the amount of nonlinearity present in a data set, assuming that the test for the presence of nonlinearity has already been applied and found that indeed, a measurable, statistically significant degree of nonlinearity exists.

The spectroscopic community, and indeed the chemical community at large, is not the only group of scientists concerned with these issues. Other scientific disciplines also are concerned with ways to evaluate methods of chemical analysis. Notable among them are the pharmaceutical communities and the clinical chemistry communities. In those communities, considerations of the sort we are addressing are even more important, for at least two reasons:

1) These disciplines are regulated by governmental agencies, especially the Food and Drug Administration. In fact, it was considerations of the requirements of a regulatory agency that created the impetus for this series of chapters in the first place [1].
2) The second reason is what drives the whole effort of ensuring that everything that is done, is done “right”: an error in an analytical result can conceivably, in literal fact, cause illness or even death to occur.
Thus the clinical chemistry community has also investigated issues such as the linearity of the relationship between test results and actual chemical composition, and an interesting article provides the impetus for creating a method of assessing the degree of nonlinearity present in the relationship between two sets of data [5]. The basis for this calculation of the amount of nonlinearity is illustrated in Figure 67-1. In Figure 67-1a we see a set of data, subject to both random error and nonlinearity, showing some nonlinearity between the test results and the actual values, together with the different ways linear and quadratic polynomials fit the data. If a straight line and a quadratic polynomial are both fit to the data, then the difference between the predicted values from the two curves gives a measure of the amount of nonlinearity.

[Figure 67-1(a): plot of Result versus Concentration, showing the data with the linear fit and the quadratic fit.]

Figure 67-1(a) An illustration of the method of measuring the amount of nonlinearity showing hypothetical synthetic data to which each of the functions is fit.

As shown in Figure 67-1a, at any given point, there is a difference between the two functions which represents the difference between the Y-values corresponding to a given X-value. Figure 67-1b shows that irrespective of the random error of the data, the difference between the two functions depends only on the nature of the functions and can be calculated from the difference between the Y-values corresponding to each X-value. If there is no nonlinearity at all, then the two functions will coincide, and all the differences will be zero. Increasing amounts of nonlinearity will cause increasingly large differences between the values of the two functions corresponding to each X-value, and these can be used to calculate the nonlinearity.

[Figure 67-1(b): plot of Result versus Concentration, showing the linear fit and the quadratic fit, with the differences marked at the X-values Xi and Xn.]

Figure 67-1(b) The functions, without the data, showing the differences between the functions at two values of X. The circles show the value of the straight line, the crosses show the value of the quadratic function at the given values of X.

The calculation used is the calculation of the sum of squares of the differences [5]. This calculation is normally applied to situations where random variations are affecting the data, and, indeed, is the basis for many of the statistical tests that are applied to random data. However, the formalism of partitioning the sums of squares, which we have previously discussed [6] (also in [7], p. 81 in the first edition or p. 83 in the second edition), can be applied to data where the variations are due to systematic effects rather than random effects. The difference is that the usual statistical tests (t, $\chi^2$, F, etc.) do not apply to variations from systematic causes because they do not follow the required statistical distributions. Therefore it is legitimate to perform the calculation, as long as we are careful how we interpret the results. Performing the calculations on functions fitted to the raw data has another ramification: the differences, and therefore the sums of squares, will depend on the units that the Y-values are expressed in. It is preferable that functions with similar appearances give the same computed value of nonlinearity regardless of the scale. Therefore the sum-of-squares of the differences between the linear and the quadratic functions fitted to the data is divided by the sum-of-squares of the Y-values that fall on the straight line fitted to the data. This cancels the units, and therefore the dependency of the calculation on the scale. A further consideration is that the value of the calculated nonlinearity will depend not only on the function that fits the data; we suspect that it will also depend on the distribution of the data along the X-axis.
Therefore, for pedagogical purposes, here we will consider the situation for two common data distributions: the uniform distribution and the Normal (Gaussian) distribution. Figure 67-2 presents some quadratic curves containing various amounts of nonlinearity. These curves represent data that was, of course, created synthetically. The purpose of generating these curves was for us to be able to compare the visual appearance of curves containing known amounts of nonlinearity with the numerical values for the various test parameters that describe the curves. Figure 67-2 represents data having a uniform distribution of X-values, although, of course, data with a different distribution of X-values would follow the same curves. The curves were generated as follows: 101 values of a uniformly distributed variable (used as the X-variable) were generated by creating a set of numbers from 0 to 1 at steps of 0.01. The Y-values for each curve were generated by calculating the Y-value from the corresponding X-value according to the following formula:

$$Y = X - kX^2 + kX \tag{67-1}$$
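Equation 67-1 and the nonlinearity calculation described above can be explored with a short sketch in plain Python. Several choices here are assumptions made for illustration: the function name `nonlinearity_measure` is hypothetical; because the synthetic data are an exact quadratic, the quadratic fit is taken to be the data themselves; and the normalising sum-of-squares is taken about the mean of the straight-line values, which is one reading of the description in the text and may not reproduce tabulated values to the last digit.

```python
def nonlinearity_measure(k, n_points=101):
    """Return (slope, intercept, ss_diff, ratio) for the curve of equation 67-1."""
    x = [i / (n_points - 1) for i in range(n_points)]   # 0 to 1 in steps of 0.01
    y = [xi - k * xi ** 2 + k * xi for xi in x]         # equation 67-1
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n

    # Straight-line least-squares fit.
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    intercept = y_mean - slope * x_mean
    linear_fit = [intercept + slope * xi for xi in x]

    # The data are noise-free and exactly quadratic, so the least-squares
    # quadratic fit reproduces the Y-values themselves.
    quadratic_fit = y

    ss_diff = sum((q - l) ** 2 for q, l in zip(quadratic_fit, linear_fit))
    fit_mean = sum(linear_fit) / n
    ss_line = sum((l - fit_mean) ** 2 for l in linear_fit)   # assumed normalisation
    return slope, intercept, ss_diff, ss_diff / ss_line


# k from 0 to 2 in steps of 0.2.
results = {k / 10: nonlinearity_measure(k / 10) for k in range(0, 21, 2)}
for k_value, (slope, intercept, ss_diff, ratio) in sorted(results.items()):
    print(k_value, round(slope, 4), round(intercept, 4),
          round(ss_diff, 4), round(ratio, 4))
```

The ratio is zero for k = 0 and grows steadily with k, so the computed value runs parallel to the amount of curvature in the data.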

The parameter k in equation 67-1 induces a varying amount of nonlinearity in the curve. For the curves in Figure 67-2, k varied from 0 to 2 in steps of 0.2. The subtraction of the quadratic term in equation 67-1 gives the curves their characteristic of being convex upward, while adding the term kX back in ensures that all the curves, and the straight line, meet at zero and at unity.

[Figure 67-2: family of curves for k = 0 to k = 2.0, plotted for X from 0 to 1.]

Figure 67-2 Curves illustrating varying amounts of nonlinearity.

Table 67-1 presents the results of computing the linearity evaluation results for the curves shown in Figure 67-2, for the case of a uniform distribution of data along the X-axis. It presents the coefficients of the linear models (straight lines) fitted to the several curves of Figure 67-2, the coefficients of the quadratic model, the sum-of-squares of the differences between the fitted points from the two models, and the ratio of the sum-of-squares of the differences to the sum-of-squares of the X-data itself, which, as we said above, is the measure of nonlinearity. Table 67-1 also shows the value of the correlation coefficient between the linear fit and the quadratic fit to the data, and the square of the correlation coefficient. In Table 67-1 we see an interesting result: the ratio of sums of squares we are using for the linearity measure is equal to 1 (unity) minus the square of the computed correlation coefficient value between the linear fit and the quadratic fit to the data. This should not surprise us. As noted above, the same formalisms that apply to random data can also be applied to data where the differences are systematic. Therefore, the equality we see here corresponds to the well-known property of sums of squares from any regression analysis, that from the analysis of variance of the regression results, the correlation coefficient is related to the sum-squared-error of the analysis in a similar way (see, for example, p. 17 in [8]). It is also interesting to note that the coefficients of the models resulting from the calculations on the data (shown in Figure 67-2) are not the same as the original generating functions for the data. This is because the generating functions (from equation 67-1) are not the best-fitting functions (nor, as we shall see, are they the orthogonalized functions), which is what is used to create the models, and the predicted values from the models. Since the correlation coefficient is an already-existing and known statistical function, why is there a need to create a new calculation for the purpose of assessing nonlinearity?
First, the correlation coefficient’s roots in Statistics direct the mind to the random aspects of the data that it is normally used for. In contrast, therefore, using the ratio of the sum of squares helps keep us reminded that we are dealing with a systematic effect whose magnitude we are trying to measure, rather than a random effect for which we want to ascertain statistical significance.

Table 67-1 Uniform data distribution

k (in eq.   Linear fit          Quadratic fit                 Sum-of-squares   Linearity   Corr.     Square of
67-1)       b1        b0        b2        b1        b0        of diffs         measure     coeff.    corr. coeff.

0           1         0         0         1         0         0                0           1         1
0.2         1         0.033     −0.2      1.2       0         0.0334           0.0026      0.9986    0.9972
0.4         1         0.066     −0.4      1.4       0         0.0337           0.0107      0.9946    0.9892
0.6         1         0.099     −0.6      1.6       0         0.2100           0.0242      0.9879    0.9761
0.8         1         0.132     −0.8      1.8       0         0.3735           0.0430      0.9789    0.9583
1.0         1         0.165     −1.0      2.0       0         0.5836           0.0673      0.9676    0.9363
1.2         1         0.198     −1.2      2.2       0         0.8403           0.0969      0.9543    0.9108
1.4         1         0.231     −1.4      2.4       0         1.1438           0.1319      0.9393    0.8824
1.6         1         0.264     −1.6      2.6       0         1.4940           0.1723      0.9229    0.8517
1.8         1         0.297     −1.8      2.8       0         1.8908           0.2180      0.9052    0.8195
2.0         1         0.33      −2.0      3.0       0         2.3344           0.2692      0.8866    0.7862

Linear fit: b1 is the slope and b0 the intercept. Quadratic fit: b2 is the quadratic term, b1 the linear slope and b0 the intercept. The linearity measure is the ratio of sums of squares.
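The quantities tabulated in Table 67-1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' program. It assumes the synthetic data follow y = (1 + k)x − kx², which is the form implied by the quadratic coefficients in the table (equation 67-1 itself is not reproduced here), and it assumes 101 equally spaced X-values between 0 and 1. The function name and the dictionary keys are hypothetical. Because the exact denominator used for the published ratio is hard to pin down from the text alone, the sketch reports the sum of squared differences divided both by the spread of the linear-fit values and by the total spread of the quadratic-fit values. The second ratio equals one minus the squared correlation coefficient, by the usual analysis-of-variance identity.

```python
import numpy as np


def linearity_quantities(x, k):
    """Fit straight-line and quadratic models to noise-free synthetic data.

    Assumption: y = (1 + k)*x - k*x**2, the form implied by the quadratic
    coefficients listed in Table 67-1 (b0 = 0, b1 = 1 + k, b2 = -k).
    """
    y = (1.0 + k) * x - k * x**2

    k1, k0 = np.polyfit(x, y, 1)       # straight line: slope, intercept
    a2, a1, a0 = np.polyfit(x, y, 2)   # quadratic: x**2, x, constant terms

    y_lin = k0 + k1 * x
    y_quad = a0 + a1 * x + a2 * x**2

    ss_diff = np.sum((y_quad - y_lin) ** 2)           # squared differences between the fits
    ss_lin = np.sum((y_lin - y_lin.mean()) ** 2)      # spread of the linear-fit values
    ss_quad = np.sum((y_quad - y_quad.mean()) ** 2)   # spread of the quadratic-fit values
    r = np.corrcoef(y_lin, y_quad)[0, 1]              # correlation between the two fits

    return {
        "linear": (k0, k1),
        "quadratic": (a0, a1, a2),
        "ss_diff": ss_diff,
        "ratio_to_linear": ss_diff / ss_lin,
        "ratio_to_total": ss_diff / ss_quad,
        "r": r,
    }


x = np.linspace(0.0, 1.0, 101)   # 101 uniformly spaced X-values (assumed)
for k in (0.2, 1.0, 2.0):
    q = linearity_quantities(x, k)
    print(k, q["linear"], q["ss_diff"], q["ratio_to_total"], 1.0 - q["r"] ** 2)
```

The last two printed columns agree, which is the equality between the ratio of sums of squares and one minus the squared correlation discussed above.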

Secondly, as a measure of nonlinearity, the calculation conforms more closely to that concept than the correlation coefficient does. As a contrast, we can consider terms such as precision and accuracy, where “high precision” and “high accuracy” mean data with small values of <whatever measure is used, such as standard deviation> while “low precision” and “low accuracy” mean large values of the measure. Thus, for those two characteristics, the measured value changes in opposition to the concept. If we were to use the correlation coefficient calculation as the measure of nonlinearity, we would have the same situation. However, by defining the “linearity” calculation the way we did, the calculation now runs parallel to the concept: a calculated value of zero means “no nonlinearity” while increasing values of the calculation correspond to increasing nonlinearity.

Another interesting comparison is between the coefficients for the functions representing the best-fitting models for the data and the coefficients for the functions that result from performing the linearity test as described in the previous chapter [4]. We have not looked at these before since they are not directly involved in the linearity test. Now, however, we consider them for their pedagogic interest. These coefficients, for the case of testing a quadratic nonlinearity of the data from Figure 67-1, are listed in Table 67-2. We note that the coefficients for the quadratic terms are the same in both cases. However, the best-fitting functions have a constant intercept and varying slopes, while the functions based on the orthogonalized quadratic term have a constant slope and varying intercept.

We now take a look at the linearity values obtained when the X-data is Normally distributed. The nonlinearity used is the same as we used above for the case of uniformly distributed data, and the same diagram (Figure 67-2) applies, so we need not reproduce it.
The difference is that the X-data is Normally distributed, so that there are more samples at X = 0.5 than at the extremes of the range of Figure 67-2, the falloff varying appropriately. The standard deviation of the X-values used was 0.2, so that the ends of the range corresponded to ±2.5 standard deviations. Again, synthetic data at the same 101 values of X were generated. In this case, however, multiple data at each X-value were created, the number of data at each X-value being proportional to the value of the Normal distribution corresponding to that X-value. The total number of data points generated, therefore, was 5,468.

Table 67-2 Coefficients for orthogonalized functions

k (in equation 67-1)   b2 (quadratic term)   b1 (linear slope)   b0 (intercept)

0                      0                     1                   0
0.2                    −0.2                  1                   0.05
0.4                    −0.4                  1                   0.1
0.6                    −0.6                  1                   0.15
0.8                    −0.8                  1                   0.2
1.0                    −1.0                  1                   0.25
1.2                    −1.2                  1                   0.3
1.4                    −1.4                  1                   0.35
1.6                    −1.6                  1                   0.4
1.8                    −1.8                  1                   0.45
2.0                    −2.0                  1                   0.5
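The pattern in Table 67-2 (constant slope, intercept growing with k) can be illustrated with a short Python sketch. It rests on two assumptions that are not taken from the text: that the synthetic data follow y = (1 + k)x − kx², and that the "orthogonalized" quadratic term can be represented by the mean-centred square (X − X̄)², which is uncorrelated with X when the X-values are symmetric about their mean. The function name is hypothetical, and the procedure of the previous chapter may differ in detail.

```python
import numpy as np


def fit_with_centred_square(x, y):
    """Least-squares fit of y on [1, x, (x - mean(x))**2]; returns (b0, b1, b2)."""
    centred_square = (x - x.mean()) ** 2
    design = np.column_stack([np.ones_like(x), x, centred_square])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return tuple(coeffs)


x = np.linspace(0.0, 1.0, 101)          # 101 uniformly spaced X-values (assumed)
for k in (0.2, 1.0, 2.0):
    y = (1.0 + k) * x - k * x**2        # assumed form of the synthetic data
    b0, b1, b2 = fit_with_centred_square(x, y)
    print(f"k={k}: b0={b0:.3f}  b1={b1:.3f}  b2={b2:.3f}")
```

Under these assumptions the slope stays at 1 and the quadratic coefficient equals −k, while the intercept moves with k, in line with the comparison drawn in the text.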

We can compare the values in Table 67-3 with those in Table 67-1: the coefficients of the models are almost the same. The coefficients for the quadratic model are, unsurprisingly, identical in all cases, since the data values are identical and there is no random error. The main difference in the linear model is the value of the intercept, reflecting the higher average value of the Y-data resulting from the center of the curves being more heavily weighted. The sums-of-squares are of necessity larger, simply because there are more data points contributing to this sum.

The interesting (and important) difference is in the values for the ratio of sums-of-squares, which is the nonlinearity measure. As we see, at small values of nonlinearity (i.e., at the smallest values of k) the values for the nonlinearity are almost the same. As k increases, however, the value of the nonlinearity measure decreases for the case of Normally distributed data, as compared to the uniformly distributed data, and the discrepancy between the two gets greater as k continues to increase. In retrospect, this should also not be surprising, since in the Normally distributed case more data is near the center of the plot, and therefore in a region where the local nonlinearity is smaller than the nonlinearity over the full range. Therefore the Normally distributed data is less subject to the effects of the nonlinearity at the wings, since less of the data is there.

As a quantification of the amount of nonlinearity, we see that when we compare the values of the nonlinearity measure between Tables 67-1 and 67-3, they differ. This indicates that the test is sensitive to the distribution of the data. Furthermore, the disparity increases as the amount of curvature increases. Thus this test, as it stands, is not completely satisfactory, since the test value does not depend solely on the amount of nonlinearity, but also on the data distribution.

In our next chapter we will consider a modification of the test that will address this issue.

Table 67-3 Normal data distribution

k (in eq.   Linear fit          Quadratic fit                 Sum-of-squares   Linearity
67-1)       b1        b0        b2        b1        b0        of diffs         measure

0           1         0         0         1         0         0                0
0.2         1         0.0414    −0.2      1.2       0         0.6000           0.0025
0.4         1         0.0829    −0.4      1.4       0         2.3996           0.0102
0.6         1         0.1243    −0.6      1.6       0         5.3991           0.0230
0.8         1         0.1658    −0.8      1.8       0         9.5984           0.0410
1.0         1         0.2072    −1.0      2.0       0         14.997           0.0641
1.2         1         0.2487    −1.2      2.2       0         21.596           0.0923
1.4         1         0.2901    −1.4      2.4       0         29.395           0.1257
1.6         1         0.3316    −1.6      2.6       0         38.393           0.1642
1.8         1         0.3730    −1.8      2.8       0         48.592           0.2078
2.0         1         0.4145    −2.0      3.0       0         59.990           0.2566

Linear fit: b1 is the slope and b0 the intercept. Quadratic fit: b2 is the quadratic term, b1 the linear slope and b0 the intercept. The linearity measure is the ratio of sums of squares.
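The dependence of the measure on the data distribution can be illustrated with the following Python sketch, which compares uniformly spaced X-values with X-values replicated in proportion to a Normal density (mean 0.5, standard deviation 0.2, as in the text). Several things here are assumptions rather than details from the text: the data form y = (1 + k)x − kx², the use of the linear-fit sum of squares as the denominator, the rounding of the replicate counts, and the proportionality constant `scale`, which is hypothetical. The numbers it prints should therefore not be expected to match Table 67-3 digit for digit; only the direction of the effect is the point.

```python
import numpy as np


def nonlinearity_measure(x, y):
    """Sum of squared (quadratic fit - linear fit) divided by the sum of
    squares of the linear-fit values about their mean."""
    k1, k0 = np.polyfit(x, y, 1)
    a2, a1, a0 = np.polyfit(x, y, 2)
    y_lin = k0 + k1 * x
    diff = (a0 + a1 * x + a2 * x**2) - y_lin
    return np.sum(diff**2) / np.sum((y_lin - y_lin.mean()) ** 2)


def normal_replicated_x(grid, mean=0.5, sd=0.2, scale=55.0):
    """Repeat each grid value in proportion to the Normal density there.

    `scale` is a hypothetical proportionality constant, not a value taken
    from the text; counts are rounded to whole numbers.
    """
    density = np.exp(-0.5 * ((grid - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))
    counts = np.rint(scale * density).astype(int)
    return np.repeat(grid, counts)


def curve(x, k):
    """Assumed form of the synthetic data."""
    return (1.0 + k) * x - k * x**2


grid = np.linspace(0.0, 1.0, 101)
x_normal = normal_replicated_x(grid)
for k in (0.2, 1.0, 2.0):
    m_uniform = nonlinearity_measure(grid, curve(grid, k))
    m_normal = nonlinearity_measure(x_normal, curve(x_normal, k))
    print(f"k={k}: uniform={m_uniform:.4f}  normal={m_normal:.4f}")
```

With the same underlying curve, the Normally weighted data give a smaller value than the uniform data, and the absolute gap widens as k grows.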

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 20(1), 56–59 (2005).
2. Mark, H. and Workman, J., Spectroscopy 20(3), 34–39 (2005).
3. Mark, H. and Workman, J., Spectroscopy 20(4), 38–39 (2005).
4. Mark, H. and Workman, J., Spectroscopy 20(9), 26–35 (2005).
5. Kroll, M.H. and Emancipator, K., Clinical Chemistry 39(3), 405–413 (1993).
6. Workman, J. and Mark, H., Spectroscopy 3(3), 40–42 (1988).
7. Mark, H. and Workman, J., Statistics in Spectroscopy, 1st ed. (Academic Press, New York, 1991).
8. Daniel, C. and Wood, F., Fitting Equations to Data – Computer Analysis of Multifactor Data for Scientists and Engineers, 1st ed. (John Wiley & Sons, New York, 1971).

68

Linearity in Calibration: Act III Scene VI – Quantifying Nonlinearity, Part II, and a News Flash

In Chapters 63 through 67 [1–5], we devised a test for the amount of nonlinearity present in a set of comparative data (e.g., as are created by any of the standard methods of calibration for spectroscopic analysis), and then discovered a flaw in the method. The concept of a measure of nonlinearity that is independent of the units that the X and Y data have is a good one. The flaw is that the nonlinearity measurement depends on the distribution of the data; uniformly distributed data will provide one value, Normally distributed data will provide a different value, randomly distributed data (i.e., what is commonly found in “real” data sets) will give still a different value, and so forth, even if the underlying relationship between the pairs of values is the same in all cases.

“Real” data, in fact, may not follow any particular describable distribution at all. Or the data may not be sufficient to determine what distribution it does follow, if any. But does that matter? At the point we have reached in our discussion, we have already determined that the data under investigation does indeed show a statistically significant amount of nonlinearity, and we have developed a way of characterizing that nonlinearity in terms of the coefficients of the linear and quadratic contributions to the functional form that describes the relationship between the X and Y values. Our task now is to come up with a way of quantifying the amount of nonlinearity the data exhibits, independent of the scale (i.e., units) of either variable, and even independent of the data itself.

Our method of addressing this task is not unique; there are other ways to reach the goal. But we will base our solution on the methodology we have already developed.
We do this by noting that the first condition is met by converting the nonlinear component of the data to a dimensionless number (i.e., a statistic), akin to but different from the correlation coefficient, as we showed in our previous chapter, first published as [5]. The second condition can be met by simply ignoring the data itself, once we have reached this point. What we need is a standard way to express the data so that when the statistic is computed, the standard data expression will give rise to a given value of the statistic, regardless of the nature of the original data.

For this purpose, then, it would suffice to replace the original data with a set of synthetic data with the necessary properties. What are those properties? The key properties comprise the number of data values, the range of the data values and their distribution.

The range of the synthetic data we want to generate should be such that the X-values have the same range as the original data. The reason for this is obvious: when we apply the empirically derived quadratic function (found from the regression) to the data, to compute the Y-values, those should fall on the same line, and in the same relationship to the X as the original data did.

Choosing the distribution is a little more nebulous. However, a uniform distribution is not only easy to compute, but it also will neither go outside the specified range nor will the range change with the number of samples, as data following other distributions might (see, for example, reference [6], or Chapter 6 in [7], where we discussed the relationship between the range and the standard deviation for the Normal distribution when the number of data differ, although our discussion was in a different context). Therefore, in the interest of having the range and the nonlinearity measure be independent of the number of readings, we should generate data following a uniform distribution.

The number of data points to generate in order to get an optimum value for the statistic is not obvious. Intuition indicates that the value of the statistic may very well be a function of the number of data points used in its calculation. At first glance, this would also seem to be a “showstopper” in the use of this statistic for the purpose of quantifying nonlinearity. However, intuition also indicates that even so, use of “sufficiently many” data points will give a stable value, since “sufficiently many” eventually becomes an approximation to “infinity”, and therefore even in such a case the statistic will at least tend toward an asymptotic value as more and more data points are used.

Since we have already extracted the necessary information from the actual data itself, computations from this point onward are simply a computer exercise, needing no further input from the original data set. Therefore, in fact, the number of points to generate is a consideration that itself needs to be investigated. We do so by generating data with controlled amounts of nonlinearity as we did previously [5] and filling in the range of the X-values with varying numbers of data points (uniformly spaced), computing the corresponding Y-values (according to the computed values for the coefficients of the quadratic equation) and then the statistic we described [5].
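The procedure just described can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' program: the quadratic coefficients are supplied directly (here a0 = 0, a1 = 1 + k, a2 = −k, a form assumed from the earlier tables), the X-range is taken as 0 to 1, and the statistic is taken to be the sum of squared differences between the quadratic and linear fits divided by the sum of squares of the linear-fit values. The function names are hypothetical. Since the exact normalization behind the published values is not spelled out here, the printed numbers need not match the book's table; the point is only how the value settles down as the number of points grows.

```python
import numpy as np


def nonlinearity_measure(x, y):
    """Sum of squared (quadratic fit - linear fit) divided by the sum of
    squares of the linear-fit values about their mean."""
    k1, k0 = np.polyfit(x, y, 1)
    a2, a1, a0 = np.polyfit(x, y, 2)
    y_lin = k0 + k1 * x
    diff = (a0 + a1 * x + a2 * x**2) - y_lin
    return np.sum(diff**2) / np.sum((y_lin - y_lin.mean()) ** 2)


def measure_for_n_points(n_points, quad_coeffs, x_low=0.0, x_high=1.0):
    """Statistic computed on n_points uniformly spaced synthetic X-values.

    quad_coeffs = (a0, a1, a2) are the quadratic coefficients already
    obtained from the real data; no other input from the data is used.
    """
    a0, a1, a2 = quad_coeffs
    x = np.linspace(x_low, x_high, n_points)
    y = a0 + a1 * x + a2 * x**2
    return nonlinearity_measure(x, y)


for n_points in (10, 100, 1000, 10000, 100000):
    values = [measure_for_n_points(n_points, (0.0, 1.0 + k, -k)) for k in (0.2, 1.0, 2.0)]
    print(n_points, ["%.4f" % v for v in values])
```

Under these assumptions each column of output approaches a limiting value as the number of points increases.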
We performed this computation for several different combinations of number of data points generated and the value of k, using the nonlinearity term generator from equation 67-1 found in Chapter 67, and present the results in Table 68-1. Although not shown, similar computations were performed for 200,000 and 1,000,000 points. There was no further change in any of the entries, compared to the column corresponding to 100,000 points.

Table 68-1 Table of computed nonlinearity values for varying numbers of simulated samples

k (from eq.
67-1)         N=10     N=100    N=500    N=1000   N=2000   N=5000   N=10000  N=100000

0             0        0        0        0        0        0        0        0
0.1           0.0045   0.0036   0.0035   0.0035   0.0035   0.0035   0.0035   0.0035
0.4           0.0181   0.0142   0.0139   0.0139   0.0138   0.0138   0.0138   0.0138
0.6           0.0408   0.0320   0.0313   0.0312   0.0312   0.0311   0.0311   0.0311
0.8           0.0725   0.0570   0.0556   0.0555   0.0554   0.0553   0.0553   0.0553
1.0           0.1133   0.0890   0.0869   0.0867   0.0866   0.0865   0.0865   0.0865
1.2           0.1632   0.1282   0.1252   0.1248   0.1246   0.1245   0.1245   0.1245
1.4           0.2221   0.1745   0.1704   0.1699   0.1697   0.1695   0.1695   0.1695
1.6           0.2901   0.2279   0.2226   0.2219   0.2216   0.2214   0.2213   0.2213
1.8           0.3671   0.2284   0.2817   0.2809   0.2805   0.2802   0.2801   0.2800
2.0           0.4532   0.3560   0.3478   0.3468   0.3462   0.3459   0.3458   0.3457

As we can see, the nonlinearity value converges to a limit for each value of k as the number of points used to calculate it increases. Furthermore, it converges more slowly when the amount of nonlinearity in the data increases. The results in Table 68-1 are presented to four figures, and to require that degree of convergence means that fully 10,000 points must be generated if the value of k approaches two (or more). Of course, if k is much above two, it might require even more points to achieve this degree of exactness in the convergence. For k = 0.1, however, this same degree of convergence is achieved with only 500 points. Thus, the user must make a trade-off between the amount of computation performed and the exactness of the calculated nonlinearity measure, taking into account the actual amount of nonlinearity in the data. However, if sufficient points are used, the results are stable and depend only on the amount of nonlinearity in the original data set.

Or need the user do anything of the sort? In fact, our computer exercise is just an advanced form of a procedure that we all learned to do in second-term calculus: evaluate a definite integral by successively better approximations, the improvement coming via exactly the route we took, using smaller and smaller intervals at which to perform the numerical integration. By computing the value of a definite integral, we are essentially taking the computation to the limit of an infinite number of data points.

Generating the definite integral to evaluate is in fact a relatively simple exercise at this point, since the underlying functions are algebraic. We recall that the pertinent quantities are

1) The sum of squares of the differences between the linear and the quadratic lines fit to the data
2) The sum of squares of the Y-data linearly related to the X-data.

As we recall from the previous chapter [5], the nonlinearity measure we devised equals the first divided by the second. Let us now develop the formula for this. We will use a subscripted small “a” for the coefficients of the quadratic equation, and a subscripted small “k” for those of the linear equation.
Thus the equation describing the quadratic function fitted to the data is

$$Y_Q = a_0 + a_1 X + a_2 X^2 \tag{68-4}$$

The equation describing the linear function fitted to the data is

$$Y_L = k_0 + k_1 X \tag{68-13}$$

where the $a_i$ and the $k_i$ are values obtained by the least-squares fitting of the quadratic and linear fitting functions, respectively. The differences, then, are represented by

$$D = Y_Q - Y_L = a_0 + a_1 X + a_2 X^2 - (k_0 + k_1 X) \tag{68-14}$$

$$D = (a_0 - k_0) + (a_1 - k_1) X + a_2 X^2 \tag{68-15}$$

and the squares of the differences are

$$D^2 = \left[ (a_0 - k_0) + (a_1 - k_1) X + a_2 X^2 \right]^2 \tag{68-16}$$

which expands to

$$D^2 = (a_0 - k_0)^2 + 2 (a_0 - k_0)(a_1 - k_1) X + (a_1 - k_1)^2 X^2 + 2 a_2 (a_0 - k_0) X^2 + 2 a_2 (a_1 - k_1) X^3 + a_2^2 X^4 \tag{68-17}$$

We can simplify it slightly to a regular polynomial in X:

$$D^2 = (a_0 - k_0)^2 + 2 (a_0 - k_0)(a_1 - k_1) X + \left[ 2 a_2 (a_0 - k_0) + (a_1 - k_1)^2 \right] X^2 + 2 a_2 (a_1 - k_1) X^3 + a_2^2 X^4 \tag{68-18}$$

The denominator term of the required ratio is the square of the linear Y term, according to equation 68-15. The square involved is then:

$$Y_L^2 = (Y - \bar{Y})^2 \tag{68-19}$$

and substituting for each Y the expression for X:

$$Y_L^2 = \left[ (k_0 + k_1 X) - (k_0 + k_1 \bar{X}) \right]^2 \tag{68-20}$$

With a little algebra this can also be put into the form of a regular polynomial in X:

$$Y_L^2 = (k_1 X - k_1 \bar{X})^2 \tag{68-21}$$

$$Y_L^2 = k_1^2 X^2 - 2 k_1^2 X \bar{X} + k_1^2 \bar{X}^2 \tag{68-22}$$

which, unsurprisingly, equals

$$Y_L^2 = k_1^2 (X - \bar{X})^2 \tag{68-23}$$

although we will find equation 68-22 more convenient.

Equations 68-18 and 68-22 represent the quantities whose sums form the required measurement. They are each a simple polynomial in X, whose definite integral is to be evaluated between $X_{low}$ and $X_{high}$, the ends of the range of the data values, in order to calculate the respective sums-of-squares. Despite the apparently complicated form of the coefficients of the various powers of X in equation 68-18, once the $a_i$ and $k_i$ have been determined as described in our previous chapter, they are constants. Therefore the various coefficients of the powers of X are also constants, and may be replaced by a new label; we can use a subscripted small “c” for these. Equation 68-18 then becomes

$$D^2 = c_0 + c_1 X + c_2 X^2 + c_3 X^3 + c_4 X^4 \tag{68-24}$$
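For reference, matching the powers of X in equation 68-24 against those in equation 68-18 gives the following identification of the new constants. This is simply the term-by-term reading of the two equations, written out for convenience:

$$
\begin{aligned}
c_0 &= (a_0 - k_0)^2 \\
c_1 &= 2 (a_0 - k_0)(a_1 - k_1) \\
c_2 &= 2 a_2 (a_0 - k_0) + (a_1 - k_1)^2 \\
c_3 &= 2 a_2 (a_1 - k_1) \\
c_4 &= a_2^2
\end{aligned}
$$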

Put into this form it is clear that forming the definite integral of this function (to form the sum of squares) is relatively straightforward; we merely need to apply the formula for the integral of a power of a variable to each term in equation 68-24. We recall that from elementary calculus the integral of a power of a variable is

$$\int X^n \, dX = \frac{X^{n+1}}{n+1} \tag{68-25}$$

Applying this formula to equation 68-24, we achieve

$$SSD = c_0 \int_{X_{low}}^{X_{high}} 1 \, dX + c_1 \int_{X_{low}}^{X_{high}} X \, dX + c_2 \int_{X_{low}}^{X_{high}} X^2 \, dX + c_3 \int_{X_{low}}^{X_{high}} X^3 \, dX + c_4 \int_{X_{low}}^{X_{high}} X^4 \, dX \tag{68-26}$$

$$SSD = c_0 X \Big|_{X_{low}}^{X_{high}} + c_1 \frac{X^2}{2} \Big|_{X_{low}}^{X_{high}} + c_2 \frac{X^3}{3} \Big|_{X_{low}}^{X_{high}} + c_3 \frac{X^4}{4} \Big|_{X_{low}}^{X_{high}} + c_4 \frac{X^5}{5} \Big|_{X_{low}}^{X_{high}} \tag{68-27}$$

where the various $c_i$ represent the calculation based on the corresponding coefficients of the quadratic and linear fitting functions, as indicated in equation 68-18. The denominator term for the ratio is derived from equation 68-22 in similar fashion; the result is

$$SSY = k_1^2 \frac{X^3}{3} \Big|_{X_{low}}^{X_{high}} - 2 k_1^2 \bar{X} \frac{X^2}{2} \Big|_{X_{low}}^{X_{high}} + k_1^2 \bar{X}^2 X \Big|_{X_{low}}^{X_{high}} \tag{68-28}$$

And the measure of nonlinearity is then the result of equation 68-27 divided by equation 68-28.
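The closed-form calculation can be sketched in Python as below. The function names are hypothetical, and the sketch makes one assumption explicit: since the synthetic X-values are taken to be uniformly distributed over the range, the mean of X is set to the midpoint of the range. The middle term of the denominator polynomial carries the minus sign of equation 68-22. The worked numbers in the usage lines are also an assumption: they use the quadratic a = (0, 1 + k, −k) on the range 0 to 1, for which the least-squares straight line in the continuous limit has intercept k/6 and slope 1.

```python
def poly_definite_integral(coeffs, x_low, x_high):
    """Integrate c0 + c1*X + c2*X**2 + ... between x_low and x_high,
    applying the power rule to each term."""
    return sum(
        c * (x_high ** (n + 1) - x_low ** (n + 1)) / (n + 1)
        for n, c in enumerate(coeffs)
    )


def nonlinearity_by_integration(quad_coeffs, lin_coeffs, x_low, x_high):
    """Ratio of the integrated squared difference between the two fits to
    the integrated squared linear term."""
    a0, a1, a2 = quad_coeffs
    k0, k1 = lin_coeffs
    d0, d1 = a0 - k0, a1 - k1

    # Coefficients of the polynomial for the squared difference.
    c = [d0**2, 2 * d0 * d1, 2 * a2 * d0 + d1**2, 2 * a2 * d1, a2**2]
    ssd = poly_definite_integral(c, x_low, x_high)

    # Assumption: X is uniform on the range, so its mean is the midpoint.
    x_bar = 0.5 * (x_low + x_high)
    ssy = poly_definite_integral(
        [k1**2 * x_bar**2, -2 * k1**2 * x_bar, k1**2], x_low, x_high
    )
    return ssd / ssy


k = 1.0
print(nonlinearity_by_integration((0.0, 1.0 + k, -k), (k / 6.0, 1.0), 0.0, 1.0))
```

For this assumed example the two integrals work out to k²/180 and 1/12, so the ratio is k²/15, which is the limit that a point-by-point sum over more and more uniformly spaced values approaches under the same definitions.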

NEWS FLASH!!

It will be helpful at this point to again review the background of why (non)linearity is important, in order to understand why we bring up the “News Flash”. In the context of multivariate spectroscopic calibration, for many years most of the attention was on the issues of noise effects (noise and error in both the X (spectral) and the Y (constituent values) variables). The only attention paid to the relation between them was the effect of the calibration algorithm used, and how it affected and responded to the noise content of the data. There is another key relationship between the X and the Y data, and that is the question of whether the relationship is linear, but that is not addressed. In fact, hardly anybody talks (or writes) about it even though it is probably the only remaining major effect that influences the behavior of calibration models. A thorough understanding of it would probably allow us to solve most of the remaining problems in creating calibrations. A nonlinear relation can potentially cause larger errors than any random phenomenon affecting a data set (see, for example, reference [8]). The question of linearity inevitably interacts with the distribution of constituent values in the samples (not only of the analyte but of the interferences as well; see the referenced Applied Spectroscopy article).

I first got my attention turned onto this issue back when MLR was king of the NIR hill, and we could not understand how the wavelength selection process worked, and why it picked certain wavelengths that appeared to have no special character. The Y-error was the same for all sets of wavelengths. The X-error might vary somewhat from wavelength to wavelength, but the precision of the NIR instruments was so good that the maximum differences in random absorbance error simply could not account for the variations in the wavelengths chosen. Eventually the realization arose that the only explanation that was never investigated was that a wavelength selection algorithm would find those wavelengths where the fit (in terms of linearity of the absorbance versus constituent concentrations) of the data could change far more than any possible change in random error.

Considerations of nonlinearity potentially explain lots of things: the inability to extrapolate models, the “unknown” differences between instruments that prevent calibration transfer, and so on. Recently we wrote some chapters that showed that PCR and PLS are also subject to the effects of nonlinearity and are not simply correctable (see Chapters 29–33 in this book, as well as references [9–14]). So there is a big effect here that hardly anybody is paying attention to, at least not insofar as they are quantitatively evaluating the effect on calibration models. I think this is key, because it is inevitably one of the major causes of error in the X variable (at least, as long as the X-variable represents instrument readings).

Now here is the news flash: we recently became aware that Philip Brown has written a paper [15] nominally dealing with wavelength selection methods for calibration using the MLR algorithm (more about this paper later). We are old MLR advocates (since 1976, when we first got involved with NIR and MLR was the only calibration algorithm used in the NIR world). But what has happened is that until fairly recently the role of nonlinearity in the selection of wavelengths for MLR, as well as other effects on the modeling process, has been mostly ignored (and only partly because MLR itself has been mostly ignored until fairly recently).
For a long time, however, there was much confusion in the NIR world over the question of why computerized wavelength searches would often select wavelengths on the side of absorption bands instead of at the peaks (or in other unexpected places), and manual selection of wavelengths at absorption peaks would produce models that did not perform as well as when the wavelengths on the side of the peaks were used. This difference existed in calibration, validation, and in long-term usage. It also was (and still is, for that matter) independent of the methods of wavelength selection used. This behavior puzzled the NIR community for a long time, especially since it was well-known that a wavelength on the side of an absorbance band would be far more sensitive to small changes in the actual wavelength measured by an instrument (due to non-repeatability of the wavelength selection portion of the instrument) than a wavelength at or near the peak, and we expected that random error from that source should dominate the wavelength selection process. In hindsight, of course, we recognize that if a nonlinear effect exists in the data, it will implicitly affect the modeling process, regardless of whether the nonlinearity is recognized or not. There are other “mysteries” in NIR (and other applications of chemometrics) that nonlinearity can also explain. For example, as indicated above, one is the difficulty of transferring calibration models between instruments, even of the same type. Where would our technological world be if a manufacturer of, say, rulers could not reliably transfer the calibration of the unit of length from one ruler to the next? But here is what Philip Brown did: He took a different tack on the question. He set up and performed an experiment wherein he took different sugars (fructose, glucose, and sucrose) and made up solutions by dissolving them in water, each at five different concentration levels, and made solutions using all combinations of concentrations. 
That gave an experimental design with 125 samples. He then measured the spectra of all of those samples. Since the samples were all clear solutions there were no extraneous effects due to optical scatter.

The nifty thing he then did was this: he applied an ANOVA to the data, to determine which wavelengths were minimally affected by nonlinearity. We have discussed ANOVA in these chapters also, back when it was still called “Statistics in Spectroscopy” [16–19] although, to be sure, our discussions were at a fairly elementary level. The experiment that Philip Brown did is eminently suitable for that type of computation. The experiment was formally a three-factor multilevel full-factorial design. Any nonlinearity in the data will show up in the analysis as what Statisticians call an “interaction” term, which can even be tested for statistical significance. He then used the wavelengths of maximum linearity to perform calibrations for the various sugars. We will discuss the results below, since they are at the heart of what makes this paper important.

This paper by Brown is very welcome. The samples are four-component sugar solutions (water being one of the components, even though it is ignored in the analysis, which, by the way, may be a mistake; we will also discuss that further below). The use of this experimental design is a good way to analyze the various effects he investigates, but is unfortunately not applicable to the majority of sample types that are of interest in “real” applications, where neither experimental designs nor non-scattering samples are available or can be generated. In fact, it can be argued that the success of NIR as an analytical method is largely due to the fact that it can be applied to all those situations of interest where neither of those characteristics exists (in addition to the reasons usually given about it being non-destructive, etc.). Nevertheless, we must recognize that in trying to uncover new information about a technique, “walking before we run” is necessary and desirable, and this paper should be taken in that spirit.
Especially since Brown does explicitly consider and directly attack the question of nonlinearity, which is a favorite topic of ours (in case you couldn’t tell), largely because it has mostly been previously ignored as a contributor to the error in calibration modeling, and because the effects occur in very subtle ways, which is largely what has hidden this phenomenon from our view.

Not that questions of nonlinearity had been completely ignored in the past. Not only had we taken an interest as far back as 1988 [8], but others in the chemometric community have also, for example [20], who was able to successfully extrapolate a model despite nonlinearity in the data. The problem with these efforts is that they are idiosyncratic to the data set being analyzed. Whether a particular calibration can be extrapolated or not is beside the point. Missing is a general method to determine whether a model based on a given data set will or will not be extrapolatable. Brown’s paper demonstrates a novel approach to the problem, which shows promise for being the basis of that type of general method, and for that reason is new and exciting.

Overall, Brown’s paper is a wonderful paper, despite the fact that there are some criticisms. The fact that it directly attacks the issue of nonlinearity in NIR is one reason to be so pleased to see it, but the other main reason is that it uses well-known and well-proven statistical methodology to do so. It is delightful to see classical Statistical tools used as the primary tool for analyzing this data set.

Since we tend to be rather disagreeable sorts, let us start by disagreeing with a couple of statements Brown makes. First, while discussing the low percentage of variance in the 1900-nm region accounted for by the sugars, he states “ where there is most variability in the spectrum and might wrongly be favored region in some methods of analysis” (at the top of p. 257).
We have to disagree with his decision that using the 1900-nm region is “wrong”. This is a value judgment and not supported by any evidence. To the contrary, he is erroneously treating the water component of the mixtures as though it had no absorbance, despite his recognition that water, and the 1900-nm region in particular, has the strongest absorbance of any component in his samples.

Why say this? Because of the result of combining two facts:

1. The system is closed in that the total concentrations of all four components add to 100%, and also because the total variance due to all four components (and interactions, etc.) adds to 100%.
2. The water not only has absorbance, it is the strongest absorber in the mixtures.

If water had no absorbance, that is, if it were the “perfect non-absorbing solvent” that we like to deal with, then Brown’s statement would be correct: it would not contribute to the variance and the three sugars would be the source of all variance. But in that case the total variance in the 1900-nm region would also be less than it actually is, so we cannot say a priori what would happen “if”.

But we can say the following: since the absorbance of the water in that wavelength region is strong, we can consider the possibility that a measurement there will be a (inverse, to be sure) measure of “total sugar” or some equivalent. However, the way the experiment is set up precludes a determination of the presence of nonlinearity of the water absorbance in that region. If it were linear, then it should be determinable with the least error of all four components, since it has the strongest absorbance and therefore any fixed amount of random error would have the least relative effect. Then it would be a matter of determining which two sugars could be determined most accurately, and then the third by difference. This is essentially what he does for the linear effects he analyzes, so this would not be breaking any new ground, just using the components that are most accurately determined to compute all concentrations. But to get back to where this all came from, this is the reason we disagree with his statement that using the water absorbance is “wrong”.
Now let us do a thought experiment, illustrated in Figure 68-1 (Figure 68-1a is copied from [5]): imagine a univariate spectroscopic calibration (with some random error superimposed) that follows what is essentially a straight line, as shown, over some range of values. Now raise the question: what prevents extrapolating that calibration? We believe it is nonlinearity. For the univariate case it is well-nigh self-evident. At least it is to us – see Figure 68-1b. As Figure 68-1b shows, if the underlying data is linear, (a)

[Figure 68-1, panels (a) and (b): test results plotted against analyte concentration; panel (b) marks the end of the original range and the extrapolated data.]

Figure 68-1 (a) Artificial data representing a linear relationship between the two variables. This data represents a linear, one-variable calibration. (b) The same artificial data extended in a linear manner. The extrapolated calibration line (broken line) can predict the data beyond the range of the original calibration set with equivalent accuracy, as long as the data itself is linear.


there should be no problem extending the calibration line (the extension being shown as a broken line) and using the extended line to perform the analysis with the same accuracy as the original data was analyzed. Yet, to not be able to extrapolate a calibration is something “everyone knows”. What nobody knows, near as we can tell, is why we have to put up with that limitation. There are a couple of other, low-probability, answers that could be brought up, such as some sort of discontinuity in one or the other of the variables, but otherwise, any deviation of the data in the region of extrapolation would ipso facto indicate nonlinearity. Therefore, by far the most common cause of not being able to extrapolate that calibration is nonlinearity (almost by definition: a departure from the straight line is essentially the definition of nonlinearity).

Engineers can point to various known physical phenomena of instruments to explain where nonlinearity in spectra can arise: stray light at the high-absorbance end and detector saturation effects at the low-absorbance end of the ranges, for example. Chemists can point to chemical interactions between the components as a source of nonlinearity at any part of the range. But mathematically, if you can make those effects go away, there is no reason left why you could not reliably extrapolate the calibration model.

Now let us consider a two-wavelength model for one of the components in a solution containing two components in a nonabsorbing solvent (hypothetical case, NOT water in the NIR!). Nonlinearity in the relationship of each of the two components to its absorbance will have a different effect. If the component being calibrated for has a nonlinear relationship, that will show up in the plot of the predicted versus actual values, as a more-or-less obvious curvature in the plot, somewhat as we showed as Figure 68-1b in our Chapter 68 [1].
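Returning to the univariate thought experiment of Figure 68-1, the argument can be tried out numerically. The following is only a minimal sketch under stated assumptions: the concentration ranges, the noise level and the quadratic `curvature` term are all invented for illustration and are not taken from any data discussed here. It fits a straight line over an "original range" and then applies that line beyond the end of the range, once for data that really is linear and once for data with a small curvature.

```python
import numpy as np

def extrapolation_rmse(curvature, seed=0):
    """Fit a straight line on 0..10 and apply it on 10..20.

    `curvature` is a hypothetical quadratic coefficient; 0 means the
    underlying response is exactly linear.
    """
    rng = np.random.default_rng(seed)
    noise_sd = 0.05  # assumed random error in the test results

    def response(conc):
        # True response: straight line plus optional curvature.
        return 1.0 * conc + curvature * conc**2

    # Calibration set: the "original range" of Figure 68-1a.
    x_cal = np.linspace(0.0, 10.0, 50)
    y_cal = response(x_cal) + rng.normal(0.0, noise_sd, x_cal.size)
    slope, intercept = np.polyfit(x_cal, y_cal, 1)

    # Extrapolation set: beyond the end of the original range.
    x_ext = np.linspace(10.0, 20.0, 50)
    y_ext = response(x_ext) + rng.normal(0.0, noise_sd, x_ext.size)

    rmse_cal = np.sqrt(np.mean((y_cal - (slope * x_cal + intercept)) ** 2))
    rmse_ext = np.sqrt(np.mean((y_ext - (slope * x_ext + intercept)) ** 2))
    return rmse_cal, rmse_ext

linear_cal, linear_ext = extrapolation_rmse(curvature=0.0)
curved_cal, curved_ext = extrapolation_rmse(curvature=-0.01)
print(f"linear data: calibration RMSE {linear_cal:.3f}, extrapolated RMSE {linear_ext:.3f}")
print(f"curved data: calibration RMSE {curved_cal:.3f}, extrapolated RMSE {curved_ext:.3f}")
```

In this sketch the extrapolated error for the linear data should stay of the same order as the calibration error, while even a slight curvature, barely visible inside the original range, should inflate the extrapolated error considerably.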
A nonlinear relationship in the “other” component, however, will not show up that way. Let us try to draw a word picture to describe what we are trying to say here (the way we draw, this is by far the easier way): since we could imagine this being plotted in three dimensions, the nonlinear relation will be in the depth dimension, and will be projected on the plane of the predicted-versus-actual plot of the component being calibrated for. In this projection, the nonlinearity will show up as an “extra” error superimposed on the data, and will be in addition to whatever random error exists in the “known” values of the composition. Unless the concentrations of the “other” component are known, there is no way to separate the effects of the nonlinearity from the random error, however. While we cannot actually draw this picture, graphical illustrations of these effects have been previously published [8].

Again, however, if there is perfect linearity in the relationship of the absorbance at both wavelengths with respect to the concentrations of the components, one should equally well be able to extrapolate the model beyond the range of either or both components in the calibration set, just as in the univariate case. The problem is knowing where, and how much, nonlinearity exists in the data. Here is where Brown has made a good start on determining this property in his paper back in 1993, at least for the limited case he is dealing with: a designed experiment with (optically) nonscattering samples.

Now for Philip Brown’s main (by our reckoning) result: when he used the wavelengths of minimum nonlinearity to perform the calibration, he found that he was indeed able to extrapolate the calibration. Repeat: under circumstances where the effects of data nonlinearity (from all sources) are minimized, he was able to extrapolate the calibration.
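The word picture of nonlinearity in the “other” component can also be sketched numerically. What follows is an illustration under assumptions, not a reconstruction of any published data: the absorptivities, the concentration ranges and the `quad` curvature term (applied to the other component at the second wavelength only) are hypothetical. The calibration for the first component is a two-wavelength inverse least squares fit, and the residuals are then examined along each concentration axis.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
c1 = rng.uniform(0.0, 1.0, n)   # component being calibrated for
c2 = rng.uniform(0.0, 1.0, n)   # the "other" component

def calibrate_c1(quad):
    # Hypothetical absorptivities; `quad` adds curvature to the response
    # of the other component at the second wavelength only.
    a1 = 1.0 * c1 + 0.3 * c2
    a2 = 0.2 * c1 + 1.0 * c2 + quad * c2**2
    X = np.column_stack([np.ones(n), a1, a2])
    coef, *_ = np.linalg.lstsq(X, c1, rcond=None)
    return c1 - X @ coef   # residuals of the predicted-versus-actual plot

for quad in (0.0, -0.3):
    resid = calibrate_c1(quad)
    # A quadratic fit of the residuals against each concentration shows
    # how much of the residual is structured along that axis.
    left_vs_c1 = resid - np.polyval(np.polyfit(c1, resid, 2), c1)
    left_vs_c2 = resid - np.polyval(np.polyfit(c2, resid, 2), c2)
    print(f"quad={quad:+.1f}  residual SD={resid.std():.4f}  "
          f"unexplained vs c1={left_vs_c1.std():.4f}  "
          f"unexplained vs c2={left_vs_c2.std():.4f}")
```

With no curvature the model is exact and the residuals vanish. With curvature, the residuals plotted against the calibrated component look like unstructured scatter, and only a plot against the other component's concentration, which must be known to make it, reveals them as a smooth curve.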


In this paper he makes the statement, “One might argue that trying to predict values of composition outside the data used in calibration breaks the cardinal rule of not predicting outside the training data.” He seems almost surprised at being able to do that. But given our discussion above, he should not be. So in this case it is not surprising that he is able to extrapolate the predictions – we think that it is inevitable, since he has found a way to utilize only those wavelengths where nonlinearity is absent. Now what we need are ways to extend this approach to samples more nearly like “real” ones. And if we can come up with a way to determine the spectral regions where all components are linearly related to their absorbances, the issue of not being able to extrapolate a calibration should go away. Surely it is of scientific as well as practical and commercial interest to understand the reasons we cannot extrapolate calibration models. And then devise ways to circumvent those limitations.

Chemometricians do not believe that good calibration diagnostics properly interpreted can estimate prediction performance, and insist on a separate validation data set. Statisticians, on the other hand, do believe that. Certainly, it is good practice and statisticians also prefer to verify the estimates through the use of validation data when that is available, but in some cases it is not available. In those cases, having generalized statistics available so that you can know when a model will be a good estimate of prediction performance is a major benefit. Statisticians have a long history of dealing with situations of limited data. In one sense we are “spoiled” by having our data being easy and cheap to acquire, so that asking for another 1,000 data points is usually no problem. But any experienced statistician has been in situations where each experiment, giving only one data point each, might cost upward of $10^6.
Estimating prediction performance from the calibration data becomes VERY important under those circumstances. Especially when, say, an “outlier” could mean a fatality. Under those circumstances you do not get a whole lot of volunteers for just testing the prediction performance of the model – you have got to know you can rely on it before you “predict”!

The problem that statisticians have had regarding linearity is the same one that everybody else has had: they have not had a good statistic for determining linearity any more than anybody else, so they also have been limited to idiosyncratic empirical methods. But Philip Brown’s approach may just form the basis of one. Obviously, however, someone needs to do more research on that topic.

I contacted Philip Brown and asked him about this topic. Unfortunately, linearity per se is not of interest to him; the emphasis of the paper he wrote was on the role of linearity in the wavelength-selection process, not the nonlinearity itself. Furthermore, in the years since that paper appeared, his interests have changed and he is no longer pursuing spectroscopic applications. But how to extend the work to understand the role of nonlinearity in calibration, how to deal with it when an experimental design is not an option, and what to do when optical scatter is the dominant phenomenon in the measurement of samples’ spectra are still very open questions.
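The earlier remark that calibration diagnostics can, properly interpreted, estimate prediction performance can be illustrated with one standard textbook diagnostic. This is our own choice of example, not a method taken from Brown's paper: the leave-one-out prediction errors of an ordinary least squares model, computed from the calibration residuals and leverages alone. The data below are synthetic, and the model size and noise level are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 predictors
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(0.0, 0.1, n)

# Fit once on the whole calibration set.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

# Leverages: diagonal of the hat matrix X (X^T X)^-1 X^T.
leverage = np.einsum("ij,ij->i", X @ np.linalg.inv(X.T @ X), X)

# Leave-one-out prediction errors obtained from calibration quantities only.
loo_shortcut = resid / (1.0 - leverage)

# Brute-force check: actually refit the model n times.
loo_explicit = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    c_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    loo_explicit[i] = y[i] - X[i] @ c_i

press_rmse = np.sqrt(np.mean(loo_shortcut**2))
print(f"calibration RMSE {np.sqrt(np.mean(resid**2)):.4f}, leave-one-out RMSE {press_rmse:.4f}")
print("shortcut matches refitting:", np.allclose(loo_shortcut, loo_explicit))
```

The point of the sketch is only that no sample had to be held back: the prediction-type error is obtained from a single fit, which matters when each additional data point is expensive.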

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 20(1), 56–59 (2005).
2. Mark, H. and Workman, J., Spectroscopy 20(3), 34–39 (2005).
3. Mark, H. and Workman, J., Spectroscopy 20(4), 38–39 (2005).
4. Mark, H. and Workman, J., Spectroscopy 20(9), 26–35 (2005).
5. Mark, H. and Workman, J., Spectroscopy 20(12), 96–100 (2005).
6. Mark, H. and Workman, J., Spectroscopy 2(9), 37–43 (1987).
7. Mark, H. and Workman, J., Statistics in Spectroscopy, 1st ed. (Academic Press, New York, 1991).
8. Mark, H., Applied Spectroscopy 42(5), 832–844 (1988).
9. Mark, H. and Workman, J., Spectroscopy 13(6), 19–21 (1998).
10. Mark, H. and Workman, J., Spectroscopy 13(11), 18–21 (1998).
11. Mark, H. and Workman, J., Spectroscopy 14(1), 16–17 (1998).
12. Mark, H. and Workman, J., Spectroscopy 14(2), 16–27, 80 (1999).
13. Mark, H. and Workman, J., Spectroscopy 14(5), 12–14 (1999).
14. Mark, H. and Workman, J., Spectroscopy 14(6), 12–14 (1999).
15. Brown, P., Journal of Chemometrics 7, 255–265 (1993).
16. Mark, H. and Workman, J., Spectroscopy 5(9), 47–50 (1990).
17. Mark, H. and Workman, J., Spectroscopy 6(1), 13–16 (1991).
18. Mark, H. and Workman, J., Spectroscopy 6(4), 52–56 (1991).
19. Mark, H. and Workman, J., Spectroscopy 6(July/August), 40–44 (1991).
20. Kramer, R., Chemometric Techniques for Quantitative Analysis (Marcel Dekker, New York, 1998).


69

Connecting Chemometrics to Statistics: Part 1 – The Chemometrics Side

We have been writing about statistics and chemometrics for a long time. Long-time readers of the column series published in Spectroscopy magazine will recall that the series name has changed since its inception. The original name was “Statistics in Spectroscopy” (which was a multiple pun, since it referred to Statistics in Spectroscopy and Statistics in Spectroscopy as well as statistics (the subject of Statistics) in Spectroscopy (see our third column ever [1] for a discussion of the double meaning of the word “Statistics”. The same discussion is found in the book based on those first 38 columns in the earlier “Statistics” series [2])). Our goal then, as it is now, was to bring the study of Chemometrics and the study of Statistics closer together. While there are isolated points of light, it seems that many people who study Chemometrics have no interest in and do not appreciate the Statistical background upon which many of our Chemometric techniques are based, nor the usefulness of the techniques that we could learn from that discipline. Worse, there are some who actively denigrate and oppose the use of Statistical concepts and techniques in the chemometric analysis of data. The first group can perhaps claim unfamiliarity (ignorance(?)) with Statistical concepts. It is difficult, however, to find excuses for the second group.

Nevertheless, at its very fundamental core, there is a very deep and close connection between the two disciplines. How could it be otherwise? Chemometric concepts and techniques are based on principles that were formulated by mathematicians hundreds of years ago, even before the label “Statistics” was applied to the subfield of Mathematics that deals with the behavior and effect of random numbers on data. Nevertheless, recognition of Statistics as a distinct subdiscipline of Mathematics also goes back a long way, certainly long before the term “Chemometrics” was coined to describe a subfield of that subfield.
Before we discuss the relationship between these two disciplines, it is, perhaps, useful to consider what they are. We have already defined “Statistics” as “... the study of the properties of random numbers ...” [3]. A definition of “Chemometrics” is a little trickier to come by. The term was originally coined by Kowalski, but nowadays many Chemometricians use the definition by Massart [4]. On the other hand, one compilation presents nine different definitions for “Chemometrics” [5, 6] (including “What Chemometricians do”, a definition that apparently was suggested only HALF humorously!). But our goal here is not to get into the argument over the definition of the term, so for our current purposes, it is convenient to consider a perhaps somewhat simplified definition of “Chemometrics” as meaning “multivariate methods of data analysis applied to data of chemical interest”.


This definition is convenient because it allows us to then jump directly to what is arguably the simplest Chemometric technique in use, and consider that as the prototype for all chemometric methods; that technique is multiple regression analysis. Written out in matrix notation, multiple regression analysis takes the form of a relatively simple matrix equation:

$$B = (A^{T} A)^{-1} A^{T} C \tag{69-1}$$

where B represents the vector of coefficients, A represents the matrix of independent variables and C represents the vector of dependent variables. One part of that equation, $(A^{T} A)^{-1} A^{T}$, appears so commonly in chemometric equations that it has been given a special name: it is called the pseudoinverse of the matrix A. The uninverted term $A^{T} A$ is itself fairly commonly found, as well. The pseudoinverse appears as a common component of chemometric equations because it confers the Least Squares property on the results of the computations; that is, for whatever is being modeled, the computations defined by equation 69-1 produce a set of coefficients that give the smallest possible sum of the squares of the errors, compared to any other possible linear model.

HUH?? It does? How do we know that? Well, let us derive equation 69-1 and see. We start by assuming that the relationship between the independent variables and the dependent variable can be described by a linear relationship:

$$C = \beta A \tag{69-2}$$

where β, as we have noted previously, represents the “true”, or Population values of the coefficients [1]. Equation 69-2 expresses what is often called the “Inverse Least Squares”, or P-matrix, approach to calibration. Since we do not know what the true values of the coefficients are, we have to calculate some approximation to them. We therefore express the calculation in terms of “statistics”, quantities that we can calculate from the data (see that same chapter for further discussion of these points):

$$C = bA \tag{69-3}$$
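Before following the derivation, the Least Squares claim made for equation 69-1 can be checked numerically. The sketch below rests on assumptions that are ours, not the text's: equation 69-1 is read in column-vector form, $B = (A^{T} A)^{-1} A^{T} C$, with one row of A per sample (the derivation in the text writes the model with the coefficients on the left, $C = bA$, which is the transposed layout), and the data, the `true_coef` values and the noise level are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples, n_wavelengths = 20, 3

# A: one row per sample, one column per independent variable (assumed layout).
A = rng.normal(size=(n_samples, n_wavelengths))
true_coef = np.array([0.5, -1.0, 2.0])            # hypothetical population values
C = A @ true_coef + rng.normal(0.0, 0.05, n_samples)

# Equation 69-1 written out literally.
B = np.linalg.inv(A.T @ A) @ A.T @ C

def sse(coef):
    """Sum of squared errors of the linear model A @ coef."""
    return float(np.sum((C - A @ coef) ** 2))

# Any other set of coefficients should give a larger sum of squares.
perturbed = [B + rng.normal(0.0, 0.01, n_wavelengths) for _ in range(1000)]
print("SSE at B:", sse(B))
print("smallest SSE among perturbed models:", min(sse(p) for p in perturbed))
print("agrees with np.linalg.lstsq:", np.allclose(B, np.linalg.lstsq(A, C, rcond=None)[0]))
```

The explicit inverse is used here only to mirror the equation; in practice a solver such as `np.linalg.lstsq` is usually the better-conditioned way to get the same coefficients.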

How are we going to perform that calculation? Well, to start with, we need something to base it on, and the consensus is that the calculation will be based on the errors, since in truth, equation 69-3 is not exactly correct because C will in general NOT be exactly equal to bA. Therefore we extend equation 69-3:

$$C = bA + \text{error} \tag{69-4}$$

Now that we have a correct equation, we want to solve this equation (or equation 69-3, which is essentially equivalent) for b. Now, if matrix A had the same number of rows and columns (a square matrix), we could form its inverse, and multiply both sides of equation 69-3 by $A^{-1}$:

$$C A^{-1} = b A A^{-1} \tag{69-5}$$


and since multip


For information on all Academic Press publications visit our website at books.elsevier.com

Printed and bound in USA 07 08 09 10 11 10 9 8 7 6 5 4 3 2 1

Working together to grow libraries in developing countries www.elsevier.com | www.bookaid.org | www.sabre.org

Dedication

To our families and to our readers ... – Howard Mark and Jerry Workman


Contents

Preface
Note to Readers
1. A New Beginning ...
2. Elementary Matrix Algebra: Part 1
3. Elementary Matrix Algebra: Part 2
4. Matrix Algebra and Multiple Linear Regression: Part 1
5. Matrix Algebra and Multiple Linear Regression: Part 2
6. Matrix Algebra and Multiple Linear Regression: Part 3 – The Concept of Determinants
7. Matrix Algebra and Multiple Linear Regression: Part 4 – Concluding Remarks
8. Experimental Designs: Part 1
9. Experimental Designs: Part 2
10. Experimental Designs: Part 3
11. Analytic Geometry: Part 1 – The Basics in Two and Three Dimensions
12. Analytic Geometry: Part 2 – Geometric Representation of Vectors and Algebraic Operations
13. Analytic Geometry: Part 3 – Reducing Dimensionality
14. Analytic Geometry: Part 4 – The Geometry of Vectors and Matrices
15. Experimental Designs: Part 4 – Varying Parameters to Expand the Design
16. Experimental Designs: Part 5 – One-at-a-time Designs
17. Experimental Designs: Part 6 – Sequential Designs
18. Experimental Designs: Part 7 – �, the Power of a Test
19. Experimental Designs: Part 8 – �, the Power of a Test (Continued)
20. Experimental Designs: Part 9 – Sequential Designs Concluded
21. Calculating the Solution for Regression Techniques: Part 1 – Multivariate Regression Made Simple
22. Calculating the Solution for Regression Techniques: Part 2 – Principal Component(s) Regression Made Simple
23. Calculating the Solution for Regression Techniques: Part 3 – Partial Least Squares Regression Made Simple
24. Looking Behind and Ahead: Interlude
25. A Simple Question: The Meaning of Chemometrics Pondered
26. Calculating the Solution for Regression Techniques: Part 4 – Singular Value Decomposition
27. Linearity in Calibration
28. Challenges: Unsolved Problems in Chemometrics
29. Linearity in Calibration: Act II Scene I
30. Linearity in Calibration: Act II Scene II – Reader’s Comments ...
31. Linearity in Calibration: Act II Scene III
32. Linearity in Calibration: Act II Scene IV
33. Linearity in Calibration: Act II Scene V
34. Collaborative Laboratory Studies: Part 1 – A Blueprint
35. Collaborative Laboratory Studies: Part 2 – using ANOVA
36. Collaborative Laboratory Studies: Part 3 – Testing for Systematic Error
37. Collaborative Laboratory Studies: Part 4 – Ranking Test
38. Collaborative Laboratory Studies: Part 5 – Efficient Comparison of Two Methods
39. Collaborative Laboratory Studies: Part 6 – MathCad Worksheet Text
40. Is Noise Brought by the Stork? Analysis of Noise: Part 1
41. Analysis of Noise: Part 2
42. Analysis of Noise: Part 3
43. Analysis of Noise: Part 4
44. Analysis of Noise: Part 5
45. Analysis of Noise: Part 6
46. Analysis of Noise: Part 7
47. Analysis of Noise: Part 8
48. Analysis of Noise: Part 9
49. Analysis of Noise: Part 10
50. Analysis of Noise: Part 11
51. Analysis of Noise: Part 12
52. Analysis of Noise: Part 13
53. Analysis of Noise: Part 14
54. Derivatives in Spectroscopy: Part 1 – The Behavior of the Derivative
55. Derivatives in Spectroscopy: Part 2 – The “True” Derivative
56. Derivatives in Spectroscopy: Part 3 – Computing the Derivative
57. Derivatives in Spectroscopy: Part 4 – Calibrating with Derivatives
58. Comparison of Goodness of Fit Statistics for Linear Regression: Part 1 – Introduction
59. Comparison of Goodness of Fit Statistics for Linear Regression: Part 2 – The Correlation Coefficient
60. Comparison of Goodness of Fit Statistics for Linear Regression: Part 3 – Computing Confidence Limits for the Correlation Coefficient
61. Comparison of Goodness of Fit Statistics for Linear Regression: Part 4 – Confidence Limits for Slope and Intercept
62. Correction and Discussion Regarding Derivatives
63. Linearity in Calibration: Act III Scene I – Importance of Nonlinearity
64. Linearity in Calibration: Act III Scene II – A Discussion of the Durbin-Watson Statistic, a Step in the Right Direction
65. Linearity in Calibration: Act III Scene III – Other Tests for Nonlinearity
66. Linearity in Calibration: Act III Scene IV – How to Test for Nonlinearity
67. Linearity in Calibration: Act III Scene V – Quantifying Nonlinearity
68. Linearity in Calibration: Act III Scene VI – Quantifying Nonlinearity, Part II, and a News Flash
69. Connecting Chemometrics to Statistics: Part 1 – The Chemometrics Side
70. Connecting Chemometrics to Statistics: Part 2 – The Statistics Side
71. Limitations in Analytical Accuracy: Part 1 – Horwitz’s Trumpet
72. Limitations in Analytical Accuracy: Part 2 – Theories to Describe the Limits in Analytical Accuracy
73. Limitations in Analytical Accuracy: Part 3 – Comparing Test Results for Analytical Uncertainty
74. The Statistics of Spectral Searches
75. The Chemometrics of Imaging Spectroscopy
Glossary of Terms
Index
Colour Plate Section


Preface

This large single volume fulfils the need for chemometric-based tutorials on topics of interest to analytical chemists or other scientists performing modern mathematical and statistical operations for use with analytical measurements. The book covers a very broad range of chemometric topics as indicated in the extensive table of contents. This book is a collection of the series of columns first published in Spectroscopy providing detailed mathematical and philosophical discussions on the use of chemometrics and statistical methods for scientific measurements and analytical methods. In addition, the new revolution in biotechnology and the use of spectroscopic techniques therein provides an opportunity for those scientists to strengthen their use of mathematics and calibration through the use of this book.

Subjects covered include those of interest to many groups of scientists, mathematicians, and practicing analysts for daily problem solving as well as detailed insights into subjects difficult to thoroughly grasp for the non-specialist. The coverage relies more on concept delineation than on rigorous mathematics, but the descriptive mathematics and derivations are included for the more rigorously minded. Sections on matrix algebra, analytic geometry, experimental design, instrument and system calibration, noise, derivatives and their use in data analysis, linearity and nonlinearity are included. Collaborative laboratory studies, using ANOVA, testing for systematic error, ranking tests for collaborative studies, and efficient comparison of two analytical methods are included. Discussions of topics such as the limitations in analytical accuracy, and brief introductions to the statistics of spectral searches and the chemometrics of imaging spectroscopy, are included.

The popularity of the Chemometrics in Spectroscopy series (ongoing since the early 1990s) as well as the Statistics in Spectroscopy series and books has been overwhelming and we sincerely thank our readership over the years. We have received e-mails from many people, one memorable one thanking us that a career change was made due to the renewed and stimulated interest in statistics and chemometrics due largely to our thought-provoking columns. We hope you find this collection useful and will continue to read the columns and write to us with your thoughts, comments, and questions regarding this stimulating topic.

Howard Mark
Suffern, NY

Jerry Workman
Madison, WI


Note to Readers

In some cases there were errors, both trivial and significant, in the original column from which a given chapter was taken. Sometimes we found the error ourselves (unfortunately after the column was printed) and sometimes, more embarrassingly, the error was brought to our attention by one of our ever-vigilant readers. For all significant errors, the necessary corrections were made in a subsequent column; in all cases, the corrected version is what is in this book. Sometimes, for the more serious errors, we note that the corresponding column was erroneous, so that any reader who wants to go back to the original will be aware that a comparison with what is presented here will fail.


1

A New Beginning ...

Why do we title this chapter “A New Beginning ...”? Well, there are a lot of reasons. First of all, of course, is the simple fact that that is just the way we do things. Secondly is the fact that we developed this book in much the same way we developed our previous book Statistics in Spectroscopy (SiS). Those of you out there who have followed the series of articles published in Spectroscopy magazine since 1986 know that for the most part, each column in the series was pretty much self-contained and could stand alone, yet also fit into that series in the appropriate place and contributed to the flow of information in that series as a whole. We hope to be able to reproduce that on a larger scale. Just as the series Statistics in Spectroscopy (this is too long to write out each time, from here on we will abbreviate it SiS) was self-contained and stood alone, so too will we try to make this new series stand alone, and at the same time be a worthy successor to SiS, and also continue to develop the concepts we began there.

Thirdly is the fact that we are finally starting to write again. To you, our readership, it may seem like we have been writing continuously since we began SiS, but in fact we have been running on backlog for a longer time than you would believe. That was advantageous in that it allowed us time to pursue our personal and professional lives including such other projects as arranging for SiS to be published as a book [1]. The downside of our getting ahead of ourselves, on the other hand, is that we were not able to keep you abreast of the latest developments related to our favorite topic. However, since the last time we actually wrote something, there have been a number of noteworthy developments. Our last series dealt only with the elementary concepts of statistics related to the general practice of calibration used for UV-VIS-NIR and occasionally for IR spectroscopy.
Our purpose in writing SiS was to help provide a small foot bridge to cross the gap between specialized chemometrics literature written at the expert level and those general statistics articles and texts dealing with examples and questions far removed from chemistry or spectroscopic practice. Since the beginning of the “Statistics” series in 1986, several reviews, tutorials, and textbooks have been published to begin the construction of a major highway bridging this gap. Most notably, at least in our minds, have been tutorial articles on classical least squares (CLS), principal components regression (PCR), and partial least squares regression (PLSR) by Haaland and Thomas [2, 3]. Other important work includes textbooks on calibration and chemometrics by Naes and Martens [4], and Mark [5]. Chemometric reviews discussing the progress of tutorial and textbook literature appear regularly in Analytical Chemistry, Critical Review issues. Another recent series of articles on chemometric concepts termed “The Chemometric Space” by Naes and Isaksson has appeared [6]. In addition, there is a North American chapter of the International Chemometrics Society (NAmICS) which we are told has


over 300 members. Those interested in joining or obtaining further information may contact Professor Thomas O’Haver at the Department of Chemistry and Biochemistry, University of Maryland, College Park, Maryland 20742 (Donald B. Dahlberg, 1993, personal communication).

All the foregoing was true as of when the Chemometrics column began in 1993. Now in 2006, when we are preparing this for book publication, there are many more sources of information about Chemometrics. However, since this is not a review of the field, we forbear to list them all, but will correct one item that has changed since then: to obtain information about NAmICS, or to join the discussion group, contact David Duewer at NIST ([email protected]) or send a message to the discussion group ([email protected]).

Finally, since imitation is the sincerest form of flattery (or so they tell us), we are pleased to see that others have also taken the route of printing longer tutorial discussions in the form of a series of related articles on a given topic. Two series that we have no qualms recommending, on topics related to ours, have appeared in some of the sister publications of Spectroscopy [7–15] (note: there have been recent indications that the series in Spectroscopy International has continued beyond the ones we have listed. If we can obtain more information we will keep you posted – Spectroscopy International has also undergone some transformations and it is not always easy to get copies).

So, overall the chemometrics bridge between the lands of the overly simplistic and severely complex is well under construction; one may find at least a single lane open by which to pass. So why another series? Well, it is still our labor of love to deal with specific issues that plague ourselves and our colleagues involved in the practice of multivariate qualitative and quantitative spectroscopic calibration.
Having collectively worked with hundreds of instrument users over 25 combined years of calibration problems, we are compelled, like bees loaded with pollen, to disseminate the problems, answers, and questions brought about by these experiences. Then what would a series named “Chemometrics in Spectroscopy” hope to cover which is of interest to the readers of “Spectroscopy”?

We have been taken to task (with perhaps some justice) for using the broader title label “Chemometrics in Spectroscopy” for what we have claimed will be discussions of the somewhat narrower range of topics included in the field of multivariate statistical algorithms applied to chemical problems, when the term “Chemometrics” actually applies to a much wider range of topics. Nevertheless, we will use this title, for a number of reasons. First, that is what we said we were going to do, and we hate to not follow through, even on such a minor point. Secondly, we have said before (with all due arrogance) that this is our column, and we have been pretty fortunate that the editors of Spectroscopy have always pretty much let us do as we please. Finally, at this point we consider the possibility that we may very well eventually extend our range to include some of these other topics that the broader term will cover. As of right now, some of the topics we foresee being able to expand upon over the series will include, but not be limited to

• The multivariate normal distribution
• Defining the bounds for a data set
• The concept of Mahalanobis distance
• Discriminant analysis and its subtopics of
  – Sample selection
  – Spectral matching (Qualitative analysis)
• Finding the maximum variance in the multivariate distribution
• Matrix algebra refresher
• Analytic geometry refresher
• Principal components analysis (PCA)
• Principal components regression (PCR)
• More on Multiple linear least squares regression (MLLSR), also known as Multiple linear regression (MLR) and P-matrix, and its sibling, K-matrix
• More on Simple linear least squares regression (SLLSR), also known as Simple least squares regression (SLSR) or univariate least squares regression
• Partial least squares regression (PLSR)
• Validation of calibration models
• Laboratory data and assessing error
• Diagnosis of data problems
• An attempt to standardize statistical/chemometric terms
• Special calibration problems (and solutions)
• The concept of outliers: theory and practice
• Standardization concepts and methods for transfer of calibrations
• Collaborative study problems related to methods and instruments.

We also plan to include in the discussions the important statistical concepts, such as correlation, bias, slope, and associated errors and confidence limits. Beyond this, it is also our hope that readers will write to us with their comments or suggestions for chemometric challenges which confront them. If time and energy permit, we may be able to discuss such issues as neural networks, general factor analysis, clustering techniques, maximizing graphical presentation of data, and signal processing.

THE MULTIVARIATE NORMAL DISTRIBUTION

We will begin with the concept of the multivariate normal distribution. Think of a cigar, suspended in space. If you cannot think of a cigar suspended in space, look at Figure 1-1a. Now imagine the cigar filled with little flecks of stuff, as in Figure 1-1b (it does not really matter what the stuff is, mathematics never concerned itself with such unimportant details). Imagine the flecks being more densely packed toward the middle of the cigar. Now imagine a swarm of gnats surrounding the cigar; if they are attracted to the cigar, then naturally there will be fewer of them far away from the cigar than close to it (Figure 1-1c). Next take away the cigar, and just leave the flecks and the gnats. By this time, of course, you should realize that the flecks and the gnats are really the same thing, and are neither flecks nor gnats but simply abstract representations of points in space. What is left looks like Figure 1-1d.


Figure 1-1 Development of the concept of the Multivariate Normal Distribution (this one shown having three dimensions) – see text for details. The density of points along a cross-section of the distribution in any direction is also an MND, of lower dimension.

Figure 1-1d, of course, is simply a pictorial/graphical representation of what a Multivariate Normal Distribution (MND) would look like, if you could see it. Furthermore, it is a representation of only one particular MND. First of all, this particular MND is a three-dimensional MND. A two-dimensional MND will be represented by points in a plane, and a one-dimensional MND is simply the ordinary Normal distribution that we have come to know and love [16]. An MND can have any number of dimensions; unfortunately we humans cannot visualize anything with more than three dimensions, so for our examples we are limited to such pictures. Also, the MND depicted has a particular shape and orientation. In general, an MND can have a variety of shapes and orientations, depending upon the dispersion of the data along the different axes. Thus, for example, it would not be uncommon for the dispersion along two of the axes to be equal and independent. In this case, which represents one limiting situation, an appropriate cross-section of the MND would be circular rather than elliptical. Another limiting situation, by the way, is for two or more of the variables to be perfectly correlated, in which case the data would lie along a straight line (or plane, or hyperplane as the corresponding higher-dimensional figure is called).

Each point in the MND can be projected onto the planes defined by each pair of the axes of the coordinate system. For example, Figure 1-2 shows the projection of the data onto the plane at the “bottom” of the coordinate system. There it forms a two-dimensional MND, which is characterized by several parameters. The two-dimensional MND is the prototype for all MNDs of higher dimension, and its properties are the key defining characteristics of any MND. First of all, the data contributing to an MND itself has a Normal distribution along any of the


Figure 1-2 Projecting each point of the three-dimensional MND onto any of the planes defined by two axes of the coordinate system (or, more generally, any plane passing through the coordinate system) results in the projected points being represented by a two-dimensional MND. The correlation coefficients for the projections in all planes are needed to fully describe the original MND.

axes of the MND. We have discussed the Normal distribution previously [16], and have seen that it is described by the expression:

$$f(x) = a\,e^{-\left(\frac{x-\bar{x}}{\sigma}\right)^{2}} \tag{1-1}$$

The MND can be mathematically described by an expression that is similar in form, but has the characteristic that each of the individual parts of the expression represents the multivariate analog of the corresponding part of equation 1-1. Thus, for example, where $\bar{x}$ represents the mean of the data for which equation 1-1 describes the distribution, there is a corresponding quantity $\bar{X}$ that represents in matrix notation the fact that for each of the axes shown in Figure 1-1, each datum has a value, and therefore the collection of data has a mean value along each dimension. This quantity, a list of the set of means along all the different dimensions, is called a vector, and is represented as $\bar{X}$ (as opposed to $\bar{x}$, an individual mean).

If we project the MND onto each axis of the coordinate system containing the MND, then as stated above, these projections of the data will be distributed as an ordinary Normal distribution, as shown in Figure 1-3. This distribution will itself then have a standard deviation, so that another defining characteristic of the MND is the standard deviation of the projection of the MND along each axis. This must also then be represented by a vector.

Figure 1-3 Projecting the points onto a line results in a point density that is our familiar Normal Distribution.


The final key point to note about the MND, which can also be seen from Figure 1-2, is the fact that when the MND is projected onto the plane defined by any two axes of the coordinate system the data may show some correlation (as does the data in Figure 1-2). In fact, the projection onto any of the planes defined by two of the axes will have some value for the correlation coefficient between the corresponding pair of variables. The amount of correlation between projections along any pair of axes can vary from zero, in which case the data would lie in a circular blob, to unity, in which case the data would all lie exactly on a straight line. Since each pair of axes defines another plane, many such projections may be possible, depending on the number of dimensions in which the MND exists. Indeed, every possible pair of axes in the coordinate system defines such a plane. As we have noted, we mere mortals cannot visualize more than three dimensions, and so our examples and diagrams will be limited to showing data in three or fewer dimensions, but the mathematical descriptions can be extended with all generality, to as high dimensionality as might be needed.

Thus, the full description of the MND must include all the correlations of the data between every pair of axes. This is conventionally done by creating what is known as the correlation matrix. This matrix is a square matrix, in which any given row or column corresponds to a variable, and the individual positions (i.e., the m, n position for example, where m and n represent indices of the variables) in the matrix represent the correlation between the variable represented by the row it lies in and the variable represented by the column it lies in. In actuality, for mathematical reasons, the correlation itself is not used; rather, a related quantity, the covariance, replaces the correlation coefficient in the matrix.
The elements of the matrix that lie along what is called the main diagonal (i.e., where the column and row numbers are the same) are then the variances (the square of the standard deviation – this shows that there is a rather close relationship between the standard deviation and the correlation) of the data. This matrix is thus called the variance-covariance matrix, and sometimes just the covariance matrix for simplicity.

Since it is necessary to represent the various quantities by vectors and matrices, the operations for the MND that correspond to operations using the univariate (simple) Normal distribution must be matrix operations. Discussion of matrix operations is beyond the scope of this column, but for now it suffices to note that the simple arithmetic operations of addition, subtraction, multiplication, and division all have their matrix counterparts. In addition, certain matrix operations exist which do not have counterparts in simple arithmetic. The beauty of the scheme is that many manipulations of data using matrix operations can be done using the same formalism as for simple arithmetic, since when they are expressed in matrix notation, they follow corresponding rules. However, there is one major exception to this: the commutative rule, whereby for simple arithmetic

A (operation) B = B (operation) A

e.g.:

A + B = B + A
A × B = B × A

does not hold true for matrix multiplication, where in general

A × B ≠ B × A

That is because of the way matrix multiplication is defined. Thus, for this case the order of appearance of the two matrices to be multiplied may provide different matrices as the answer. Thus, instead of f(x) and the expression for it in equation 1-1 describing the simple Normal distribution, the MND is described by the corresponding multivariate expression (1-2):

$$f(X) = K\,e^{-(X-\bar{X})^{T} A\,(X-\bar{X})} \tag{1-2}$$

where now the capital letters $X$ and $\bar{X}$ represent vectors, $K$ is a constant that plays the role of $a$ in equation 1-1, and the capital letter $A$ represents the matrix obtained from the covariance matrix (strictly speaking, the exponent involves the inverse of the covariance matrix). This is, by the way, a somewhat straightforward extension of the definition (although it may not seem so at first glance) because for the simple univariate case the matrix $A$ degenerates into a single number (the number 1 when $x$ is measured in units of its standard deviation), $X$ becomes $x$, and thus the exponent becomes simply the square of $x - \bar{x}$.

Most texts dealing with multivariate statistics have a section on the MND, but a particularly good one, if a bit heavy on the math, is the discussion by Anderson [17]. To help with this a bit, our next few chapters will include a review of some of the elementary concepts of matrix algebra.

Another very useful series of chemometric-related articles has been written by David Coleman and Lynn Vanatta. Their series is on the subject of regression analysis. It has appeared in American Laboratory in a set of over twenty-five articles. Copies of the back articles are available on the web at the URL address found in reference [18].
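The quantities that define an MND in this chapter (the vector of means, the variance-covariance matrix, and the correlations between pairs of axes) can be computed directly from a table of data. The following is only a minimal sketch in plain Python: the function names and the five three-dimensional data points are hypothetical and are not taken from the text, and the sample (n − 1) form of the variance is an assumption.

```python
from math import sqrt

def mean_vector(rows):
    """Return the mean along each axis (one value per column)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def covariance_matrix(rows):
    """Return the sample variance-covariance matrix (n - 1 denominator)."""
    n = len(rows)
    means = mean_vector(rows)
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    p = len(means)
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def correlation_matrix(cov):
    """Divide each covariance by the two standard deviations involved."""
    sd = [sqrt(cov[i][i]) for i in range(len(cov))]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(len(cov))]
            for i in range(len(cov))]

# Hypothetical data: five points in three dimensions (illustration only).
data = [
    [1.0, 2.0, 0.5],
    [2.0, 4.1, 0.4],
    [3.0, 5.9, 0.7],
    [4.0, 8.2, 0.3],
    [5.0, 9.8, 0.6],
]
means = mean_vector(data)        # one mean per axis
cov = covariance_matrix(data)    # variances lie on the main diagonal
corr = correlation_matrix(cov)   # ones lie on the main diagonal
```

The main diagonal of `cov` holds the variances along each axis, and every off-diagonal position (m, n) holds the covariance for the corresponding pair of axes, which is why the matrix is symmetric.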

REFERENCES

1. Mark, H. and Workman, J., Statistics in Spectroscopy (Academic Press, Boston, 1991).
2. Haaland, D. and Thomas, E., Analytical Chemistry 60, 1193–1202 (1988).
3. Haaland, D. and Thomas, E., Analytical Chemistry 60, 1202–1208 (1988).
4. Naes, T. and Martens, H., Multivariate Calibration (John Wiley & Sons, New York, 1989).
5. Mark, H., Principles and Practice of Spectroscopic Calibration (John Wiley & Sons, New York, 1991).
6. Naes, T. and Isaksson, T., “The Chemometric Space”, NIR News (PO Box 10, Selsey, Chichester, West Sussex, PO20 9HR, UK, 1992).
7. Bonate, P.L., “Concepts in Calibration Theory”, LC/GC, 10(4), 310–314 (1992).
8. Bonate, P.L., “Concepts in Calibration Theory”, LC/GC, 10(5), 378–379 (1992).
9. Bonate, P.L., “Concepts in Calibration Theory”, LC/GC, 10(6), 448–450 (1992).
10. Bonate, P.L., “Concepts in Calibration Theory”, LC/GC, 10(7), 531–532 (1992).
11. Miller, J.N., “Calibration Methods in Spectroscopy”, Spectroscopy International 3(2), 42–44 (1991).
12. Miller, J.N., “Calibration Methods in Spectroscopy”, Spectroscopy International 3(4), 41–43 (1991).
13. Miller, J.N., “Calibration Methods in Spectroscopy”, Spectroscopy International 3(5), 43–46 (1991).
14. Miller, J.N., “Calibration Methods in Spectroscopy”, Spectroscopy International 3(6), 45–47 (1991).
15. Miller, J.N., “Calibration Methods in Spectroscopy”, Spectroscopy International 4(1), 41–43 (1992).
16. Mark, H. and Workman, J., “Statistics in Spectroscopy – Part 6 – The Normal Distribution”, Spectroscopy 2(9), 37–44 (1987).
17. Anderson, T.W., An Introduction to Multivariate Statistical Analysis (Wiley, New York, 1958).
18. Coleman, D. and Vanatta, L., Statistics in Analytical Chemistry, International Scientific Communications, Inc. found at http://www.iscpubs.com/articles/index.php?2.

2

Elementary Matrix Algebra: Part 1

You may recall that in the first chapter we promised that a review of elementary matrix algebra would be forthcoming; so the next several chapters will cover this topic all the way from the very basics to the more advanced spectroscopic subjects.

You may already have discovered that the term “matrix” is a fanciful name for a table or list. If you have recently made a grocery list you have created an n × 1 matrix, or in more correct nomenclature, an $X_{n\times 1}$ matrix where n is the number of items you would like to buy (rows) and 1 is the number of columns. If you have become a highly sophisticated shopper and have made lists consisting of one column for Store A and a second one for Store B, you have ascended into the world of the $X_{n\times 2}$ matrix. If you include the price of each item and put brackets around the entire column(s) of prices, you will have created a numerical matrix.

By definition, a numerical matrix is a rectangular array of numbers (termed “elements”) enclosed by square brackets [ ]. Matrices can be used to organize information such as size versus cost in a grocery department, or they may be used to simplify the problems associated with systems or groups of linear equations. Later in this chapter we will introduce the operations involved for linear equations (see Table 2-1 for common symbols used).

Table 2-1 Common symbols used in matrix notation

Item                              Symbol
Matrix∗                           [X] or X
Determinant∗                      |X|
Vectors∗                          x
Scalars∗                          x
Parameters or matrix names        A, B, C, G, H, P, Q, R, S, U, V
Errors and residuals              D, E, F
Addition                          +
Subtraction                       −
Multiplication                    × or •
Division                          ÷ or /
Empty or null set                 ∅
Inverse of a matrix               [X]^−1
Transpose of a matrix             [X]′ or [X]^T
Generalized inverse of a matrix   [X]^−
Identity matrix                   [I] or [1]

∗ Where X or x are represented by any letter, generally those listed under “Parameters or matrix names” in this table.

The symbols below represent a matrix:

$$\begin{bmatrix} a_1 & b_1 \\ a_2 & b_2 \end{bmatrix}$$

Note that $a_1$ and $a_2$ are in column 1, $b_1$ and $b_2$ are in column 2, $a_1$ and $b_1$ are in row 1, and $a_2$ and $b_2$ are in row 2. The above matrix is a 2 × 2 (rows × columns) matrix. The first number indicates the number of rows, and the second indicates the number of columns. Matrices can be denoted as $\mathbf{X}_{2\times 2}$ using a capital, boldface letter with the row and column subscript.

MATRIX OPERATIONS

The following illustrations are useful to describe very basic matrix operations. Discussions covering more advanced matrix operations will be included in later chapters, but for now, just review these elementary operations.

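Matrix addition

To add two matrices, the following operation is performed:

$$\begin{bmatrix} a_1 & b_1 \\ a_2 & b_2 \end{bmatrix} + \begin{bmatrix} c_1 & d_1 \\ c_2 & d_2 \end{bmatrix} = \begin{bmatrix} a_1 + c_1 & b_1 + d_1 \\ a_2 + c_2 & b_2 + d_2 \end{bmatrix}$$

To add larger matrices, the following operation applies:

$$\begin{bmatrix} a_1 & b_1 & c_1 & d_1 \\ a_2 & b_2 & c_2 & d_2 \end{bmatrix} + \begin{bmatrix} e_1 & f_1 & g_1 & h_1 \\ e_2 & f_2 & g_2 & h_2 \end{bmatrix} = \begin{bmatrix} a_1 + e_1 & b_1 + f_1 & c_1 + g_1 & d_1 + h_1 \\ a_2 + e_2 & b_2 + f_2 & c_2 + g_2 & d_2 + h_2 \end{bmatrix}$$

Subtraction

For subtraction, use the following operations:

$$\begin{bmatrix} a_1 & b_1 \\ a_2 & b_2 \end{bmatrix} - \begin{bmatrix} c_1 & d_1 \\ c_2 & d_2 \end{bmatrix} = \begin{bmatrix} a_1 - c_1 & b_1 - d_1 \\ a_2 - c_2 & b_2 - d_2 \end{bmatrix}$$

The same operation holds true for larger matrices such as

$$\begin{bmatrix} a_1 & b_1 & c_1 & d_1 \\ a_2 & b_2 & c_2 & d_2 \end{bmatrix} - \begin{bmatrix} e_1 & f_1 & g_1 & h_1 \\ e_2 & f_2 & g_2 & h_2 \end{bmatrix} = \begin{bmatrix} a_1 - e_1 & b_1 - f_1 & c_1 - g_1 & d_1 - h_1 \\ a_2 - e_2 & b_2 - f_2 & c_2 - g_2 & d_2 - h_2 \end{bmatrix}$$

and so on.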
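Addition and subtraction as described above are carried out element by element, which can be sketched in a few lines of plain Python. The function names `mat_add` and `mat_sub` and the numerical values are hypothetical, chosen only for illustration; matrices are represented here as lists of rows.

```python
def _check_same_shape(x, y):
    """Raise an error unless x and y have identical dimensions."""
    if len(x) != len(y) or any(len(r) != len(s) for r, s in zip(x, y)):
        raise ValueError("matrices must have the same dimensions")

def mat_add(x, y):
    """Add two matrices element by element."""
    _check_same_shape(x, y)
    return [[a + b for a, b in zip(r, s)] for r, s in zip(x, y)]

def mat_sub(x, y):
    """Subtract matrix y from matrix x element by element."""
    _check_same_shape(x, y)
    return [[a - b for a, b in zip(r, s)] for r, s in zip(x, y)]

# Hypothetical 2 x 2 matrices (illustration only).
X = [[1, 2], [3, 4]]
Y = [[10, 20], [30, 40]]
total = mat_add(X, Y)       # [[11, 22], [33, 44]]
difference = mat_sub(Y, X)  # [[9, 18], [27, 36]]
```

The shape check reflects the requirement, implicit in the formulas, that both matrices have the same number of rows and columns.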


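Matrix multiplication

To multiply a scalar by a matrix (or a vector) we use

$$A \times \begin{bmatrix} a_1 & b_1 \\ a_2 & b_2 \end{bmatrix} = \begin{bmatrix} A \times a_1 & A \times b_1 \\ A \times a_2 & A \times b_2 \end{bmatrix}$$

where A is a scalar value.

The product of two matrices (or vectors) is given by

$$\begin{bmatrix} a_1 & b_1 \\ a_2 & b_2 \end{bmatrix} \times \begin{bmatrix} c_1 & d_1 \\ c_2 & d_2 \end{bmatrix} = \begin{bmatrix} a_1 c_1 + b_1 c_2 & a_1 d_1 + b_1 d_2 \\ a_2 c_1 + b_2 c_2 & a_2 d_1 + b_2 d_2 \end{bmatrix}$$

In another example, in which an $X_{1\times 2}$ matrix is multiplied by an $X_{2\times 1}$ matrix, we have:

$$\begin{bmatrix} a_1 & b_1 \end{bmatrix} \times \begin{bmatrix} a_2 \\ b_2 \end{bmatrix} = \begin{bmatrix} a_1 a_2 + b_1 b_2 \end{bmatrix}$$

denoted by $X_{1\times 2} \times X_{2\times 1}$ in matrix notation.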
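The row-by-column rule for the product of two matrices can be sketched as follows. This is a minimal illustration in plain Python, not a production routine; the function name `mat_mul` and the numerical matrices `P` and `Q` are hypothetical. The example also shows that reversing the order of the two matrices generally changes the answer.

```python
def mat_mul(x, y):
    """Multiply x (r x n) by y (n x c): each element of the result is the
    sum of products of a row of x with a column of y."""
    if any(len(row) != len(y) for row in x):
        raise ValueError("columns of the first matrix must equal "
                         "rows of the second matrix")
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]

# Hypothetical 2 x 2 matrices (illustration only).
P = [[1, 2], [3, 4]]
Q = [[0, 1], [1, 0]]
PQ = mat_mul(P, Q)  # [[2, 1], [4, 3]]
QP = mat_mul(Q, P)  # [[3, 4], [1, 2]]

# A 1 x 2 matrix times a 2 x 1 matrix gives a 1 x 1 matrix.
row_times_column = mat_mul([[1, 2]], [[3], [4]])  # [[11]]
```

Because `PQ` and `QP` differ, the order in which the matrices appear matters, in contrast to the multiplication of ordinary numbers.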

Matrix division

Division of a matrix by a scalar is accomplished as follows:

$$\begin{bmatrix} a_1 & b_1 \\ a_2 & b_2 \end{bmatrix} \div A = \begin{bmatrix} a_1/A & b_1/A \\ a_2/A & b_2/A \end{bmatrix}$$

where A is a scalar value.

Inverse of a matrix

The inverse of a matrix is the conceptual equivalent of its reciprocal. Therefore if we denote our matrix by X, then the inverse of X is denoted as $X^{-1}$ and the following relationship holds:

$$X \times X^{-1} = [1] = X^{-1} \times X$$

where [1] is an identity matrix. Only square matrices, which have an equal number of rows and columns (for example, 2 × 2, 3 × 3, 4 × 4, etc.), can have inverses, and a square matrix has an inverse only if it is nonsingular. Several computer packages provide the algorithms for calculating the inverse of square matrices. The identity matrix for a 2 × 2 matrix is

$$[1]_{2\times 2} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

and for a 3 × 3 matrix, the identity matrix is

$$[1]_{3\times 3} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

and so on. Note that the diagonal is always composed of ones for the identity matrix, and all other values are zero. To summarize, by definition:

$$X_{2\times 2} \times X_{2\times 2}^{-1} = [1]_{2\times 2}$$

The basic methods for calculating $X^{-1}$ will be addressed in the next chapter.

Transpose of a matrix

The transpose of a matrix is denoted by $X'$ (or, alternatively, by $X^{T}$). For example, for the matrix:

$$[X] = \begin{bmatrix} a_1 & b_1 & c_1 \\ a_2 & b_2 & c_2 \end{bmatrix}$$

then

$$[X]' = \begin{bmatrix} a_1 & a_2 \\ b_1 & b_2 \\ c_1 & c_2 \end{bmatrix}$$

The first column of [X] becomes the first row of [X]′; the second column of [X] becomes the second row of [X]′; the third column of [X] becomes the third row of [X]′; and so on.
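The rule that columns become rows can be written very compactly. The sketch below assumes matrices stored as lists of rows; the function name `transpose` and the numbers standing in for a1 ... c2 are hypothetical.

```python
def transpose(x):
    """Return the transpose: column k of x becomes row k of the result."""
    return [list(column) for column in zip(*x)]

# Hypothetical 2 x 3 matrix standing in for [X] (illustration only).
X = [[1, 2, 3],
     [4, 5, 6]]
X_t = transpose(X)  # 3 x 2: [[1, 4], [2, 5], [3, 6]]
```

Transposing twice returns the original matrix, which is a convenient check on any implementation.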

ELEMENTARY OPERATIONS FOR LINEAR EQUATIONS

To solve problems involving calibration equations using multivariate linear models, we need to be able to perform elementary operations on sets or systems of linear equations. So before using our newly discovered powers of matrix algebra, let us solve a problem using the algebra many of us learned very early in life. The elementary operations used for manipulating linear equations include three simple rules [1, 2]:

• Equations can be listed in any order for convenience and organizational purposes.
• Any equation may be multiplied by any real number other than zero.
• Any equation in a series of equations can be replaced by the sum of itself and any other equation in the system.

As an example, we can illustrate these operations using the three equations below as part of what is termed an “equation system” or simply a “system” (equations 2-1 through 2-3):

$$1a_1 + 1b_1 = -2 \tag{2-1}$$

$$4a_1 + 2b_1 + c_1 = 6 \tag{2-2}$$

$$6a_1 - 2b_1 - 4c_1 = 14 \tag{2-3}$$

To solve this system of three equations, we begin by following the three elementary operations rules above:

• We can rearrange the equations in any order. In our case the equations happen to be in a useful order.
• We decide to multiply equation 2-1 by a factor such that the coefficients of $a_1$ are of opposite sign and of the same absolute value for equations 2-1 and 2-2. Therefore, we multiply equation 2-1 by −4 to yield

$$-4a_1 - 4b_1 = 8 \tag{2-4}$$

• We can eliminate $a_1$ between the first and the second equations by adding equations 2-4 and 2-2 to give equation 2-5:

$$(-4a_1 - 4b_1 = 8) + (4a_1 + 2b_1 + c_1 = 6) \;\Rightarrow\; -2b_1 + c_1 = 14 \tag{2-5}$$

and we bring equation 2-1 back into the system by dividing equation 2-4 by −4 to get

$$a_1 + b_1 = -2 \tag{2-6}$$

$$-2b_1 + c_1 = 14 \tag{2-7}$$

$$6a_1 - 2b_1 - 4c_1 = 14 \tag{2-8}$$

Now to eliminate the $a_1$ term between equations 2-6 and 2-8, we multiply equation 2-6 by −6 to yield

$$-6a_1 - 6b_1 = 12 \tag{2-9}$$

Then we add equation 2-9 to equation 2-8:

$$(-6a_1 - 6b_1 = 12) + (6a_1 - 2b_1 - 4c_1 = 14) \;\Rightarrow\; -8b_1 - 4c_1 = 26 \tag{2-10}$$

Now we bring back equation 2-6 in its original form by dividing equation 2-9 by −6, and our system of equations looks like this:

$$a_1 + b_1 = -2 \tag{2-11}$$

$$-2b_1 + c_1 = 14 \tag{2-12}$$

$$-8b_1 - 4c_1 = 26 \tag{2-13}$$

We can eliminate the $b_1$ term from equations 2-12 and 2-13 by multiplying equation 2-12 by −8 and equation 2-13 by 2 to obtain

$$16b_1 - 8c_1 = -112 \tag{2-14}$$

$$-16b_1 - 8c_1 = 52 \tag{2-15}$$

Adding these equations, we find

$$-16c_1 = -60 \tag{2-16}$$

Restore equation 2-7 by dividing equation 2-14 by −8 to yield

$$a_1 + b_1 = -2 \tag{2-17}$$

$$-2b_1 + c_1 = 14 \tag{2-18}$$

$$-16c_1 = -60 \tag{2-19}$$

The solution

Solving for $c_1$, we find $c_1 = (-60/-16) = 3.75$. Substituting $c_1$ into equation 2-18, we obtain $-2b_1 + 3.75 = 14$. Solving this for $b_1$, we find $b_1 = -5.125$, or −5.13 after rounding. Substituting $b_1$ into equation 2-17, we find $a_1 + (-5.13) = -2$. Solving this for $a_1$, we find $a_1 = 3.13$ (3.125 before rounding). Finally,

$$a_1 = 3.13 \qquad b_1 = -5.13 \qquad c_1 = 3.75$$

A system of equations in which the first unknown is missing from all equations after the first, and the second unknown is missing from all equations after the second, is said to be in echelon form. Every set or equation system comprised of linear equations can be brought into echelon form by using elementary algebraic operations. The use of augmented matrices can accomplish the task of solving the equation system just illustrated.
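The procedure just carried out by hand (reduce the system to echelon form, then substitute back from the last equation upward) can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `solve_by_elimination` is hypothetical, the system is assumed to be square with a unique solution, and exact fractions are used so that the unrounded answers are visible.

```python
from fractions import Fraction

def solve_by_elimination(augmented):
    """Solve a square linear system given as rows [coefficients..., constant]."""
    m = [[Fraction(v) for v in row] for row in augmented]
    n = len(m)
    for col in range(n):
        # Rule 1: reorder so that the pivot row has a nonzero coefficient.
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            raise ValueError("system has no unique solution")
        m[col], m[pivot] = m[pivot], m[col]
        # Rules 2 and 3: subtract a multiple of the pivot row from rows below.
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    # Back-substitution, starting from the last equation.
    solution = [Fraction(0)] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (m[r][n] - known) / m[r][r]
    return solution

# Equations 2-1 through 2-3, one row per equation.
a1, b1, c1 = solve_by_elimination([[1, 1, 0, -2],
                                   [4, 2, 1, 6],
                                   [6, -2, -4, 14]])
```

The exact results are a1 = 25/8, b1 = −41/8, and c1 = 15/4, which round to the values 3.13, −5.13, and 3.75 quoted in the text.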

For our previous example, the original equations

$$a_1 + b_1 = -2 \tag{2-20}$$

$$4a_1 + 2b_1 + c_1 = 6 \tag{2-21}$$

$$6a_1 - 2b_1 - 4c_1 = 14 \tag{2-22}$$

can be written in augmented matrix form as:

$$\begin{bmatrix} 1 & 1 & 0 & -2 \\ 4 & 2 & 1 & 6 \\ 6 & -2 & -4 & 14 \end{bmatrix} \tag{2-23}$$

The echelon form of the equations can also be put into matrix form as follows.

Echelon form:

$$a_1 + b_1 = -2 \tag{2-24}$$

$$-2b_1 + c_1 = 14 \tag{2-25}$$

$$-16c_1 = -60 \tag{2-26}$$

Matrix form:

$$\begin{bmatrix} 1 & 1 & 0 & -2 \\ 0 & -2 & 1 & 14 \\ 0 & 0 & -16 & -60 \end{bmatrix} \tag{2-27}$$

SUMMARY

In this chapter, we have used elementary operations for linear equations to solve a problem. The three rules listed for these operations have a parallel set of three rules used for elementary matrix operations on linear equations. In our next chapter we will explore the rules for solving a system of linear equations by using matrix techniques.

REFERENCES

1. Kowalski, B.R., Recommendations to IUPAC Chemometrics Society (Laboratory for Chemometrics, Department of Chemistry, BG-10, University of Washington, Seattle, WA 98195; 1985), pp. 1–2.
2. Britton, J.R. and Bello, I., Topics in Contemporary Mathematics (Harper & Row, New York, 1984), pp. 408–457.


3 Elementary Matrix Algebra: Part 2

ELEMENTARY MATRIX OPERATIONS

To solve the set of linear equations introduced in our previous chapter (referenced as [1]), we will now use elementary matrix operations. These matrix operations have a set of rules which parallel the rules used for elementary algebraic operations used for solving systems of linear equations. The rules for elementary matrix operations are as follows [2]:

1) Rows can be listed in any order for convenience or organizational purposes.
2) All elements within a row may be multiplied by any real number other than zero.
3) Any row can be replaced by the element-by-element sum of itself and any other row.

To solve a system of equations, our first step is to put zeros into the second and the third rows of the first column, and into the third row of the second column. For our exercise we will bring forward equations 2-1 through 2-3 as equation set 3-1:

$$\begin{aligned} 1a_1 + 1b_1 &= -2 \\ 4a_1 + 2b_1 + 1c_1 &= 6 \\ 6a_1 - 2b_1 - 4c_1 &= 14 \end{aligned} \tag{3-1}$$

We can put the above set or system of equations in matrix notation as:

$$A = \begin{bmatrix} 1 & 1 & 0 \\ 4 & 2 & 1 \\ 6 & -2 & -4 \end{bmatrix} \qquad B = \begin{bmatrix} a_1 \\ b_1 \\ c_1 \end{bmatrix} \qquad C = \begin{bmatrix} -2 \\ 6 \\ 14 \end{bmatrix}$$

and so,

AB = C or A • B = C

Matrix A is termed the “matrix of the equation system”. The matrix formed by [A|C] is termed the “augmented matrix”. For this problem the augmented matrix is given as:

$$[A|C] = \left[\begin{array}{ccc|c} 1 & 1 & 0 & -2 \\ 4 & 2 & 1 & 6 \\ 6 & -2 & -4 & 14 \end{array}\right]$$

Now if we were to find a set of equations with zeros in the second and the third rows of the first column, and in the third row of the second column, we could use equations 2-17 through 2-19 [1], which look like (equation set 3-2):

$$\begin{aligned} a_1 + b_1 &= -2 \\ -2b_1 + c_1 &= 14 \\ -16c_1 &= -60 \end{aligned} \tag{3-2}$$

We can rewrite these equations in matrix notation as:

$$G = \begin{bmatrix} 1 & 1 & 0 \\ 0 & -2 & 1 \\ 0 & 0 & -16 \end{bmatrix} \qquad H = \begin{bmatrix} a_1 \\ b_1 \\ c_1 \end{bmatrix} \qquad P = \begin{bmatrix} -2 \\ 14 \\ -60 \end{bmatrix}$$

and the augmented form of the above matrices is written as:

$$[G|P] = \left[\begin{array}{ccc|c} 1 & 1 & 0 & -2 \\ 0 & -2 & 1 & 14 \\ 0 & 0 & -16 & -60 \end{array}\right]$$

For convenience, we can reduce or simplify the third row in [G|P] by following Rule 2 of the basic matrix operations previously mentioned. As such we can multiply row III in [G|P] by 1/2 to give

$$[G|P] = \left[\begin{array}{ccc|c} 1 & 1 & 0 & -2 \\ 0 & -2 & 1 & 14 \\ 0 & 0 & -8 & -30 \end{array}\right]$$

We can use elementary row operations, also known as elementary matrix operations, to obtain matrix [G|P] from [A|C]. By the way, if we can achieve [G|P] from [A|C] using these operations, the matrices are termed “row equivalent”, denoted by $X_1 \sim X_2$. To illustrate the use of elementary matrix operations, let us use the following example. Our original augmented matrix [A|C] above can be manipulated to yield zeros in rows II and III of column I by a series of row operations. The example below illustrates this:

$$\left[\begin{array}{ccc|c} 1 & 1 & 0 & -2 \\ 4 & 2 & 1 & 6 \\ 6 & -2 & -4 & 14 \end{array}\right] \sim \left[\begin{array}{ccc|c} 1 & 1 & 0 & -2 \\ 0 & -2 & 1 & 14 \\ 0 & -8 & -4 & 26 \end{array}\right]$$

The left-hand augmented matrix is converted to the right-hand augmented matrix by II/II − 4I, that is, row II is replaced by row II minus 4 times row I. Then III/III − 6I, that is, row III is replaced by row III minus 6 times row I. To complete the row operations to yield [G|P] from [A|C] we write

$$\left[\begin{array}{ccc|c} 1 & 1 & 0 & -2 \\ 0 & -2 & 1 & 14 \\ 0 & -8 & -4 & 26 \end{array}\right] \sim \left[\begin{array}{ccc|c} 1 & 1 & 0 & -2 \\ 0 & -2 & 1 & 14 \\ 0 & 0 & -8 & -30 \end{array}\right]$$

This is accomplished by III/III − 4II, that is, row III is replaced by row III minus 4 times row II. As we have just shown, using two series of row operations we have

$$\left[\begin{array}{ccc|c} 1 & 1 & 0 & -2 \\ 0 & -2 & 1 & 14 \\ 0 & 0 & -8 & -30 \end{array}\right]$$

which is equivalent to equations 2-17 through 2-19, and to equation set 3-2 above; this is shown here as equation set 3-3:

$$\begin{aligned} a_1 + b_1 &= -2 \\ -2b_1 + c_1 &= 14 \\ -8c_1 &= -30 \end{aligned} \tag{3-3}$$

Now, solving for $c_1 = -30/-8 = 3.75$; substituting $c_1$ into equation 2-18, we find $-2b_1 + 3.75 = 14$, therefore $b_1 = -5.13$; and substituting $b_1$ into equation 2-17, we find $a_1 + (-5.13) = -2$, therefore $a_1 = 3.13$; and so,

$$a_1 = 3.13 \qquad b_1 = -5.13 \qquad c_1 = 3.75$$

Thus matrix operations provide a simplified method for solving equation systems as compared to elementary algebraic operations for linear equations.
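The three row operations used above (II − 4I, III − 6I, and III − 4II) can be replayed on the augmented matrix in a short sketch. The helper name `add_multiple` is hypothetical and the list-of-rows representation is an assumption made only for illustration; rows are indexed from zero in the code, so row I is index 0.

```python
from fractions import Fraction

def add_multiple(matrix, target, source, factor):
    """Return a copy of matrix in which row `target` is replaced by
    row[target] + factor * row[source]."""
    result = [row[:] for row in matrix]
    result[target] = [t + factor * s
                      for t, s in zip(matrix[target], matrix[source])]
    return result

# Augmented matrix [A|C] for equation set 3-1.
augmented = [[1, 1, 0, -2],
             [4, 2, 1, 6],
             [6, -2, -4, 14]]

step1 = add_multiple(augmented, 1, 0, -4)  # II  <- II  - 4 I
step2 = add_multiple(step1, 2, 0, -6)      # III <- III - 6 I
step3 = add_multiple(step2, 2, 1, -4)      # III <- III - 4 II

# Back-substitution through the echelon form, using exact fractions.
c1 = Fraction(step3[2][3], step3[2][2])
b1 = (step3[1][3] - step3[1][2] * c1) / step3[1][1]
a1 = (step3[0][3] - step3[0][1] * b1 - step3[0][2] * c1) / step3[0][0]
```

After the third operation, `step3` matches the echelon matrix shown in the text, and the exact solution (25/8, −41/8, 15/4) agrees with the rounded values 3.13, −5.13, and 3.75.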

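CALCULATING THE INVERSE OF A MATRIX

In Chapter 2, we promised to show the steps involved in taking the inverse of a matrix. Given a 2 × 2 matrix $X_{2\times 2}$, how is the inverse calculated? We can ask the question another way: “What matrix, when multiplied by a given matrix $X_{r\times c}$, will give the identity matrix ([I])?” In matrix form we may write a specific example as:

$$\begin{bmatrix} -2 & 1 \\ -3 & 2 \end{bmatrix} \sim \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

Therefore,

$$\begin{bmatrix} -2 & 1 \\ -3 & 2 \end{bmatrix} \times \begin{bmatrix} c_1 & d_1 \\ c_2 & d_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = [1]$$

or stated in matrix notation as A × B = I, where B is the inverse matrix of A, and [I] is the identity matrix.

By multiplying A × B we can calculate the two basic equation systems to use in solving this problem as:

System 1:

$$\begin{aligned} -2c_1 + 1c_2 &= 1 \\ -3c_1 + 2c_2 &= 0 \end{aligned}$$

System 2:

$$\begin{aligned} -2d_1 + 1d_2 &= 0 \\ -3d_1 + 2d_2 &= 1 \end{aligned}$$

The augmented matrices are denoted as:

$$\left[\begin{array}{cc|cc} -2 & 1 & 1 & 0 \\ -3 & 2 & 0 & 1 \end{array}\right]$$

The preceding matrix is reduced to echelon form (a zero in the second row of column one) by

$$\left[\begin{array}{cc|cc} -2 & 1 & 1 & 0 \\ -3 & 2 & 0 & 1 \end{array}\right] \sim \left[\begin{array}{cc|cc} -2 & 1 & 1 & 0 \\ 0 & -1 & 3 & -2 \end{array}\right]$$

The row operation is II/3I − 2II, that is, row II is replaced by three times row I minus two times row II. The next steps are as follows:

$$\left[\begin{array}{cc|cc} -2 & 1 & 1 & 0 \\ 0 & -1 & 3 & -2 \end{array}\right] \sim \left[\begin{array}{cc|cc} -2 & 0 & 4 & -2 \\ 0 & -1 & 3 & -2 \end{array}\right] \sim \left[\begin{array}{cc|cc} 1 & 0 & -2 & 1 \\ 0 & 1 & -3 & 2 \end{array}\right]$$

with row operations (I/I + II) and then I/−1/2 I, together with multiplying row II by −1 so that the left-hand block becomes the identity matrix.

Thus $c_1 = -2$, $c_2 = -3$, $d_1 = 1$, and $d_2 = 2$. So $B = A^{-1}$ (inverse of A) and

$$A^{-1} = \begin{bmatrix} -2 & 1 \\ -3 & 2 \end{bmatrix}$$

So now we check our work by multiplying $A \bullet A^{-1}$ as follows:

$$A \times A^{-1} = \begin{bmatrix} -2 & 1 \\ -3 & 2 \end{bmatrix} \times \begin{bmatrix} -2 & 1 \\ -3 & 2 \end{bmatrix} = \begin{bmatrix} (-2)(-2) + (1)(-3) & (-2)(1) + (1)(2) \\ (-3)(-2) + (2)(-3) & (-3)(1) + (2)(2) \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = [1]$$

By coincidence, we have found a matrix which when multiplied by itself gives the identity matrix or, saying it another way, it is its own inverse. Of course, that does not generally happen; a matrix and its inverse are usually different.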
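The same idea (augment the matrix with the identity matrix, then apply row operations until the left-hand block becomes the identity) can be sketched as a general routine. This is a minimal illustration rather than a numerically robust algorithm; the function name `inverse` is hypothetical, and exact fractions are used to avoid rounding. The sequence of row operations differs from the hand calculation above, but the resulting inverse is the same.

```python
from fractions import Fraction

def inverse(matrix):
    """Invert a square matrix by row-reducing [matrix | identity]."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("only square matrices can have an inverse")
    aug = [[Fraction(v) for v in row] +
           [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Rule 1: bring a row with a nonzero entry into the pivot position.
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular; no inverse exists")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Rule 2: scale the pivot row so that the pivot element equals 1.
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Rule 3: clear the rest of the column.
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    # The right-hand block now holds the inverse.
    return [row[n:] for row in aug]

A = [[-2, 1], [-3, 2]]
A_inv = inverse(A)
```

For the matrix A of this example the routine returns the same matrix again, confirming that A is its own inverse; for most matrices the result will of course differ from the input.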

SUMMARY

Hopefully Chapters 1 and 2 have refreshed your memory of early studies in matrix algebra. In this chapter we have tried to review the basic steps used to solve a system of linear equations using elementary matrix algebra. In addition, basic row operations were used to calculate the inverse of a matrix. In the next chapter we will address the matrix nomenclature used for a simple case of multiple linear regression.

REFERENCES

1. Workman, J., Jr. and Mark, H., Spectroscopy 8(7), 16–19 (1993).
2. Britton, J.R. and Bello, I., Topics in Contemporary Mathematics (Harper & Row, New York, 1984), pp. 408–457.


4

Matrix Algebra and Multiple Linear Regression: Part 1

In a previous chapter we noted that by augmenting the matrix of coefficients with the unit matrix (i.e., one that has all the members equal to zero except on the main diagonal, where the members of the matrix equal unity), we could arrive at the solution to the simultaneous equations that were presented. Since simultaneous equations are, in one sense, a special case of regression (i.e., the case where there are no degrees of freedom for error), it is still appropriate to discuss a few odds and ends that were left dangling. We started in the previous chapter with the set of simultaneous equations:

$$1a + 1b + 0c = -2 \tag{4-1a}$$

$$4a + 2b + 1c = 6 \tag{4-1b}$$

$$6a - 2b - 4c = 14 \tag{4-1c}$$

(where we now leave the subscripts off the variables for simplicity, with no loss of generality for our current purposes. Also note that here we write all the coefficients out explicitly, even when the ones and zeroes do not necessarily appear in the original equations; this is so that they will not be inadvertently left out of the matrix expressions, where the “place filling” function must be performed), and we noted that we could express these equations in matrix notation as:

$$A = \begin{bmatrix} 1 & 1 & 0 \\ 4 & 2 & 1 \\ 6 & -2 & -4 \end{bmatrix} \qquad B = \begin{bmatrix} a \\ b \\ c \end{bmatrix} \qquad C = \begin{bmatrix} -2 \\ 6 \\ 14 \end{bmatrix}$$

where the equations then take the matrix form:

$$A * B = C \tag{4-2}$$

The question here is, how did we get from equations 4-1 to equation 4-2? The answer is that it is not at all obvious, even in such a simple and straightforward case, how to break up a group of algebraic equations into their equivalent matrix expression. It turns out, however, that going in the other direction is often much simpler and more straightforward. Thus, when setting up matrix expressions, it is often desirable to run a check on the work to verify that the matrix expression indeed correctly represents the algebraic expression of interest. In the current case, this can be done very simply by carrying out the matrix multiplication indicated on the left-hand side of equation 4-2. Thus, expanding the matrix expression AB into its full representation, we obtain

$$\begin{bmatrix} 1 & 1 & 0 \\ 4 & 2 & 1 \\ 6 & -2 & -4 \end{bmatrix} \times \begin{bmatrix} a \\ b \\ c \end{bmatrix} \tag{4-3}$$

From our previous chapter defining the elementary matrix operations, we recall the operation for multiplying two matrices: the i, j element of the result matrix (where i and j represent the row and the column of an element in the matrix, respectively) is the sum of cross-products of the ith row of the first matrix and the jth column of the second matrix (this is the reason that the order of appearance of the matrices matters when they are multiplied: if the indicated ith row and jth column do not have the same number of elements, the matrices cannot be multiplied). Now let us apply this definition to the pair of matrices listed above. The first matrix (A) has three rows and three columns. The second matrix (B) has three rows and one column. Since each row of A has three elements, and the single column of B has three elements, matrix multiplication is possible. The resulting matrix will have three rows, each row resulting from one of the rows of matrix A, and one column, corresponding to the single column in the matrix B. Thus the first row of the result matrix will have the single element resulting from the sum-of-products of the first row of A times the column of B, which will be

\[ 1a + 1b + 0c \tag{4-4} \]

Similarly the second row of the result matrix will have the single element resulting from the sum-of-products of the second row of A times the column of B, which will be

\[ 4a + 2b + 1c \tag{4-5} \]

and the third row of the result matrix will have the single element resulting from the sum-of-products of the third row of A times the column of B, which will be

\[ 6a + (-2)b + (-4)c \tag{4-6} \]

or, simplifying:

\[ 6a - 2b - 4c \tag{4-7} \]

The entire matrix product, then, is

\[ AB = \begin{bmatrix} 1a + 1b + 0c \\ 4a + 2b + 1c \\ 6a - 2b - 4c \end{bmatrix} \]

Equations 4-4, 4-5, and 4-6 represent the three elements of the matrix product of A and B. Note that each row of this resulting matrix contains only one element, even though each of these elements is the result of a fairly extensive sequence of arithmetic operations. Equations 4-4, 4-5, and 4-7, however, represent the symbolism you would normally expect to see when looking at the set of simultaneous equations that these matrix expressions replace. Note further that this matrix product AB is the same as the entire left-hand side of the set of simultaneous equations that we originally set out to solve. Thus we have shown that these matrix expressions can be readily verified through straightforward application of the basic matrix operations, thus clearing up one of the loose ends we had left.
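The row-by-column rule just described can also be tried numerically. The following is a minimal sketch in Python, not a production routine; the function name `mat_mul` and the trial values chosen for a, b and c are our own illustrative assumptions, not part of the original problem.

```python
def mat_mul(first, second):
    """Multiply two matrices stored as lists of rows (sum of cross-products)."""
    inner = len(second)
    # Each row of the first matrix must be as long as a column of the second.
    if any(len(row) != inner for row in first):
        raise ValueError("rows of the first matrix must match columns of the second")
    n_cols = len(second[0])
    return [
        [sum(row[k] * second[k][j] for k in range(inner)) for j in range(n_cols)]
        for row in first
    ]


A = [[1, 1, 0],
     [4, 2, 1],
     [6, -2, -4]]

# Hypothetical trial values for a, b, c, chosen only to exercise the rule.
B = [[2], [3], [5]]

product = mat_mul(A, B)   # a 3 x 1 result: one element per row of A
print(product)
```

With these trial values the three elements are 1(2) + 1(3) + 0(5), 4(2) + 2(3) + 1(5) and 6(2) − 2(3) − 4(5), exactly the pattern of equations 4-4, 4-5 and 4-7.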


Another loose end is the relationship between the quasi-algebraic expressions that matrix operations are normally written in and the computations that are used to implement those relationships. The computations themselves have been covered at some length in the previous two chapters [1, 2]. To relate these to the quasi-algebraic operations that matrices are subject to, let us look at those operations a bit more closely.

QUASI-ALGEBRAIC OPERATIONS

Thus, considering equation 4-2, we note that the matrix expression looks like a simple algebraic expression relating the product of two variables to a third variable, even though in this case the “variables” in question are entire matrices. In equation 4-2, the matrix B represents the unknown quantities in the original simultaneous equations. If equation 4-2 were a simple algebraic equation, clearly the solution would be to divide both sides of this equation by A, which would result in the equation B = C/A. Since A and C both represent known quantities, a simple calculation would give the solution for the unknown B. There is no defined operation of division for matrices. However, a comparable result can be obtained by multiplying both sides of an equation (such as equation 4-2) by the inverse of matrix A. The inverse (of matrix A, for example) is conventionally written as A⁻¹. Thus, the symbolic solution to equation 4-2 is generated by multiplying both sides of equation 4-2 by A⁻¹:

\[ A^{-1}AB = A^{-1}C \tag{4-8} \]

There are a couple of key points to note about this operation. The main point is that since the order of appearance of the matrices matters, it is important that the new matrix, the one we are multiplying both sides of the equation by, is placed at the beginning of the expressions on each side of the equation. The second key point is the accomplishment of a desired goal: on the left-hand side of equation 4-8 we have the expression A⁻¹A. We noted earlier that the key defining characteristic of the inverse of a matrix is the fact that when it is multiplied by the original matrix (the one that it is the inverse of), the result is a unit matrix. Thus equation 4-8 is equivalent to

\[ [1]B = A^{-1}C \tag{4-9} \]

where [1] represents the unit matrix. Since the property of the unit matrix is that when multiplied by any other matrix, the result is the same as the other matrix, then [1]B = B, and equation 4-9 becomes

\[ B = A^{-1}C \tag{4-10} \]

Thus we have symbolically solved equation 4-2 for the unknown matrix B, the elements of which are the unknown variables of the original set of simultaneous equations. Performing the matrix multiplication of A⁻¹C will then provide the values of these unknown variables.


Let us examine these symbolic transformations with a view toward seeing how they translate into the required arithmetic operations that will provide the answers to the original simultaneous equations. There are two key operations involved. The first is the inversion of the matrix, to provide the inverse matrix. This is an extremely intensive computational task, so much so that it is in general done only on computers, except in the simplest cases for pedagogical purposes, such as we did in our previous chapter. In this regard we are reminded of an old, and somewhat famous, cartoon, where two obviously professor-type characters are staring at a large blackboard. On the left side of the blackboard are a large number of mathematical symbols, obviously representing some complicated and abstruse mathematical derivations. On the right side of the blackboard is a similar set of symbols. In the middle of the blackboard is a large blank space, in the middle of which is written, in big letters: “AND THEN SOME MAGIC HAPPENS”, and one of the characters is saying to the other: “I think you need to be a bit more explicit here in step 10.” To some extent, we feel the same way about matrix inversions. The complications and amount of computation involved in actually doing a matrix inversion are enough to make even the most intrepid mathematician/statistician/chemometrician run for the nearest computer with a preprogrammed algorithm for the task. Indeed, there sometimes seem to be just about as many algorithms for performing a matrix inversion as there are people interested in doing them. In most cases, then, this process is in practice treated as a “black box” where “some magic happens”. Except for the theoretical mathematician, however, there is usually little interest in “being more explicit”, as long as the program gives the right answer. As is our wont, however, our previous chapter worked out the gory details for the simplest possible case, the case of a 2 × 2 matrix. 
For larger matrices, the amount of computation increases so rapidly with matrix size that even the 3 × 3 matrix is left to the computer to handle. But how can we tell then if the answer is correct? Well, there is a way, and one that is not too overwhelming. From the definition of the inverse of a matrix, you should obtain a unit matrix if you multiply the inverse of a given matrix by the matrix itself. In our previous chapter [1] we showed this for the 2 × 2 case. For the simultaneous equations at hand, however, the process is only a little more extensive. From the original matrix of coefficients in the simultaneous equations that we are working with, the one called A above, we find that the inverse of this matrix is

\[ A^{-1} = \begin{bmatrix} -0.375 & 0.25 & 0.0625 \\ 1.375 & -0.25 & -0.0625 \\ -1.25 & 0.5 & -0.125 \end{bmatrix} \tag{4-11} \]

How did we find this? Well, we used some of our magic. The details of the computations needed were described in the previous chapter, for the 2 × 2 case; we will not even try to go through the computations needed for the 3 × 3 case we concern ourselves with here. However, having a set of numbers that purports to be the inverse of a matrix, we can verify whether or not it is the inverse of that matrix: all we need to do is multiply by the original matrix and see if the result is a unit matrix. We have done this for the 2 × 2 matrix in our previous chapter. An exercise for the reader is to verify that the matrix shown in equation 4-11 is, in fact, the inverse of the matrix A.
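For readers who would rather let a computer do this exercise, the check can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the helper name `mat_mul` is ours, and the entries of equation 4-11 are entered as exact fractions so that no rounding can blur the comparison with the unit matrix.

```python
from fractions import Fraction


def mat_mul(first, second):
    """Row-by-column product of two matrices stored as lists of rows."""
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*second)]
        for row in first
    ]


A = [[1, 1, 0],
     [4, 2, 1],
     [6, -2, -4]]

# Candidate inverse from equation 4-11, converted to exact fractions.
A_inv = [[Fraction(s) for s in row] for row in (
    ("-0.375", "0.25", "0.0625"),
    ("1.375", "-0.25", "-0.0625"),
    ("-1.25", "0.5", "-0.125"),
)]

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

left = mat_mul(A_inv, A)    # inverse times original
right = mat_mul(A, A_inv)   # original times inverse
print(left == identity, right == identity)
```

Both products come out as the unit matrix, which is the defining property we set out to verify.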


That was the hard part. It now remains to calculate out the expression shown in equation 4-10, to find the final values for the unknowns in the original simultaneous equations. Thus, we need to form the matrix product of A⁻¹ and C:

\[ A^{-1}C = \begin{bmatrix} -0.375 & 0.25 & 0.0625 \\ 1.375 & -0.25 & -0.0625 \\ -1.25 & 0.5 & -0.125 \end{bmatrix} \times \begin{bmatrix} -2 \\ 6 \\ 14 \end{bmatrix} \tag{4-12} \]

This matrix multiplication is similar to the one we did before: we need to multiply a 3 × 3 matrix by a 3 × 1 matrix; the result will then also have dimensions of three rows and one column. By equation 4-10 this result is the matrix B of unknowns, and its three rows will thus be the result of these computations:

\[ B_{11} = (-0.375)(-2) + (0.25)(6) + (0.0625)(14) = 0.75 + 1.5 + 0.875 = 3.125 \tag{4-13a} \]

\[ B_{21} = (1.375)(-2) + (-0.25)(6) + (-0.0625)(14) = -2.75 + (-1.5) + (-0.875) = -5.125 \tag{4-13b} \]

\[ B_{31} = (-1.25)(-2) + (0.5)(6) + (-0.125)(14) = 2.5 + 3 + (-1.75) = 3.75 \tag{4-13c} \]

Thus, in matrix terms, the matrix B is

\[ B = \begin{bmatrix} 3.125 \\ -5.125 \\ 3.75 \end{bmatrix} \tag{4-14} \]

and this may be compared to the result we obtained algebraically in the last chapter (and found to be identical, within the limits of different roundings used). At first glance it would seem as though this approach has the additional characteristic of requiring fewer computations than our previous method of solving similar equations. However, the computations are exactly the same; most of them are simply “hidden” inside the matrix inversion. It might also seem that we have been repetitive in our explanation of these simultaneous equations. This is intentional: we are attempting to explicate the relationship between the algebraic approach and the matrix approach to solving the equations. Our first solution (in the previous chapter) was strictly algebraic. Our second solution used matrix terminology and concepts, in addition to explicitly writing out all the arithmetic involved. Our third approach uses symbolic matrix manipulation, substituting numbers only in the last step.
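The last step of the symbolic approach, substituting numbers into the product A⁻¹C of equation 4-12, can be sketched as follows. The variable names are our own; exact fractions are used so that the substitution check at the end is an exact comparison rather than one within rounding.

```python
from fractions import Fraction

A = [[1, 1, 0],
     [4, 2, 1],
     [6, -2, -4]]

# Inverse of A from equation 4-11, as exact fractions.
A_inv = [[Fraction(s) for s in row] for row in (
    ("-0.375", "0.25", "0.0625"),
    ("1.375", "-0.25", "-0.0625"),
    ("-1.25", "0.5", "-0.125"),
)]

C = [-2, 6, 14]   # right-hand sides of the simultaneous equations

# Each unknown is the sum-of-products of one row of the inverse with the column C.
solution = [sum(x * c for x, c in zip(row, C)) for row in A_inv]

# Check: substituting the unknowns back into the equations must reproduce C.
check = [sum(x * v for x, v in zip(row, solution)) for row in A]

print([float(v) for v in solution])
```

The substitution check is the numerical counterpart of multiplying out AB and comparing the result with C.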


MULTIPLE LINEAR REGRESSION

In Chapters 2 and 3, we discussed the rules related to solving systems of linear equations using elementary algebraic manipulation, including simple matrix operations. The past chapters have described the inverse and transpose of a matrix in at least an introductory fashion. In this installment we would like to introduce the concepts of matrix algebra and their relationship to multiple linear regression (MLR). Let us start with the basic spectroscopic calibration relationship:

Concentration = Bias + (Regression Coefficient 1) × (Absorbance at Wavelength 1) + (Regression Coefficient 2) × (Absorbance at Wavelength 2)

Also written as:

\[ \text{Concentration} = \beta_0 + \beta_1 A_1 + \beta_2 A_2 \tag{4-15} \]

In this example we state that the concentration of an analyte within a sample is a linear combination of two variables. These variables, in our case, are measured in the same units, that is, Absorbance units. In this case the concentration is known as the dependent variable or response variable because its magnitude depends on, or responds to, changes in the Absorbances at Wavelengths 1 and 2. The Absorbances are the x-variables, referred to as independent variables, regressor variables, or predictor variables. Thus an equation such as equation 4-15 attempts to explain the relationship between concentration and changes in Absorbance. This calibration equation or calibration model is said to be linear because the relationship is a linear combination of multiplier terms or regression coefficients as predictors of the concentration (response or dependent variable). Note that the β1 and β2 terms are called Regression Coefficients, Multiplier Terms, Multipliers, or sometimes Parameters. The analysis described is referred to as Linear Regression, Least-Squares, Linear Least-Squares, or most properly, MLR. In more formal notation, we can rewrite equation 4-15 as:

\[ E(c_j) = \beta_0 + \beta_1 A_1 + \beta_2 A_2 \tag{4-16} \]

where E(cj) is the expected value for the concentration. Note: the difference between E(cj) and cj is the difference between the predicted or expected value E(cj) and the actual or observed value cj. This can be rewritten as:

\[ c_j - E(c_j) = c_j - (\beta_0 + \beta_1 A_1 + \beta_2 A_2) \tag{4-17} \]

and

\[ c_j = \beta_0 + \beta_1 A_1 + \beta_2 A_2 + \varepsilon_j \tag{4-18} \]

where εj is termed the Prediction Error, Residual Error, Residual, Error, Lack of Fit Error, or the Unexplained Error.


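We can also rewrite the equation in matrix form as:

\[ C = \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_N \end{bmatrix}, \quad A = \begin{bmatrix} 1 & A_{11} & A_{12} \\ 1 & A_{21} & A_{22} \\ 1 & A_{31} & A_{32} \\ \vdots & \vdots & \vdots \\ 1 & A_{N1} & A_{N2} \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}, \quad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \vdots \\ \varepsilon_N \end{bmatrix} \tag{4-19} \]

This equation of the model in matrix notation is written as:

\[ C = A\beta + \varepsilon \tag{4-20} \]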
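To make the layout of these matrices concrete, here is a small sketch that builds A with its leading column of ones and evaluates the expected concentrations and the residuals. All of the numbers (absorbances, coefficients and observed concentrations) are hypothetical, invented purely for illustration, as are the function names.

```python
def design_matrix(absorbance_1, absorbance_2):
    """Build A: a column of ones (for the bias term), then one column per wavelength."""
    return [[1.0, a1, a2] for a1, a2 in zip(absorbance_1, absorbance_2)]


def expected_concentrations(A, beta):
    """Row-by-row sum of products, i.e. the matrix product of A with beta."""
    return [sum(a * b for a, b in zip(row, beta)) for row in A]


# Hypothetical data: three spectra, two wavelengths.
absorbance_1 = [0.25, 0.5, 0.75]
absorbance_2 = [0.5, 0.25, 1.0]
beta = [1.0, 2.0, 4.0]           # hypothetical bias and two regression coefficients
observed = [3.75, 3.0, 6.0]      # hypothetical actual concentrations

A = design_matrix(absorbance_1, absorbance_2)
expected = expected_concentrations(A, beta)
residuals = [c - e for c, e in zip(observed, expected)]   # actual minus expected

print(expected, residuals)
```

The column of ones is what allows the bias term to be carried along as just another regression coefficient.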

THE LEAST SQUARES METHOD

The problem now becomes: how do we handle the situation in which we have more equations than unknowns? When there are fewer equations than unknowns it is clear that there is not enough information available to determine the values of the unknown variables. When we have more equations than unknowns, however, we would seem to have the problem of having too much information; how do we handle all this extra information and put it to use? For example, consider the following set of simultaneous equations:

\[ 1a + 1b + 0c = -2 \tag{4-21a} \]

\[ 4a + 2b + 1c = 6 \tag{4-21b} \]

\[ 6a - 2b - 4c = 14 \tag{4-21c} \]

\[ 1a + 3b + (-1)c = -15 \tag{4-21d} \]

This is a set of four equations in three unknowns. The first three of these equations are the ones we dealt with above, and we have seen that the solution to the first three equations is

\[ a = 3.125 \tag{4-22a} \]

\[ b = -5.125 \tag{4-22b} \]

\[ c = 3.75 \tag{4-22c} \]

However, when we replace a, b and c in equation 4-21d by those values, we find that 1 × 3.125 + 3 × (−5.125) + (−1) × 3.75 = −16, rather than the −15 that the equation specifies. If we were to use different subsets of three of these equations at a time, we would obtain different answers depending on which set of three equations we used. There seems to be an inconsistency here, yet in the set of four equations represented by equations 4-21 (a–d) all the equations have the same significance; there are no a priori criteria for eliminating any one of them. This is the situation we must handle. We cannot simply ignore one or more of these equations arbitrarily; dealing with them properly has become known variously as the Least Squares method, Multiple Least Squares, or Multiple Linear Regression. As spectroscopists, we are concerned with the application of these mathematical techniques to the solution of spectroscopic problems, particularly the use of spectroscopy to perform quantitative analysis, which is done by applying these concepts to a set of linear equations, as we will see. In this least squares method example the object is to calculate the terms β0, β1 and β2 which produce a prediction model yielding the smallest or “least squared” differences or residuals between the actual analyte value cj and the predicted or expected concentration E(cj). To calculate the multiplier terms or regression coefficients βj for the model we can begin with the matrix notation:

\[ A'A\hat{\beta} = A'C \tag{4-23} \]

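When solving for β̂ the expression becomes

\[ \hat{\beta} = (A'A)^{-1}A'C \tag{4-24} \]

To illustrate the matrix algebra involved for this problem we write

\[ A'A = \begin{bmatrix} \sum_j 1^2 & \sum_j A_{j1} \times 1 & \sum_j A_{j2} \times 1 \\ \sum_j 1 \times A_{j1} & \sum_j A_{j1} \times A_{j1} & \sum_j A_{j2} \times A_{j1} \\ \sum_j 1 \times A_{j2} & \sum_j A_{j1} \times A_{j2} & \sum_j A_{j2} \times A_{j2} \end{bmatrix} = \begin{bmatrix} N & A_{\cdot 1} & A_{\cdot 2} \\ A_{\cdot 1} & \sum_j A_{j1}^2 & \sum_j A_{j2}A_{j1} \\ A_{\cdot 2} & \sum_j A_{j1}A_{j2} & \sum_j A_{j2}^2 \end{bmatrix} \tag{4-25} \]

Then rewriting in summation notation we have

\[ \sum_{j=1}^{N} 1^2 = N, \qquad \sum_{j=1}^{N} A_{j1} \times A_{j2} = \sum_{j=1}^{N} A_{j1}A_{j2} \qquad \text{and} \qquad \sum_{j=1}^{N} A_{j1} = A_{\cdot 1} \tag{4-26} \]

Note that A′C is also required for the computations (see equation 4-24) and is given as:

\[ A'C = \begin{bmatrix} \sum_j 1 \times C_j \\ \sum_j A_{j1}C_j \\ \sum_j A_{j2}C_j \end{bmatrix} = \begin{bmatrix} N\bar{C} \\ \sum_j A_{j1}C_j \\ \sum_j A_{j2}C_j \end{bmatrix} \tag{4-27} \]

where the first element uses the fact that the sum of the Cj equals N times their mean.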


We represent our spectroscopic data using the following symbols:

j = Spectrum number
Cj = Actual concentration for each spectrum
N = Rank of each spectrum (1)
Aj1 = Absorbance at Wavelength 1
Aj2 = Absorbance at Wavelength 2

From this information we can calculate β̂ (see equation 4-24) using

\[ C = \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_N \end{bmatrix}, \quad A = \begin{bmatrix} 1 & A_{11} & A_{12} \\ 1 & A_{21} & A_{22} \\ 1 & A_{31} & A_{32} \\ \vdots & \vdots & \vdots \\ 1 & A_{N1} & A_{N2} \end{bmatrix}, \quad A'C = \begin{bmatrix} N\bar{C} \\ \sum_j A_{j1}C_j \\ \sum_j A_{j2}C_j \end{bmatrix} \tag{4-28} \]

If we then calculate the inverse of A′A, written as (A′A)⁻¹, the computations are nearly complete and we finally obtain

\[ \hat{\beta} = (A'A)^{-1}A'C = \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix} \tag{4-29} \]

which in conclusion gives the completed regression equation

\[ E(\hat{C}) = \hat{\beta}_0 + \hat{\beta}_1 A_1 + \hat{\beta}_2 A_2 \tag{4-30} \]

In our next installment, we will review the “how to” of the matrix operations for this example using numerical data. Authors’ note: This initial chapter dealing with matrix algebra and regression has been adapted for spectroscopic nomenclature from Shayle R. Searle’s book, Matrix Algebra Useful for Statistics (John Wiley & Sons, New York, 1982), pp. 363–368. Other particularly useful reference sources with page numbers are listed below as [1–3].


REFERENCES

1. Draper, N.R. and Smith, H., Applied Regression Analysis (John Wiley & Sons, New York, 1981), pp. 70–87.
2. Kleinbaum, D.G. and Kupper, L.L., Applied Regression Analysis and Other Multivariable Methods (Duxbury Press, Boston, 1978), pp. 508–520.
3. Workman, J., Jr. and Mark, H., Spectroscopy 8(7), 16–19 (1993).

5

Matrix Algebra and Multiple Linear Regression: Part 2

In the previous chapter we presented the problem of fitting data when there is more information (in the form of equations relating the several variables involved) available than the minimum amount that will allow for the solution of the equations. We then presented the matrix equations for calculating the least squares solution to this case of overdetermined variables. How did we get from one to the other? As we described the situation, when there are more equations than unknowns, one possibility is to ignore some of the equations. This is unsatisfactory, for a number of reasons. In the first place, there is no a priori criterion for deciding which equations to ignore, so that any choice is arbitrary. Secondly, by rejecting some of the equations, we are also rejecting and wasting the work that went into the collection of the data represented by those equations. Thirdly, and perhaps most importantly, when we ignore some of the equations, we are also ignoring the (rather important) fact that the lack of perfect fit to all the equations is itself an important piece of information. What the set of equations is telling us in this case is that there is, in fact, not a perfect fit of the data, taken as a whole, to any of the equations in the set. Rather, there is some average equation that, in some sense, gives a best fit to all of the data taken as a set, without favoring any particular subset of them. It is this “average” equation that we would like to be able to find.

In the history of the development of mathematics, one important branch was the study of the behavior of randomness. Initially, there were no highfalutin ideas of making “science” out of what appeared to be disorder; rather, the investigations of random phenomena that led to what we now know as the science of Statistics began as studies of the behavior of the random phenomena that existed in the somewhat more prosaic context of gambling.
It was not until much later that the recognition came that the same random phenomena that affected, say, dice, also affected the values obtained when physical measurements were made. By the time this realization arose, it was well recognized that random phenomena were describable only by probabilistic statements; by definition it is not possible to state a priori what the outcome of any given random event will be. Thus, when the attention of the mathematicians of the time turned to the description of overdetermined systems, such as we are dealing with here, it was natural for them to seek the desired solution in terms of probabilistic descriptions. They then defined the “best fitting” equation for an overdetermined set of data as being the “most probable” equation, or, in more formal terminology, the “maximum likelihood” equation. Under the proper conditions (said conditions being that the errors that prevent all the data relationships from being described by a single equation are normally [1, 2] distributed) it can be proven mathematically that the “most probable” equation is exactly the one that is the “least square” equation. While we have discussed this point


briefly in the past [3] it is, perhaps, appropriate at this point to revisit it, in a bit more detail. The basis upon which this concept rests is the very fact that not all the data follows the same equation. Another way to express this is to note that an equation describes a line (or more generally, a plane or hyperplane if more than two dimensions are involved; in fact, anywhere in this discussion, when we talk about a calibration line, you should mentally add the phrase “... or plane, or hyperplane ...”). Thus any point that fits the equation will fall exactly on the line. On the other hand, since the data points themselves do not fall on the line (recall that, by definition, the line is generated by applying some sort of [at this point undefined] averaging process), any given data point will not, in general, fall on the line described by the equation. The difference between these two points, the one on the line described by the equation and the one described by the data, is the error in the estimate of that data point by the equation. For each of the data points there is a corresponding point described by the equation, and therefore a corresponding error. The least square principle states that the sum of the squares of all these errors should have a minimum value; and as we stated above, this will also provide the “maximum likelihood” equation. It is certainly true that for any arbitrarily chosen equation, we can calculate the point described by that equation that corresponds to any given data point. Having done that for each of the data points, we can easily calculate the error for each data point, square these errors, and add together all these squares. Clearly, the sum of squares of the errors we obtain by this procedure will depend upon the equation we use, and some equations will provide smaller sums of squares than other equations. 
It is not necessarily intuitively obvious that there is one and only one equation that will provide the smallest possible sum of squares of these errors under these conditions; however, it has been proven mathematically to be so. This proof is very abstruse and difficult. In fact, it is easier to find the equation that provides this “least square” solution than it is to prove that the solution is unique. A reasonably accessible demonstration, expressed in both algebraic and matrix terms, of how to find the least square solution is available. Even though regression analysis (one of the more common names for the application of the least square principle) is a general mathematical technique, when we are dealing with spectroscopic data, so that the equation we wish to fit must be fitted to data obtained from systems that follow Beer’s law, it is convenient to limit our discussion to the properties of spectroscopic systems. Thus we will couch our discussion in terms of quantitative analysis performed using spectroscopic data; then the dependent variable of the least square regression analysis (usually called the “Y” variable by mathematicians) will represent the concentration of analyte in the set of samples used to calibrate the system, and the independent (or “X”) variable will represent absorbance values measured by a suitable instrument in whichever spectral region we are dealing with. We will begin our discussion by demonstrating that, for a non-overdetermined system of equations, the algebraic approach and the least-square approach provide the same solution. We will then extend the discussion to the case of an overdetermined system of equations. Therefore this chapter will continue the multiple linear regression (MLR) discussion introduced in the previous chapter, by solving a numerical example for MLR. Recalling


the basic ultraviolet, visible, near-infrared, and infrared use of MLR for spectroscopic calibration, we have

Concentration = Constant term (or Bias) + (Regression coefficient 1) · (Absorbance at wavelength 1) + (Regression coefficient 2) · (Absorbance at wavelength 2) + · · · + (Regression coefficient N) · (Absorbance at wavelength N)

Also written in equation form as:

\[ \text{Concentration} = \beta_0 + \beta_1 A_{\lambda 1} + \beta_2 A_{\lambda 2} + \cdots + \beta_N A_{\lambda N} \tag{5-1} \]

By including an error term, we can write the equation as:

\[ \text{Concentration} = \beta_0 + \beta_1 A_{\lambda 1} + \beta_2 A_{\lambda 2} + \cdots + \beta_N A_{\lambda N} + e \tag{5-2} \]

And also in expanded matrix form (for M samples and N wavelengths) as:

\[ c = \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_M \end{bmatrix}, \quad A = \begin{bmatrix} A_{11} & A_{12} & A_{13} & A_{14} & \cdots & A_{1N} \\ A_{21} & A_{22} & A_{23} & A_{24} & \cdots & A_{2N} \\ \vdots & \vdots & \vdots & \vdots & & \vdots \\ A_{M1} & A_{M2} & A_{M3} & A_{M4} & \cdots & A_{MN} \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \vdots \\ \beta_N \end{bmatrix}, \quad e = \begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ \vdots \\ e_M \end{bmatrix} \tag{5-3} \]

and in simplified matrix notation, the equation is

\[ c = A\beta + e \tag{5-4} \]

Because we have limited time and space, let us solve our problem using two wavelengths (or frequencies) and a basic calculator. To define the problem, we start with a set of calibration samples with the characteristics listed in Table 5-1.

Table 5-1 Characteristics of the calibration samples

Sample number    Concentration    Signal at wavelength 1    Signal at wavelength 2
1                2.0              0.75                      0.28
2                4.0              0.51                      0.485
3                7.0              0.32                      0.78

The system of equations for solving this problem can be written as

\[ 2.0 = \beta_0 + \beta_1(0.75) + \beta_2(0.28) \tag{5-5a} \]

\[ 4.0 = \beta_0 + \beta_1(0.51) + \beta_2(0.485) \tag{5-5b} \]

\[ 7.0 = \beta_0 + \beta_1(0.32) + \beta_2(0.78) \tag{5-5c} \]


and in simplified matrix form as

\[ C = [A] \cdot [\beta] \tag{5-6} \]

and written in matrix form (with the constant term as the third column, so that β0 is the third element of β) as:

\[ C = \begin{bmatrix} 2.0 \\ 4.0 \\ 7.0 \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_0 \end{bmatrix}, \quad A = \begin{bmatrix} 0.75 & 0.28 & 1 \\ 0.51 & 0.485 & 1 \\ 0.32 & 0.78 & 1 \end{bmatrix} \tag{5-7} \]

The augmented matrix formed by [A|C] is designated as:

\[ [A|C] = \left[\begin{array}{ccc|c} 0.75 & 0.28 & 1 & 2.0 \\ 0.51 & 0.485 & 1 & 4.0 \\ 0.32 & 0.78 & 1 & 7.0 \end{array}\right] \tag{5-8} \]

The first task is to use elementary matrix row operations to manipulate matrix [A|C] to yield zeros in rows II and III of column I. The row operations are to replace row II by row II minus 0.68 times row I, that is, II = II − 0.68 × I; followed by replacing row III by row III minus 0.4267 times row I, that is, III = III − 0.4267 × I. To complete our row operations we must finish placing zeros in columns I and II of row III by replacing row III by row III minus 2.242 times row II, that is, III = III − 2.242 × II. These row operations yield (remember to keep as much precision as possible in your calculations):

\[ \left[\begin{array}{ccc|c} 0.75 & 0.28 & 1 & 2.0 \\ 0 & 0.2946 & 0.32 & 2.64 \\ 0 & 0 & -0.1442 & 0.2274 \end{array}\right] \tag{5-9} \]

In summary, by using two series of row operations, namely II = II − 0.68 I and III = III − 0.4267 I, followed by III = III − 2.242 II, we have

\[ \left[\begin{array}{ccc|c} 0.75 & 0.28 & 1 & 2.0 \\ 0.51 & 0.485 & 1 & 4.0 \\ 0.32 & 0.78 & 1 & 7.0 \end{array}\right] \sim \left[\begin{array}{ccc|c} 0.75 & 0.28 & 1 & 2.0 \\ 0 & 0.2946 & 0.32 & 2.64 \\ 0 & 0 & -0.1442 & 0.2274 \end{array}\right] \tag{5-10} \]

These two matrices (original and final) are row equivalent because by using simple row operations the right matrix was formed from the left matrix. The final matrix is equivalent to a set of equations as shown below:

\[ 0.75\beta_1 + 0.28\beta_2 + 1.0\beta_0 = 2.0 \tag{5-11a} \]

\[ 0.2946\beta_2 + 0.32\beta_0 = 2.64 \tag{5-11b} \]

\[ -0.1442\beta_0 = 0.2274 \tag{5-11c} \]

Now solving the system of equations yields (−0.1442)β0 = 0.2274, β0 = −1.577; solving for β2, we find (0.2946)β2 + 0.32(−1.577) = 2.64, β2 = 10.674; solving for β1 yields (0.75)β1 + 0.28(10.674) + 1(−1.577) = 2.0, β1 = 0.784.
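The row operations and the back-substitution above can be sketched in code as follows. This is a minimal illustration under our own assumptions: the function names are hypothetical, no row interchanges (pivoting) are performed, and a zero on the diagonal is not handled. Because the code carries full floating-point precision, its results differ slightly in the last digit from the hand calculation.

```python
def forward_eliminate(augmented):
    """Zero the entries below the diagonal, one column at a time (no pivoting)."""
    rows = [row[:] for row in augmented]
    n = len(rows)
    for col in range(n - 1):
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]   # e.g. 0.51 / 0.75 = 0.68
            rows[r] = [v - factor * p for v, p in zip(rows[r], rows[col])]
    return rows


def back_substitute(rows):
    """Solve the triangular system from the last row upward."""
    n = len(rows)
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(rows[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (rows[i][n] - known) / rows[i][i]
    return x


# Augmented matrix [A|C]; the constant term occupies the third column.
augmented = [[0.75, 0.28, 1.0, 2.0],
             [0.51, 0.485, 1.0, 4.0],
             [0.32, 0.78, 1.0, 7.0]]

triangular = forward_eliminate(augmented)
beta1, beta2, beta0 = back_substitute(triangular)
print(round(beta0, 3), round(beta1, 3), round(beta2, 3))
```

The unknowns come out in column order, so the constant term is the last value returned.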


And so,

β0 = −1.577
β1 = 0.784
β2 = 10.674

Substituting into the original equations and calculating the differences between predicted and actual results, we find the results shown in Table 5-2.

Table 5-2 Results after substituting into the original equations and calculating the differences between predicted and actual results (using manual row operations)

Sample number    β0 + β1(Aλ1) + β2(Aλ2)                    = Predicted    − Actual    = Residual
1                −1.577 + 0.784(0.75) + 10.674(0.28)       = 2.0          − 2.0       = 0
2                −1.577 + 0.784(0.51) + 10.674(0.485)      = 4.0          − 4.0       = 0
3                −1.577 + 0.784(0.32) + 10.674(0.78)       = 7.0          − 7.0       = 0

The foregoing discussion is all based on one important assumption: that the equation describing the relationship between the data does, in fact, include a constant term. If Beer’s law is strictly followed, however, when the concentration of all absorbing constituents is zero, then the absorbance (at all wavelengths, no less) is also zero: that is, the equation describing the relationship between the data generates a line that passes through the origin. If this condition holds, then the constant term of the equation is also exactly zero, and may be dropped from the equation. It has been shown possible to generate a least squares expression for this case also, that is, with the constant of the equation forced to be zero: it is merely necessary to formulate the prediction equation (equation 5-11d) as:

\[ \text{Conc.} = \beta_1 A_1 + \beta_2 A_2 \tag{5-11d} \]

Starting from this expression, one can execute the derivation just as in the case of the full equation (i.e., the equation including the constant term), and arrive at a set of equations that result in the least square expression for an equation that passes through the origin. We will not dwell on this point since it is not common in practice. However, we will use this concept to fit the data presented, just to illustrate its use, and for the sake of comparison, ignoring the fact that without the constant term these data are overdetermined, while they are not overdetermined if the constant term is included; if we had more data (even only one more relationship), they would be overdetermined in both cases. Then, if the equation system is solved with no constant term (β0), we have the following results (you can either take our word for it or perform the row operations for yourself. Exercise for the reader: do those row operations.): β2(0.2946) = 2.64, β2 = 8.9613; and β1(0.75) + 0.28(8.9613) = 2.0, β1 = −0.679. And so,

β1 = −0.679
β2 = 8.9613

and the results are shown in Table 5-3.

Table 5-3 Results when there is no constant (bias) term after substituting into the original equations and calculating the differences between predicted and actual results

Sample number    β1(Aλ1) + β2(Aλ2)                  = Predicted    − Actual    = Residual
1                −0.679(0.75) + 8.9613(0.28)        = 2.0          − 2.0       = 0.0
2                −0.679(0.51) + 8.9613(0.485)       = 4.0          − 4.0       = 0.0
3                −0.679(0.32) + 8.9613(0.78)        = 6.78         − 7.0       = −0.23

Another exercise for the reader: Why is a bias term often used in regression for spectroscopic data?
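For readers who take up the exercise, the no-constant-term calculation can be sketched as below. As in the hand calculation, the two coefficients are obtained from the first two samples by one row operation and a back-substitution, and the third sample is then used only to compute a residual; the variable names are our own.

```python
# Samples 1 and 2 from Table 5-1: absorbances at the two wavelengths, then concentration.
a11, a12, c1 = 0.75, 0.28, 2.0
a21, a22, c2 = 0.51, 0.485, 4.0

# Row operation II = II - factor * I removes beta1 from the second equation.
factor = a21 / a11                                  # 0.68
beta2 = (c2 - factor * c1) / (a22 - factor * a12)   # 2.64 / 0.2946
beta1 = (c1 - a12 * beta2) / a11                    # back-substitution into row I

# Sample 3 was not used in the fit, so it shows how well the line through
# the origin predicts a sample outside the two that determined it.
predicted_3 = beta1 * 0.32 + beta2 * 0.78
residual_3 = predicted_3 - 7.0

print(round(beta1, 3), round(beta2, 4), round(residual_3, 2))
```

The first two samples are reproduced exactly by construction, so all of the misfit appears in the third.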

THE POWER OF MATRIX MATHEMATICS

Now let us see what happens when we use pure, unadulterated matrix power to solve this equation system, such that

\[ A'A\hat{\beta} = A'C \tag{5-12} \]

as equation 4-23 showed us. When solving for the regression coefficients (β), we have

\[ \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} = \hat{\beta} = (A'A)^{-1}A'C \tag{5-13} \]

Noting the matrix algebra for this problem (Equation 25 from reference [1])

\[ A'A = \begin{bmatrix} \sum_j A_{j0}^2 & \sum_j A_{j1}A_{j0} & \sum_j A_{j2}A_{j0} \\ \sum_j A_{j0}A_{j1} & \sum_j A_{j1}^2 & \sum_j A_{j2}A_{j1} \\ \sum_j A_{j0}A_{j2} & \sum_j A_{j1}A_{j2} & \sum_j A_{j2}^2 \end{bmatrix} = \begin{bmatrix} N & A_{\cdot 1} & A_{\cdot 2} \\ A_{\cdot 1} & \sum_j A_{j1}^2 & \sum_j A_{j2}A_{j1} \\ A_{\cdot 2} & \sum_j A_{j1}A_{j2} & \sum_j A_{j2}^2 \end{bmatrix} \tag{5-14} \]

and substituting the numbers from our current example, we illustrate the following steps:

\[ A = \begin{bmatrix} 1 & 0.75 & 0.28 \\ 1 & 0.51 & 0.485 \\ 1 & 0.32 & 0.78 \end{bmatrix} \tag{5-15} \]

and so the transpose of A (which is A′) is

\[ A' = \begin{bmatrix} 1 & 1 & 1 \\ 0.75 & 0.51 & 0.32 \\ 0.28 & 0.485 & 0.78 \end{bmatrix} \tag{5-16} \]


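and to continue, A transpose ($A^{\prime}$) times A is

\[
A^{\prime} A =
\begin{bmatrix}
1 \times 1 + 1 \times 1 + 1 \times 1 & 1 \times 0.75 + 1 \times 0.51 + 1 \times 0.32 & 1 \times 0.28 + 1 \times 0.485 + 1 \times 0.78 \\
0.75 \times 1 + 0.51 \times 1 + 0.32 \times 1 & 0.75 \times 0.75 + 0.51 \times 0.51 + 0.32 \times 0.32 & 0.75 \times 0.28 + 0.51 \times 0.485 + 0.32 \times 0.78 \\
0.28 \times 1 + 0.485 \times 1 + 0.78 \times 1 & 0.28 \times 0.75 + 0.485 \times 0.51 + 0.78 \times 0.32 & 0.28 \times 0.28 + 0.485 \times 0.485 + 0.78 \times 0.78
\end{bmatrix}
=
\begin{bmatrix} 3 & 1.58 & 1.545 \\ 1.58 & 0.925 & 0.707 \\ 1.545 & 0.707 & 0.922 \end{bmatrix}
\tag{5-17}
\]

Next we need to calculate the inverse of $[A^{\prime} A]$, designated $[A^{\prime} A]^{-1}$. Because $A^{\prime} A$ is a 3 × 3 matrix, we had better use a computer program suitably equipped to calculate the inverse [2]. In terms of row operations, the task is to reduce $A^{\prime} A$ to the identity matrix:

\[
\begin{bmatrix} 3 & 1.58 & 1.545 \\ 1.58 & 0.925 & 0.707 \\ 1.545 & 0.707 & 0.922 \end{bmatrix}
\sim
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\tag{5-18}
\]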

Exercise for the reader: See if you are able to determine all the row operations required to find the inverse of $A^{\prime} A$. (We recommend you set aside the better part of an afternoon to work this one through!) The augmented form is written as

\[
\left[ \begin{array}{ccc|ccc}
3 & 1.58 & 1.545 & 1 & 0 & 0 \\
1.58 & 0.925 & 0.707 & 0 & 1 & 0 \\
1.545 & 0.707 & 0.922 & 0 & 0 & 1
\end{array} \right]
\tag{5-19}
\]

Thanks to the power of computers we find that the inverse of $A^{\prime} A$ is

\[
(A^{\prime} A)^{-1} =
\begin{bmatrix}
348.0747 & -359.3786 & -307.7061 \\
-359.3786 & 373.6609 & 315.6969 \\
-307.7061 & 315.6969 & 274.639
\end{bmatrix}
\tag{5-20}
\]
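For readers who would rather let the computer carry out the row operations of equations 5-18 and 5-19, the following is a minimal sketch in plain Python. It is not the program the text refers to; the helper names (`transpose`, `matmul`, `gauss_jordan_inverse`) are made up for this illustration, and partial pivoting is an added safeguard that the hand calculation does not require.

```python
# Sketch: invert A'A by Gauss-Jordan row operations on the augmented matrix [A'A | I].
# Helper names are illustrative only; they do not come from the text.

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(x, y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def gauss_jordan_inverse(m):
    n = len(m)
    # Build the augmented matrix [m | I], as in equation 5-19.
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        # Partial pivoting: bring up the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so that the pivot element becomes 1.
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Subtract multiples of the pivot row to zero the rest of the column.
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    # The right-hand half of the augmented matrix now holds the inverse.
    return [row[n:] for row in aug]

A = [[1.0, 0.75, 0.28],
     [1.0, 0.51, 0.485],
     [1.0, 0.32, 0.78]]

AtA = matmul(transpose(A), A)          # equation 5-17, without rounding
AtA_inv = gauss_jordan_inverse(AtA)    # compare with equation 5-20
check = matmul(AtA_inv, AtA)           # should be close to the identity matrix

for row in AtA_inv:
    print(["%.4f" % v for v in row])
```

Note that the sketch forms A′A from the unrounded data. This matrix is close to singular, so starting instead from the three-figure entries printed in equation 5-17 can shift the elements of the inverse by a few percent.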

Then the next step is to calculate

\[
A^{\prime} C =
\begin{bmatrix} \sum_j A_{j0} c_j \\ \sum_j A_{j1} c_j \\ \sum_j A_{j2} c_j \end{bmatrix}
=
\begin{bmatrix} 1 & 1 & 1 \\ 0.75 & 0.51 & 0.32 \\ 0.28 & 0.485 & 0.78 \end{bmatrix}
\begin{bmatrix} 2.0 \\ 4.0 \\ 7.0 \end{bmatrix}
=
\begin{bmatrix} 1(2) + 1(4) + 1(7) \\ 0.75(2) + 0.51(4) + 0.32(7) \\ 0.28(2) + 0.485(4) + 0.78(7) \end{bmatrix}
=
\begin{bmatrix} 13 \\ 5.78 \\ 7.96 \end{bmatrix}
\tag{5-21}
\]


To solve for the regression coefficients ($\beta_i$), we are required to calculate $(A^{\prime} A)^{-1} A^{\prime} C$ as follows (see equation 5-13):

\[
\begin{aligned}
\hat{\beta} = (A^{\prime} A)^{-1} A^{\prime} C
&= \begin{bmatrix} 348.0747 & -359.3786 & -307.7061 \\ -359.3786 & 373.6609 & 315.6969 \\ -307.7061 & 315.6969 & 274.639 \end{bmatrix}
\begin{bmatrix} 13.0 \\ 5.78 \\ 7.96 \end{bmatrix} \\
&= \begin{bmatrix} 348.0747(13) + (-359.3786)(5.78) + (-307.7061)(7.96) \\ (-359.3786)(13) + 373.6609(5.78) + 315.6969(7.96) \\ (-307.7061)(13) + 315.6969(5.78) + 274.639(7.96) \end{bmatrix} \\
&= \begin{bmatrix} -1.577 \\ 0.786 \\ 10.675 \end{bmatrix}
= \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}
\end{aligned}
\tag{5-22}
\]

And, checking our work, we arrive at Table 5-4.

Now, if we took our original set of data, as expressed in equations 5-5a–5-5c, and added one more relationship to them, we come up with the following situation:

\[ 2.0 = b_0 + b_1 (0.75) + b_2 (0.28) \tag{5-23a′} \]

\[ 4.0 = b_0 + b_1 (0.51) + b_2 (0.485) \tag{5-23b′} \]

\[ 7.0 = b_0 + b_1 (0.32) + b_2 (0.78) \tag{5-23c′} \]

\[ 8.0 = b_0 + b_1 (0.40) + b_2 (0.79) \tag{5-23d′} \]

Now we have the situation we discussed earlier: we have four relationships among a set of data, and only three possible variables (even including the b0 term) that we can use to fit these data. We can solve any subset of three of these relationships, simply by leaving one of the four equations out of the solution. If we do that we come up with the following table of results (we forbear to show all the computations here; however, we do recommend to our readers that they do one or two of these, for the practice):

                             b0            b1          b2
Eliminating equation 5-1:    −9.47843      10.39215    16.86274
Eliminating equation 5-2:    −10.86455     10.15801    18.73589
Eliminating equation 5-3:    −5.20039      4.1461      14.6100
Eliminating equation 5-4:    −1.5777       0.78492     10.675

Table 5-4 Results after substituting into the original equations and calculating the differences between predicted and actual results (using MATLAB calculations)

Sample number   β0       +   β1(A1)         +   β2(A2)           =   Predicted   −   Actual   =   Residual
1               −1.577   +   0.786(0.75)    +   10.675(0.28)     =   2.002       −   2.0      =   0.002
2               −1.577   +   0.786(0.51)    +   10.675(0.485)    =   4.001       −   4.0      =   0.001
3               −1.577   +   0.786(0.32)    +   10.675(0.78)     =   7.001       −   7.0      =   0.001


The last entry in this table, the results obtained from eliminating equation 5-4, represents of course the results obtained from the original set of three equations, since eliminating equation 5-4 from the set leaves us with exactly that same set. However, even though there does not seem to be much difference between the various equations represented by equations 5-23a′–5-23d′, it is clear that the fitting equation depends very strongly upon which subset of these equations we choose to keep in our calculations. Thus we see that we cannot arbitrarily select any subset of the data to use in our computations; it is critical to keep all the data in order to achieve the correct result, and that requires using the regression approach, as we discussed above. If we do that, then we find that the correct fitting equation is (again, this system of equations is simple enough to do for practice – the matrix inversion can be performed using the row operations as we described previously):

                       b0          b1         b2
Regression results:    −6.85119    6.15659    15.50951
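The subset solutions and the regression solution can also be checked by machine. The sketch below is one way to do it in plain Python, assuming nothing beyond the four relationships given above: `solve` is a small Gaussian-elimination helper written for this illustration (it is not a library routine), each three-equation subset is solved exactly, and the full set is fitted through the normal equations A′Ab = A′C.

```python
# Sketch: solve each three-equation subset exactly, then fit all four by least squares.
# solve() is a plain Gaussian-elimination helper written for this illustration.

def solve(m, rhs):
    n = len(m)
    aug = [list(map(float, row)) + [float(b)] for row, b in zip(m, rhs)]
    for col in range(n):
        # Partial pivoting, then forward elimination below the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

# Rows are [1, A1, A2]; c holds the four concentrations.
A = [[1.0, 0.75, 0.28],
     [1.0, 0.51, 0.485],
     [1.0, 0.32, 0.78],
     [1.0, 0.40, 0.79]]
c = [2.0, 4.0, 7.0, 8.0]

# Exact solutions with one equation left out at a time.
subset_fits = []
for skip in range(4):
    keep = [i for i in range(4) if i != skip]
    subset_fits.append(solve([A[i] for i in keep], [c[i] for i in keep]))

# Least squares on all four equations via the normal equations A'A b = A'c.
AtA = [[sum(row[i] * row[j] for row in A) for j in range(3)] for i in range(3)]
Atc = [sum(row[i] * ci for row, ci in zip(A, c)) for i in range(3)]
b_all = solve(AtA, Atc)

for k, b in enumerate(subset_fits, start=1):
    print("omit equation %d:" % k, ["%.5f" % v for v in b])
print("all four:", ["%.5f" % v for v in b_all])
```

Run as is, it prints the four subset solutions followed by the least-squares solution, in the order b0, b1, b2, so that they can be set beside the tabulated values.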

Note, by the way, that if you thought that the regression solution would simply be the average of all the other solutions, you were wrong. By now some of you must be thinking that there must be an easier way to solve systems of equations than wrestling with manual row operations. Well, of course there are better ways, which is why we will refresh your memory on the concept of determinants in the next chapter. After we have introduced determinants we will conclude our introductory coverage of matrix algebra and MLR with some final remarks.

REFERENCES

1. Mark, H. and Workman, J., Statistics in Spectroscopy (Academic Press, Boston, 1991), pp. 45–56; see also Mark, H. and Workman, J., Spectroscopy 2(9), 37–43 (1987).
2. Mark, H., Principles and Practice of Spectroscopic Calibration (John Wiley & Sons, New York, 1991), pp. 21–24.
3. Mark, H. and Workman, J., Statistics in Spectroscopy (Academic Press, Boston, 1991), pp. 271–281; see also Mark, H. and Workman, J., Spectroscopy 7(3), 20–23 (1992).


6 Matrix Algebra and Multiple Linear Regression: Part 3 – The Concept of Determinants

In the previous chapter [1] we promised a discussion of an easier way to solve equation systems – the method of determinants [2]. To begin, given a 2 × 2 matrix [A] as

\[ A = \begin{bmatrix} a_1 & b_1 \\ a_2 & b_2 \end{bmatrix} \tag{6-1} \]

the determinant of A is designated by

\[ |A| = \begin{vmatrix} a_1 & b_1 \\ a_2 & b_2 \end{vmatrix} \tag{6-2} \]

Note that the brackets [ ] used to denote matrices are converted to vertical lines to denote a determinant. To continue, then the determinant of A is calculated this way:

\[ A_{\mathrm{det}} = a_1 b_2 - a_2 b_1 \tag{6-3} \]

The determinant is found by cross-multiplying the diagonal elements in a matrix and subtracting one diagonal product from the other, such that

\[ A_{\mathrm{det}} = \begin{vmatrix} a_1 & b_1 \\ a_2 & b_2 \end{vmatrix} = a_1 b_2 - a_2 b_1 \tag{6-4} \]

A numerical example is given as follows. Given A, find its determinant:

\[
\text{If } A = \begin{bmatrix} 0.75 & 0.28 \\ 0.51 & 0.485 \end{bmatrix}, \text{ then }
A_{\mathrm{det}} = \begin{vmatrix} 0.75 & 0.28 \\ 0.51 & 0.485 \end{vmatrix}
= 0.75 \times 0.485 - 0.28 \times 0.51 = 0.364 - 0.143 = 0.221
\tag{6-5}
\]

To use determinants to solve a system of linear equations, we look at a simple application given two equations and two unknowns. For the equation system

\[ C_1 = \beta_1 A_{k11} + \beta_2 A_{k12} \tag{6-6a} \]

\[ C_2 = \beta_1 A_{k21} + \beta_2 A_{k22} \tag{6-6b} \]

we denote $\beta_1$ and $\beta_2$ as unknown regression coefficients. By algebraic manipulation, we can eliminate the $\beta_2$ term from the equation system by multiplying the first equation by $A_{k22}$ and the second equation by $A_{k12}$. By subtracting the two equations, we arrive at equations 6-7a through 6-7d:

\[ A_{k22} C_1 = A_{k22} \beta_1 A_{k11} + A_{k22} \beta_2 A_{k12} \tag{6-7a} \]

\[ (-)\quad A_{k12} C_2 = A_{k12} \beta_1 A_{k21} + A_{k12} \beta_2 A_{k22} \tag{6-7b} \]

\[ A_{k22} C_1 - A_{k12} C_2 = A_{k22} \beta_1 A_{k11} - A_{k12} \beta_1 A_{k21} \tag{6-7c} \]

and

\[ A_{k22} C_1 - A_{k12} C_2 = (A_{k22} A_{k11} - A_{k12} A_{k21}) \beta_1 \tag{6-7d} \]

If the $(A_{k22} A_{k11} - A_{k12} A_{k21})$ term is nonzero, then we can divide this term into the above equation (6-7d) to arrive at

\[ \beta_1 = \frac{A_{k22} C_1 - A_{k12} C_2}{A_{k22} A_{k11} - A_{k12} A_{k21}} \tag{6-8} \]

Note the denominator can be written as the determinant

\[ \begin{vmatrix} A_{k11} & A_{k12} \\ A_{k21} & A_{k22} \end{vmatrix} \tag{6-9} \]

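referred to as the determinant of coefficients. We can also write the numerator as the determinant:

\[ \begin{vmatrix} C_1 & A_{k12} \\ C_2 & A_{k22} \end{vmatrix} \tag{6-10} \]

and so,

\[
\beta_1 = \frac{\begin{vmatrix} C_1 & A_{k12} \\ C_2 & A_{k22} \end{vmatrix}}{\begin{vmatrix} A_{k11} & A_{k12} \\ A_{k21} & A_{k22} \end{vmatrix}}
\tag{6-11}
\]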

We can also solve for $\beta_2$ by algebraic manipulation of the equation system. Elimination of the $\beta_1$ term is accomplished by multiplying the first equation by $A_{k21}$ and the second equation by $A_{k11}$ and subtracting the results, dividing by the common term, and lastly, by converting both the numerator and the denominator to determinants, finally arriving at equation 6-12.

\[
\beta_2 = \frac{\begin{vmatrix} A_{k11} & C_1 \\ A_{k21} & C_2 \end{vmatrix}}{\begin{vmatrix} A_{k11} & A_{k12} \\ A_{k21} & A_{k22} \end{vmatrix}}
\tag{6-12}
\]


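To summarize what is referred to as Cramer’s rule, we can use the following general expressions given a system of two equations (6-13a and 6-13b) in two unknowns such that

\[ C_1 = \beta_1 A_{k11} + \beta_2 A_{k12} \tag{6-13a} \]

\[ C_2 = \beta_1 A_{k21} + \beta_2 A_{k22} \tag{6-13b} \]

We can generalize a solution to this system of equations by using the following determinant notation:

\[
D = \begin{vmatrix} A_{k11} & A_{k12} \\ A_{k21} & A_{k22} \end{vmatrix}, \qquad
D_{\beta_1} = \begin{vmatrix} C_1 & A_{k12} \\ C_2 & A_{k22} \end{vmatrix}, \qquad
D_{\beta_2} = \begin{vmatrix} A_{k11} & C_1 \\ A_{k21} & C_2 \end{vmatrix}
\]

And so, if $D \neq 0$, then we can solve for $\beta_1$ and $\beta_2$ using the relationships

\[
\beta_1 = \frac{D_{\beta_1}}{D} = \frac{\begin{vmatrix} C_1 & A_{k12} \\ C_2 & A_{k22} \end{vmatrix}}{\begin{vmatrix} A_{k11} & A_{k12} \\ A_{k21} & A_{k22} \end{vmatrix}}
\tag{6-14}
\]

and

\[
\beta_2 = \frac{D_{\beta_2}}{D} = \frac{\begin{vmatrix} A_{k11} & C_1 \\ A_{k21} & C_2 \end{vmatrix}}{\begin{vmatrix} A_{k11} & A_{k12} \\ A_{k21} & A_{k22} \end{vmatrix}}
\tag{6-15}
\]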
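As a concrete illustration of equations 6-14 and 6-15, here is a minimal sketch in plain Python. The function names `det2` and `cramer_2x2` are invented for this example. The numbers are the absorbances of the determinant example in equation 6-5, paired with the concentrations 2.0 and 4.0 of the first two samples from the preceding chapter, so the system has no constant term.

```python
# Sketch of Cramer's rule for two equations in two unknowns (equations 6-13 to 6-15).
# det2 and cramer_2x2 are names invented for this illustration.

def det2(a11, a12, a21, a22):
    """Determinant of a 2 x 2 array, as in equation 6-4."""
    return a11 * a22 - a12 * a21

def cramer_2x2(a11, a12, a21, a22, c1, c2):
    """Return (beta1, beta2) for c1 = beta1*a11 + beta2*a12 and c2 = beta1*a21 + beta2*a22."""
    d = det2(a11, a12, a21, a22)            # determinant of coefficients, D
    if d == 0:
        raise ValueError("D is zero: the system has no unique solution")
    d_beta1 = det2(c1, a12, c2, a22)        # first column replaced by the constants
    d_beta2 = det2(a11, c1, a21, c2)        # second column replaced by the constants
    return d_beta1 / d, d_beta2 / d

D = det2(0.75, 0.28, 0.51, 0.485)
beta1, beta2 = cramer_2x2(0.75, 0.28, 0.51, 0.485, 2.0, 4.0)
print("D = %.3f, beta1 = %.3f, beta2 = %.4f" % (D, beta1, beta2))
# prints: D = 0.221, beta1 = -0.679, beta2 = 8.9613
```

The two coefficients agree with the values −0.679 and 8.9613 used in Table 5-3 for the case with no constant term.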

There are, of course, additional rules for solving larger equation systems. We will address this subject again in later chapters when we discuss multivariate calibration in greater depth.

REFERENCES

1. Workman, J., Jr. and Mark, H., Spectroscopy 9(1), 16–19 (1994).
2. Britton, J.R. and Bello, I., Topics in Contemporary Mathematics (Harper & Row, New York, 1984), pp. 445–451.


7 Matrix Algebra and Multiple Linear Regression: Part 4 – Concluding Remarks

Our discussions on MLR in previous chapters are all based on one important assumption: that the equation describing the relationship between the data does include a constant term. If Beer’s law is strictly followed, however, then when the concentration of all absorbing constituents is zero, the absorbance (at all wavelengths, no less) is also zero; that is, the equation describing the relationship between the data generates a line that passes through the origin. If this condition holds, then the constant term of the equation is also exactly zero, and may be dropped from the equation. It has been shown possible to generate a least-squares expression for this case also, that is, with the constant of the equation forced to be zero: it is merely necessary to formulate the prediction equation as:

\[ \mathrm{Conc} = b_1 A_1 + b_2 A_2 \tag{7-1} \]
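To make equation 7-1 concrete, the following is a minimal sketch of a least-squares fit with the constant forced to zero, written in plain Python. It assumes the three samples of the running example (absorbances 0.75/0.28, 0.51/0.485 and 0.32/0.78, with concentrations 2.0, 4.0 and 7.0) and solves the two normal equations directly; the variable names are illustrative and the code is not taken from the text.

```python
# Sketch: least-squares fit of Conc = b1*A1 + b2*A2 with the constant forced to zero.
# The three samples are those of the running example; variable names are illustrative.

A = [[0.75, 0.28],
     [0.51, 0.485],
     [0.32, 0.78]]
conc = [2.0, 4.0, 7.0]

# Elements of the normal equations (A'A) b = A'c for the two-coefficient model.
s11 = sum(a1 * a1 for a1, _ in A)
s12 = sum(a1 * a2 for a1, a2 in A)
s22 = sum(a2 * a2 for _, a2 in A)
t1 = sum(a1 * c for (a1, _), c in zip(A, conc))
t2 = sum(a2 * c for (_, a2), c in zip(A, conc))

# Solve the 2 x 2 system with determinants.
det = s11 * s22 - s12 * s12
b1 = (t1 * s22 - s12 * t2) / det
b2 = (s11 * t2 - s12 * t1) / det

predicted = [b1 * a1 + b2 * a2 for a1, a2 in A]
residuals = [p - c for p, c in zip(predicted, conc)]
print("b1 = %.4f, b2 = %.4f" % (b1, b2))
print("residuals:", ["%.3f" % r for r in residuals])
```

With three samples and only two coefficients the fit cannot pass through every point, so the residuals are small but not zero. This is the sense in which these data are overdetermined once the constant term is dropped.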

Starting from this expression, one can execute the derivation just as in the case of the full equation (i.e., the equation including the constant term), and arrive at a set of equations that result in the least-squares expression for an equation that passes through the origin. We will not dwell on this point since it is not common in practice. We will use this concept to fit the data presented, just to illustrate its use, and for the sake of comparison, ignoring the fact that without the constant term these data are overdetermined, while they are not overdetermined if the constant term is included – if we had more data (even only one more relationship) they would be overdetermined in both cases.

If we take our original set of data, as expressed in equations 5-5a–5-5c [1], and add one more relationship to them, we come up with the following situation:

\[ 2.0 = b_0 + b_1 (0.75) + b_2 (0.28) \tag{7-2a} \]

\[ 4.0 = b_0 + b_1 (0.51) + b_2 (0.485) \tag{7-2b} \]

\[ 7.0 = b_0 + b_1 (0.32) + b_2 (0.78) \tag{7-2c} \]

\[ 8.0 = b_0 + b_1 (0.40) + b_2 (0.79) \tag{7-2d} \]

We now have the situation we discussed earlier: we have four relationships among a set of data, and only three possible variables (even including the b0 term) that we can use to fit these data. We can solve any subset of three of these relationships, simply by leaving one of the four equations out of the solution. If we do that we come up with the following table of results (we forbear to show all the computations here; however, we do recommend to our readers that they do one or two of these, for the practice):

                             b0            b1          b2
Eliminating equation 7-1:    −9.47843      10.39215    16.86274
Eliminating equation 7-2:    −10.86455     10.15801    18.73589
Eliminating equation 7-3:    −5.20039      4.1461      14.6100
Eliminating equation 7-4:    −1.5777       0.78492     10.675

The last entry in this table, the results obtained from eliminating equation 7-4, of course represents the results obtained from the original set of three equations, since eliminating equation 7-4 from the set leaves us with exactly that same original set. However, even though there does not seem to be much difference between the various equations represented by equations 7-2a–7-2d, clearly the fitting equation depends very strongly upon which subset of these equations we choose to keep in our calculations. Thus we see that we cannot arbitrarily select any subset of the data to use in our computations; it is critical to keep all the data, to achieve the correct result, and that requires using the regression approach, as we discussed above. If we do that, then we find that the correct fitting equation is (again, this system of equations is simple enough to do for practice – the matrix inversion can be performed using the row operations as we described previously):

                       b0          b1         b2
Regression results:    −6.85119    6.15659    15.50951

Note, by the way, if you thought that the regression solution would simply be the average of all the other solutions, you were incorrect. With this chapter we will suspend our coverage of elementary matrix operations until a later chapter.

A WORD OF CAUTION

We have recently noticed a growing tendency for the chemical/spectroscopic community to draw the inference that the term “chemometrics” is virtually equivalent to “quantitative analysis algorithms”. This misconception seems to be due to the overwhelming concentration of interest in that aspect of the application of chemometric techniques. This perceived equivalency is, of course, incorrect. The purview of chemometrics is much wider than that single application area, and encompasses a wide variety of techniques, including algorithms not only for quantitative and qualitative chemical analysis, but also for analyzing, categorizing and generally dealing with data in a variety of ways (just look at the topic list included in the Analytical Chemistry reviews issue when Chemometrics is included). We ourselves have to plead guilty to some extent to promoting this misconception. While discussing and explaining the underlying concepts, we have also inherently spent much time and attention on that single topic, in much the same way that many other authors do.


However, we do recognize and wish to caution our readers to recognize the fact that Chemometrics does in fact include this variety of methodologies alluded to above. We do, in fact, hope to eventually discuss these other concepts. Two items prevent us from just jumping in chin first, however. The first item is that there are, in fact, useful and important things that need to be said about the application of the quantitative analysis algorithms. The second item is the fact that while we are knowledgeable concerning some of the other areas of chemometric interest, we are not and could not possibly be experts in all such areas. We have discussed this between ourselves, and have decided that the only reasonable way to deal with this limitation is to entertain submissions from our readership. Anyone who has particular expertise in a topic that falls under the wider definition of “chemometrics” is welcome to submit one (or more) chapters dealing with that topic. We only request that you try to keep your discussions both simple and complete, using, as we say, only words of one syllable or less.

REFERENCE

1. Workman, J., Jr. and Mark, H., Spectroscopy 9(1), 16–19 (1994).


8 Experimental Designs: Part 1

The next several chapters will deal with the philosophy of experimental designs. Experimental design is at the very heart of the scientific method; without proper design, it is well-nigh impossible to glean high-quality information from the experimental data collected. No amount of sophisticated processing or chemometrics can create information not present within the data.

Every scientist has designed experiments. So what is there left for us to say about that topic that chemometrics/statistics can shed some light on? Well, quite a bit actually, since not all experiments are designed equally, but some are definitely more equal than others (to steal a paraphrase). Another way to say it is that every experiment is a designed experiment, but some designs are better than others.

In point of fact, the sciences of statistics and chemometrics each have their own approach to how experiments should be designed, each with a view toward making experimental procedures “better” in some sense. There is a gradation between the two approaches; nevertheless there is also somewhat of a distinction between what might be thought of as classical “statistical experimental design” and the more currently fashionable experimental designs considered from a chemometric point of view. These differences in approach reflect differences in the nature of the information to be obtained from each.

Experimental designs, and in particular “statistical” experimental designs, are used in order to achieve one or more of the following goals:

1) Increase efficiency of resource use, that is, obtain the desired information using the fewest possible necessary experiments (this is usually what is thought of when “statistical experimental designs” are considered). This aspect of experimentation is particularly important when the experiment is large to begin with, or if the experiment uses resources that are rare or expensive, or if the experiment is destructive, so that materials (especially expensive ones) are used up.
2) Determine which variables or phenomena (“factors” in statistical/chemometric parlance) in an experiment are the “important” ones. This has two aspects: first, is an effect large enough that we can be sure it is real, and not due simply to noise (or error) alone (i.e., “statistically significant”)? We have treated this question to some extent in our previous chapters, and the book from it (both titled “Statistics in Spectroscopy”). The second aspect is, if the effect of a factor is indeed real, is it of sufficiently large magnitude to be of practical importance? While the answer to this question is important to understanding the outcome of the experiment, it is not a statistical question, and we will give it fairly short shrift.


3) Accommodate noise and/or other random error.
4) Allow estimates to be made of the magnitude of the noise and/or other random error, if for no other reason than to have something to compare our results against, so as to tell whether they are statistically significant.
5) Allow estimates to be made of the sensitivity to variations in the several factors. This can help decide whether any of the variations seen are of practical importance. A good design also allows these estimates of sensitivity to be made against an error background that is reduced compared to the actual error. This is accomplished by causing the effects of the factors to be effectively “averaged”, thus reducing the effect of error by the square root of the number of items being averaged.
6) Optimize some characteristic of the experimental system.

To achieve these goals, certain requirements are imposed on the design and/or the data to be collected. The maximum amount of information can be obtained when:

1) The standard requirements for the behavior of the errors are met, that is, the errors associated with the various measurements are random, independent, normally (i.e., Gaussian) distributed, and are a random sample from a (hypothetical, perhaps) population of similar errors that have a mean of zero and a variance equal to some finite value of sigma-squared.
2) The design is balanced. This requirement is critical for certain types of designs and unimportant in others. Balance, in the sense used here, means that each value of a given experimental variable (factor) occurs in combination with all of the values of every other factor. For example, common variables in chemical experimentation are temperature and pressure. For a balanced design, experiments should be carried out where the material is held at low temperature, and at both high and low pressure. Additionally, experiments should be carried out where the material is held at high temperature, and at both high and low pressure. If a third variable, such as concentration of a reactant, is to be studied, then high and low pressure and high and low temperature should coexist with both the high and the low concentrations.

The foregoing would seem to imply that a balanced experiment would require all possible combinations of conditions. While all-possible-combinations is certainly one way to achieve this balance, the advantage of “statistical” designs comes from the fact that clever ways have been devised to achieve balance while needing far fewer experiments than the all-possible-combinations approach would require.

As an illustration of this, let us consider the three aforementioned variables: temperature, pressure, and concentration of reactant. An all-possible-combinations design would require eight experiments, with the set of conditions in each experiment shown in Table 8-1 (where H and L represent the high and the low temperatures, pressures, etc.). However, to achieve balance, it is not necessary to carry out eight experiments; balance can be achieved with only four experiments with the conditions suitably set (Table 8-2). Check it out: High reactant concentration occurs in combination with each (high and low) temperature, and with each pressure; similarly for low reactant concentration.


Table 8-1 An all-possible-combinations design of three factors, needing eight experiments and sets of conditions

Experiment number   Temperature   Pressure   Concentration
1                   L             L          L
2                   L             L          H
3                   L             H          L
4                   L             H          H
5                   H             L          L
6                   H             L          H
7                   H             H          L
8                   H             H          H

Table 8-2 Balanced design for three factors, needing only four experiments

Experiment number   Temperature   Pressure   Concentration
1                   L             L          L
2                   L             H          H
3                   H             L          H
4                   H             H          L
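The “check it out” invitation can also be taken up by machine. The sketch below, in plain Python, tests a two-level design for balance in the sense defined above: for every pair of factors, every pairing of levels must occur, and equally often. The function name `is_balanced` and the deliberately spoiled three-run design are assumptions made for this illustration only.

```python
# Sketch: test whether a design is balanced in the sense used in the text, i.e. every
# level of each factor occurs together with every level of each other factor.
# is_balanced is a hypothetical helper name.
from itertools import combinations, product
from collections import Counter

def is_balanced(runs):
    n_factors = len(runs[0])
    for i, j in combinations(range(n_factors), 2):
        levels_i = {run[i] for run in runs}
        levels_j = {run[j] for run in runs}
        counts = Counter((run[i], run[j]) for run in runs)
        # Every pairing of levels must appear, and all pairings equally often.
        expected = set(product(levels_i, levels_j))
        if set(counts) != expected or len(set(counts.values())) != 1:
            return False
    return True

# Table 8-1: all eight combinations of (temperature, pressure, concentration).
full_design = list(product("LH", repeat=3))

# Table 8-2: the four-run design.
half_design = [("L", "L", "L"),
               ("L", "H", "H"),
               ("H", "L", "H"),
               ("H", "H", "L")]

# Dropping the last run of Table 8-2 destroys the balance.
broken_design = half_design[:3]

print(is_balanced(full_design), is_balanced(half_design), is_balanced(broken_design))
# prints: True True False
```

One way to see why the four-run design works: in Table 8-2 the concentration is H exactly when the temperature and pressure levels differ, so each pair of factors still runs through all four of its level combinations.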

You will find the same situation for the other variables. This is not to say that there are no benefits to the larger experimental design, but we are making the point that balance can be achieved with the smaller one, and for those designs where balance is an important consideration, much work (and resources, and MONEY) can be saved.

Balance is not always achievable in practice due to physical constraints on the measurements that can be made. Certain designs do not require balance, and in fact to enforce balance would negate some of the benefits of the design. In particular, there are some designs where future experiments to be performed are determined by the results of the past experiments. To enforce balance here would require extra, unnecessary experimentation that did not contribute to the main goal of the whole venture.

The various designs that have been generated can be classified into one of several categories. One way to classify experimental designs is as follows:

1) Classical designs
2) Screening designs
3) Analytical designs
4) Optimization designs.

In one sense, it is possible to think of the categories involved as “building blocks” for designs, which can then be combined in various ways which depend upon the information that you want to obtain which, in turn, determines the nature of the data to collect. These general categories, by the way, are not mutually exclusive. It is even possible to consider some types of designs as extensions of others, or, vice versa, as subsets, or special cases of other types of designs. Some of these main categories are

A) Factorial designs
B) Fractional factorial designs
C) Nested designs
D) Blocked designs
E) Response surface designs.

The key to all “statistical experimental” designs is planning. A properly planned experiment can achieve all the goals set forth above, and in fewer runs than you might expect (that’s where achieving the goal of efficiency comes in). However, there are certain requirements that must be met:

The experiment must be executed according to the plan! All the planning in the world is for naught if carrying out the experiment results in blunders (e.g., even something as crude as dropping a key sample on the floor – and look at how often that has been done!). The statistical literature contains examples (unfortunately) where large experiments, that cost millions of dollars to perform, were completely ruined by carelessness on the part of the personnel actually carrying them out.

As noted above, the variations in the data representing the error must meet the usual conditions for statistical validity: they must be random and statistically independent, and it is highly desirable that they be homoscedastic and Normally distributed. The data should be a representative sampling of the populations that the experiment is supposed to explore.

Blunders must be eliminated, and all specified data must be collected. The efficiency of these experimental designs has another side effect: any missing or defective data has a disproportionate effect relative to the amount of information that can be extracted from the final data set. When simpler experimental designs are used, where each piece of data is collected for the sole purpose of determining the effect of one variable, loss of that piece of data results in the loss of only that one result. When the more efficient “statistical” experimental designs are used, each piece of data contributes to more than one of the final results, thus each one is used the equivalent of many times and any missing piece of data causes the loss of all the results that are dependent upon it.

These types of experimental designs also have some limitations.
The first is the exaggeration of the effect of missing or defective data on the results, as mentioned above. The second is the fact that until the entire plan is carried out, little or no information can be obtained. There are generally few, if any, “intermediate results”; only after all the data is available can any results at all be calculated, and then all of them are calculated at once. This phenomenon is related to the first caveat: until each piece of data is collected, it is “missing” from the experiment, and therefore the results that depend upon it cannot be calculated.

The simplest possible experimental design would almost not be recognized as an “experimental design” at all, but does serve as a prototype situation (as we like to use for pedagogical purposes). The situation arises when there is one variable (factor) to investigate, and the question is, does this factor have an effect on the property studied? We have introduced this situation earlier, in our discussion of hypothesis testing, as in our previous Statistics in Spectroscopy book [1–3]. We will discuss how we treated this situation previously, then change our point of view to see how we would do it from the point of view of an “experimental design”.

REFERENCES

1. H. Mark, and J. Workman, “Statistics in Spectroscopy; Elementary Matrix Algebra and Multiple Linear Regression: Conclusion”, Spectroscopy 9(5), 22–23 (June, 1994).
2. H. Mark, and J. Workman, “Statistics in Spectroscopy”, Spectroscopy 4(7), 53–54 (1989).
3. H. Mark, and J. Workman, Statistics in Spectroscopy (Academic Press, Boston, 1991), chapter 18.


9 Experimental Designs: Part 2

As we have mentioned in the last chapter, “Experimental Design” often takes a form in scientific investigations such that some of the experimental objects have been exposed to one level of the variable, while others have not been so exposed. Oftentimes this situation is called the “experimental subject” versus the “control subject” type of experiment. In the face of experimental error, or other source of variability of the readings, both the “experimental” and the “control” readings would be taken multiple times. That provides the information about the “natural” variability of the system against which the difference between the two can be compared. Then, a t-test is used to see if the difference between the “experimental” and the “control” subjects is greater than can be accounted for by the inherent variability of the system. If it is, we conclude that the difference is “statistically significant”, and that there is a real effect due to the “treatment” applied to the experimental subject. Of course there are variations on this theme: the difference between the “experimental” and the “control” subjects can be due to different amounts of something applied to the two types of object, for example.

That is how we have treated this type of experiment previously. We will now consider a somewhat different way to formulate the same experiment; the purpose being to be able to set up the experimental design, and the analysis of the data, in such a way that it can be generalized to more complicated types of experiments. In order to do this, we recognize that the value of any individual reading, whether from the experimental subject or the control subject, can be expressed as the sum of three quantities. These three quantities arise from a careful consideration of the nature of the data. Given that a particular measurement belongs either to the experimental group or to the control group, then the value of the data collected can be expressed as the sum of these three quantities:

1) The grand mean of all the data (experimental + control)
2) The difference between the mean of the data group (experimental or control) and the grand mean of the data
3) The difference between the individual reading and the mean reading of its pertinent group.

This can then be expressed mathematically as:

\[ X_{ij} = \bar{X} + (\bar{X}_i - \bar{X}) + (X_{ij} - \bar{X}_i) \tag{9-1} \]


where
$X_{ij}$ represents each individual datum,
$\bar{X}_i$ represents the mean of the particular data group (experimental or control) that the individual datum belongs to, and
$\bar{X}$ represents the grand mean of all the data (from both groups).

By rearranging equation 9-1, we can also express it as follows, wherein the fact that it is a mathematical identity becomes apparent:

\[ X_{ij} = (\bar{X} - \bar{X}) + (\bar{X}_i - \bar{X}_i) + X_{ij} \tag{9-2} \]

We have previously shown that through the operation called “partitioning the sums of squares”, the following equality holds [1]:

\[ \sum_i X_i^2 = \sum_i \bar{X}^2 + \sum_i (X_i - \bar{X})^2 \tag{9-3} \]

Note that what we call the grand mean here is simply called the mean in the prior discussion. That is because in the prior discussion there was no further splitting of the data into subgroups. In the current discussion we have indeed split the data into subgroups; and we note that what was previously the total difference from the mean now consists of two contributions: the difference of each subgroup’s mean from the grand mean, and the difference of each datum’s value from its subgroup’s mean. We might expect, and it turns out to be so (again we leave the proof as an “exercise for the reader”), that the sum of squares of the differences of each datum’s value from the grand mean can also be partitioned; thus:

\[ \sum_{i,j} X_{ij}^2 = \sum_{i,j} \bar{X}^2 + \sum_{i,j} (\bar{X}_i - \bar{X})^2 + \sum_{i,j} (X_{ij} - \bar{X}_i)^2 \tag{9-4} \]

We had previously discussed the situation (from a slightly different point of view) where more than two subgroups of data existed. In that case we noted that we could generate two estimates of sigma, the within-group standard deviation. One estimate is calculated from the pooled within-group standard deviation. The other is calculated from the standard deviation between the means of the various subgroups. This quantity, you recall, is equal to the within-group standard deviation divided by the square root of n, the number of data used in the calculation of each subgroup’s mean. However, the second calculation is correct only if the differences between the means are due to the random variations of the data itself, and there are no external influences.
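A small numerical check may make the partition less abstract. The following sketch, in plain Python, uses two invented groups of five readings each (they are not data from the text) to verify equation 9-4, to form the between-group and pooled within-group mean squares on which the two estimates of sigma rest, and to compare their ratio with the square of the pooled two-sample t statistic.

```python
# Sketch: check the partition of equation 9-4 and form the two variance estimates.
# The readings below are invented purely for illustration.
from math import sqrt

groups = {
    "control":      [4.8, 5.1, 5.0, 4.9, 5.2],
    "experimental": [5.4, 5.6, 5.3, 5.7, 5.5],
}

all_data = [x for g in groups.values() for x in g]
n_total = len(all_data)
grand_mean = sum(all_data) / n_total
group_means = {k: sum(g) / len(g) for k, g in groups.items()}

# Left side of equation 9-4 and its three terms, each summed over every datum.
ss_total = sum(x * x for x in all_data)
ss_mean = n_total * grand_mean ** 2
ss_between = sum(len(g) * (group_means[k] - grand_mean) ** 2 for k, g in groups.items())
ss_within = sum((x - group_means[k]) ** 2 for k, g in groups.items() for x in g)

# Mean squares: between groups (one degree of freedom here) and pooled within groups.
df_between = len(groups) - 1
df_within = n_total - len(groups)
ms_between = ss_between / df_between
ms_within = ss_within / df_within
F = ms_between / ms_within

# Pooled two-sample t statistic for the same data.
(m1, n1), (m2, n2) = [(group_means[k], len(g)) for k, g in groups.items()]
t = (m2 - m1) / sqrt(ms_within * (1.0 / n1 + 1.0 / n2))

print("SS check:", abs(ss_total - (ss_mean + ss_between + ss_within)) < 1e-9)
print("F = %.3f, t squared = %.3f" % (F, t * t))
# prints: SS check: True
#         F = 25.000, t squared = 25.000
```

The square roots of `ms_within` and `ms_between` are the two estimates of sigma. For these made-up readings the between-group estimate is five times the within-group estimate, the kind of excess that points to an external influence acting on the data.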
If such influences exist, then the second calculation (from the between-group means) will estimate a larger value for sigma than the first calculation (the pooled within-group standard deviations). This was then used as the basis of a statistical hypothesis test: if the value of sigma calculated from the between-groups means is statistically significantly larger than the value of sigma calculated from with the groups, then we have evidence to conclude that there are indeed, external influences acting upon the data, and we used an F -test to determine whether there was more scatter between the means than could be accounted for by the random variations within the subgroups. In the case at hand, with only two subgroups, we can proceed the same way. The difference is that now, with only two subgroups, there is only one degree of freedom


available for the difference between the subgroups. No matter; an F-test with one degree of freedom is possible. Thus, to analyze the data from the model of equation 9-4, we calculate the mean square between the subgroups and the mean square within the subgroups, and perform an F-test (rather than a t-test as before) between these two mean squares. We would recommend doing it formally, with an ANOVA table, but this is the basic calculation. The conclusions drawn will be identical to those drawn by use of the t-test. Check it out: the tabled value of F for one and n degrees of freedom is equal to the square of the tabled value of t for n degrees of freedom. We might also note here, almost parenthetically, that if the hypothesis test gives a statistically significant result, it would be valid to calculate the sensitivity of the result to the difference between the two groups (i.e., divide the difference in the means of the two groups by the difference in the values of the variable that correspond to the “experimental” and “control” groups).

As an example of using an experimental design together with its associated analysis of variance to obtain a meaningful result, we have here an example based on some real data that we have collected. The problem was interesting: to troubleshoot a method of (wet) chemical analysis. A large quantity of sample was available, and had been well-ground and mixed. Suitable data was collected to permit performing a straightforward one-way analysis of variance. To start with, 5 g of sample was dissolved in 100 ml of water, and 20 repeat analyses were performed. The resulting values are shown in Table 9-1. The entry in the third row, second column was noted to have been measured under abnormal conditions. Since an assignable cause for this discrepant value was available, the reading was discarded. The statistics for the remaining data were Mean = 5.01, SD = 0.327. This value for the standard deviation was accepted as the best available approximation to the population value for $\sigma$.

The next step was to take several different aliquots from a large sample (a different sample than used previously) and collect multiple readings from each of them. Six aliquots were taken, one placed in each of six flasks, and six repeat measurements were made on each of these six flasks. Each aliquot consisted of 10 g of test sample/100 ml water. The results are shown in Table 9-2. The value for the pooled within-flask standard deviation, while somewhat higher than for the twenty repeat readings, is not so high as to be worrisome. Strictly speaking, we should have done an F-test between the variances from the two sets of results to see if there is any extra variance there, but we will ignore that question for now, because the important point here is the highly statistically significant value of the “between” flasks standard deviation, indicating that some extra source of variation was superimposed on the analytical value.

Table 9-1 Results from 20 repeat readings of 5 g of sample dissolved in 100 ml water

5.12    5.60    5.18    4.71
5.28    5.14    4.74    4.72
4.97    3.85    5.39    4.94
5.20    4.69    4.49    4.91
4.50    5.12    5.61    4.99
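The summary statistics quoted for Table 9-1 can be checked with a few lines of code. The following is a minimal Python sketch (not part of the original analysis); the variable names are our own, and we assume the discarded reading is the 3.85 entry in the third row, second column.

```python
import statistics

# Readings from Table 9-1, listed row by row (5 rows x 4 columns).
readings = [
    5.12, 5.60, 5.18, 4.71,
    5.28, 5.14, 4.74, 4.72,
    4.97, 3.85, 5.39, 4.94,
    5.20, 4.69, 4.49, 4.91,
    4.50, 5.12, 5.61, 4.99,
]

# Assumption: the discarded reading is row 3, column 2 (index 9 in row-major order).
discard_index = 2 * 4 + 1
retained = [r for i, r in enumerate(readings) if i != discard_index]

mean = statistics.mean(retained)
sd = statistics.stdev(retained)  # sample SD, n - 1 in the denominator

print(f"n = {len(retained)}, mean = {mean:.3f}, SD = {sd:.3f}")
```

The sample standard deviation (n − 1 in the denominator) is used here because the 19 retained readings serve as an estimate of the population value.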


Table 9-2 Results of repeat readings of six aliquots in six flasks (from 10-g samples)

Flask #       1        2        3        4        5        6

            7.25    10.07     5.96     7.10     5.74     4.74
            7.68     9.02     6.66     6.10     6.90     6.75
            7.76     9.51     5.87     6.27     6.29     6.71
            8.10    10.64     6.95     5.99     6.37     6.51
            7.50    10.27     6.54     6.32     5.99     5.95
            7.58     9.64     6.29     5.54     6.58     6.50

Means:      7.64     9.85     6.37     6.22     6.31     6.19
SDs:        0.28     0.58     0.42     0.51     0.41     0.77

Pooled SD = 0.52, “Between” SD = 1.46
Expected “Between” SD = 0.212
F = 47
F (crit) = F (0.95, 5, 30) = 2.53
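The summary lines under Table 9-2 follow directly from the readings. As a hedged illustration, the sketch below (our own Python, with variable names chosen for this example) recomputes the pooled within-flask SD, the SD between the flask means, the “between” SD expected from random variation alone, and the resulting F ratio. It assumes equal numbers of readings per flask, as is the case here.

```python
import math
import statistics

# Table 9-2: six repeat readings for each of six flasks.
flasks = [
    [7.25, 7.68, 7.76, 8.10, 7.50, 7.58],
    [10.07, 9.02, 9.51, 10.64, 10.27, 9.64],
    [5.96, 6.66, 5.87, 6.95, 6.54, 6.29],
    [7.10, 6.10, 6.27, 5.99, 6.32, 5.54],
    [5.74, 6.90, 6.29, 6.37, 5.99, 6.58],
    [4.74, 6.75, 6.71, 6.51, 5.95, 6.50],
]
n = len(flasks[0])  # readings per flask

# Pooled within-flask SD: with equal group sizes, the root of the mean variance.
pooled_sd = math.sqrt(statistics.mean([statistics.variance(f) for f in flasks]))

# SD of the flask means ("between" SD).
between_sd = statistics.stdev([statistics.mean(f) for f in flasks])

# "Between" SD expected if only the within-flask random variation were acting.
expected_between_sd = pooled_sd / math.sqrt(n)

# F ratio: observed between-means variance over the expected one.
f_ratio = (between_sd / expected_between_sd) ** 2

print(f"pooled SD = {pooled_sd:.2f}, between SD = {between_sd:.2f}")
print(f"expected between SD = {expected_between_sd:.3f}, F = {f_ratio:.0f}")
```

The critical value F(0.95, 5, 30) = 2.53 is taken from the table above rather than computed, so the sketch needs nothing beyond the standard library.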

Having found a statistically significant “between” flasks standard deviation, the next step was to formulate hypotheses as to the possible physical causes of this situation. The list we arrived at was the following:

• Inhomogeneous sample
• Drift between sets of readings
• Sampling error
• Something else.

The first physical cause considered was the possibility of an inhomogeneous sample. To eliminate this as a possibility, the sample was ground before aliquots were taken. The sample size was still 10 g of sample per 100 ml of water. In this case, however, time constraints permitted only three replicate readings per flask. The results are shown in Table 9-3.

Table 9-3 Results of repeat readings of six aliquots in six flasks (from 10-g samples ground)

            6.57     5.06     8.07     4.93     4.78     6.23
            6.27     6.27     7.82     5.64     5.50     7.37
            6.35     5.88     8.52     5.19     5.99     5.27

Means:      6.39     5.74     8.19     5.25     5.43     7.29
SDs:        0.16     0.61     0.35     0.36     0.61     1.01

Pooled SD = 0.58, “Between” SD = 1.14
Expected “Between” SD = 0.33
F = 11.3
F (crit) = F (0.95, 5, 12) = 3.10

We note that there is still a much larger difference between the different flasks’ readings than can be accounted for by the within-flask repeatability. Therefore we press onward to consider another possible cause of the variation; in this case we consider the possibility of inhomogeneity of the sample, at a scale not affected by grinding. For example, the sample might contain small specks of material that are too small to be ground further, but which are large enough to measurably affect the analysis. In this case, the expected distribution of the sampling variation of such particles would be the Poisson distribution [2]. In such a case, if we take a larger sample, we would expect the standard deviation to decrease in inverse proportion to the square root of the sample size. Thus, if we take samples ten times larger than previously, the standard deviation of the “between” readings should become approximately one-third of the previous value (1/√10 ≈ 0.32). Therefore, for the next test, 100 g samples each were dissolved in 1 liter of water. The results are shown in Table 9-4.

Table 9-4 Results from using 10 × larger (100-gram) samples

            8.29     8.61    10.04     8.86
            8.12     8.72    11.67     9.02
            8.72     8.42    11.38     9.29
            8.54     8.76    10.19     8.63

Means:      8.42     8.63    10.82     8.94
SDs:        0.26     0.15     0.82     0.26

Pooled SD = 0.46, “Between” SD = 1.10
Expected “Between” SD = 0.23
F = 23
F (crit) = F (0.95, 3, 12) = 3.49

Note that the “between” standard deviation is almost identical to the previous value; we conclude that inhomogeneity of the sample is not the problem. The possibility of drift between sets of readings was ruled out by virtue of the fact that many of the steps of the analytical procedure were done simultaneously on the several readings of the different aliquots. The possibility of drift between readings was ruled out by repeating the readings in different orders; the same values were obtained regardless of the order of reading.

This left “something else” as the possible cause of the variability. When we considered the nature of the test, which was sensitive to parts per million of organic materials, we realized that one possibility was contamination of the glassware by the soap used to clean it. We next cleaned all glassware with chromic acid cleaning solution, and reran the tests, with the result as shown in Table 9-5. Removal of the extraneous source of variability did indeed reduce the “between-flasks” variance to a level that is now explainable (in the statistical sense) by the underlying random variations attributable to the within-flask variability.

Table 9-5 Results after cleaning glassware with chromic acid

            4.65     5.98     5.19     4.97     4.62     3.93
            5.03     4.61     3.96     4.43     4.94     4.60
            4.38     4.49     4.92     4.79     3.37     5.95

Means:      4.68     5.16     4.69     4.73     4.31     4.84
SDs:        0.33     0.73     0.64     0.27     0.83     1.03

Pooled SD = 0.69, “Between” SD = 0.27
Expected “Between” SD = 0.39
F = 0.47
F (crit) = F (0.95, 5, 12) = 3.10


Table 9-6 Types of experimental designs

Number of      Number of factors
levels         Single                          Multiple

Two            Experimental versus control     One-at-a-time designs
               subjects                        Factorial designs
                                               Fractional factorial designs
                                               Nested designs
                                               Special designs

Multiple       Sensitivity testing             Response surface designs
               Simple regression               Multiple regression

End of example.

From the prototype experiment, we can generate many variations of the basic scheme. The two main ways that the model shown in equation 9-4 can be varied are to increase the number of factors and to increase the number of levels of each factor. A given factor must have at least two levels (even if one of the levels is an implied zero), and may have any number greater than two. Table 9-6 lists the types of designs that fall into each of these categories. The types of designs used by scientists in simple settings, not usually considered “statistical” designs, are the “experimental versus control” designs (discussed above), the one-at-a-time designs (where each factor is individually changed from its “control” value to its “experimental” value, then restored when the next factor is changed), and the simple regression (often used in calibration work when only one physical variable is affected; in chemistry, electrochemical and chromatographic applications come to mind).

The table is not exhaustive, although it does include a majority of the experimental designs that are used. One-at-a-time designs are the usual “non-statistical” type of experiments that are often carried out by scientists in all disciplines. Not included explicitly, however, are experimental designs that are generated from combinations of listed items. For example, a multi-factor experiment may have several levels of some of the factors but only two levels of other factors. Also, due to the nature of the physical factors involved, the values of some of the factors may not be under the experimenter’s control. Thus, some factors may be nested, while others may not be.

REFERENCES

1. Mark, H. and Workman, J., Statistics in Spectroscopy (Academic Press, Boston, 1991), pp. 80–81.
2. Mark, H. and Workman, J., Spectroscopy 5(3), 55–56 (1991).

10

Experimental Designs: Part 3

We continue with this chapter specifically dealing with experimental design issues. When we leave the realm of the simplest designs, we find that the experiments, and the analyses of the data therefrom, acquire characteristics not existing in the simpler designs, and beyond obvious extensions of them. For example, consider a two-factor design with each factor at two levels. This is also a form of all-possible-combinations experiment. One item we note here is that there is more than one way to describe the form of an experiment, and we include a short digression here to explicate this multiplicity of ways of describing an experiment. In this particular case, we have two factors, each at two levels. We can describe it as a listing of values corresponding to each experiment (Table 10-1). Alternatively, we can describe it as the experiment number that will correspond to each set of combinations of factors (Table 10-2).

Whichever way we choose to describe the design, it (and the others of this type) has some attractive features. We will illustrate these features with a numerical example. For our example, we will imagine an experiment where the scientist is interested in determining the influence of solvent and of catalyst on the yield of a chemical reaction. The questions to be answered are: does the choice of solvent make a difference, and does the type of catalyst make a difference? The experiment is to consist of trying each of the four available catalysts with each of three solvents, and determining the yield. The experiment can be described by Table 10-3. In a more complicated case, where the factor was a physical variable such as temperature, which can be assigned meaningful physical values, and the sensitivity of the yield to temperature was of concern, we would then need to maintain (or control) the information regarding the actual temperatures.

For our first look at this experiment we will examine the behavior of the experiment under two sets of conditions. The first scenario gives a set of conditions with the results obtained under the following assumptions:

1) There is no influence of solvent
2) None of the catalysts have an effect
3) There are no random influences on the experiment.

The second scenario has similar conditions, but with one change:

1) There is no influence of solvent
2) One of the four catalysts has an effect
3) There are no random influences on the experiment.


Table 10-1 All-possible-combinations experiment organized as a list of values

Experiment number    Factor #1    Factor #2
1                    L            L
2                    L            H
3                    H            L
4                    H            H

Table 10-2 All-possible-combinations experiment organized as a table where the body of the table contains the experiment number corresponding to each set of experimental conditions

                     Factor #2
                     L      H
Factor #1     L      1      2
              H      3      4

Table 10-3 Conditions for the experiment consisting of determining the yield of a chemical reaction with different solvents and catalysts (the body of the table contains the experiment number)

Catalyst number    Solvent #1    Solvent #2    Solvent #3
1                  1             2             3
2                  4             5             6
3                  7             8             9
4                  10            11            12

In both scenarios, Conditions 1 and 2 together determine the pattern of the results: in the first scenario all results from the experiment will be the same, and in the second scenario all results will be the same except the ones corresponding to the “effective” catalyst, which will differ. Condition 3 means that we do not need to use any statistical or chemometric considerations to help explain the results. However, for pedagogical purposes we will examine this experiment as though random error were present, in order to be able to compare the analyses we obtain in the presence and in the absence of random effects. The data from these two scenarios might look like that shown in Table 10-4.

Table 10-4 Hypothetical data under two different scenarios, for the experiment examining the effect of solvent and catalyst on yield; with no random variations affecting the data

                     First scenario          Second scenario
                     Solvent number          Solvent number
Catalyst number      1     2     3           1     2     3
1                    25    25    25          25    25    25
2                    25    25    25          25    25    25
3                    25    25    25          35    35    35
4                    25    25    25          25    25    25

For each scenario, the statistical analysis of this type of experimental design would be a two-way analysis of variance. This is predicated on the construction of the experiment, which includes some implicit assumptions. These assumptions are

1) The influence of the factors changing between the rows is independent of the influence of the factors changing between the columns.
2) The influence of the factors changing between the columns is independent of the influence of the factors changing between the rows.
3) Any error (in these first two scenarios assumed zero) is random, has a mean value of zero, and is Normally distributed.

If these assumptions hold, then each quantity in the data table can be expressed as the sum of the following four factors:

1) The grand mean of all the data
2) The influence of the value of the factor corresponding to each row
3) The influence of the value of the factor corresponding to each column
4) The variation superimposed by any random phenomena affecting the data.

This being the case, quantities computed for a two-way analysis of variance are the following:

1) The grand mean of all the data
2) The mean of each row, and the difference of each row mean from the grand mean (this estimates the influence of the values of the factor corresponding to the rows)
3) The mean of each column, and the difference of each column mean from the grand mean (this estimates the influence of the values of the factor corresponding to the columns)
4) Any difference between the actual data and the corresponding values calculated from the grand mean and the influences of the row and column factors (this estimates the error variability).

In Table 10-5, we present the standard representation of this breakdown of the data. There are two important points to note about the results in this table. First, the data, shown in the body of the table in Part A, is in fact equal to the sum of the following quantities:

1) the grand mean (shown in Part A)
2) + row differences from the grand mean (shown in Part B)
3) + column differences from the grand mean (shown in Part B)
4) + residuals (shown in the body of Part B).

The second point is that the residuals, representing the error portion of the data, are all zero; the data is accounted for entirely by the systematic variations due to the variations between the rows and the variations between the columns (of course, the column differences happen to be zero in this data).

Table 10-5 Part A – ANOVA for the errorless data from Table 10-4

                     First scenario                  Second scenario
                     Solvent number       Row        Solvent number         Row
Catalyst number      1     2     3        means      1      2      3        means
1                    25    25    25       25         25     25     25       25
2                    25    25    25       25         25     25     25       25
3                    25    25    25       25         35     35     35       35
4                    25    25    25       25         25     25     25       25
Col. means:          25    25    25       25∗        27.5   27.5   27.5     27.5∗

∗ Grand mean

Table 10-5 Part B – RESIDUALS for ANOVA from Table 10-4 after correcting for row and column means

                               First scenario                 Second scenario
                               Solvent number      Row        Solvent number      Row
Catalyst number                1     2     3       diffs      1     2     3       diffs
1                              0     0     0       0          0     0     0       −2.5
2                              0     0     0       0          0     0     0       −2.5
3                              0     0     0       0          0     0     0       7.5
4                              0     0     0       0          0     0     0       −2.5
Mean diff. from grand mean:    0     0     0                  0     0     0

Now the really interesting stuff happens when we do in fact have error in the data. Let us look at what happens to these two scenarios when there is a small amount of random error variability superimposed on the data. Now the experimental conditions for the two scenarios are as follows:

Scenario #3:

1) There is no influence of solvent
2) None of the catalysts have an effect
3) There is a random error superimposed on the experiment.


Scenario #4:

1) There is no influence of solvent
2) One of the four catalysts has an effect
3) The same random error exists as in Scenario #3.

For these two situations, let us suppose each error has the value shown in Table 10-6 for the corresponding datum. The values in Table 10-6 were selected randomly, and have a mean of zero and a standard deviation of unity. (Comparing Table 10-6 with Table 10-7 shows that the first column of errors was applied to solvent 2 and the second column to solvent 1.)

Table 10-6 For Scenarios 3 and 4 each error has the following value for the corresponding datum

−0.3583     0.8416     0.5416
−0.9583    −1.2583     1.4416
 0.0416    −1.3583     1.4416
−1.0583     0.4416     0.2416

When these error values are superimposed on the data, we arrive at Table 10-7.

Table 10-7 Hypothetical data under two different scenarios, for the experiment examining the effect of solvent and catalyst on yield; random variations (from Table 10-6) have zero mean and unity standard deviation

                     Third scenario                     Fourth scenario
                     Solvent number                     Solvent number
Catalyst number      1          2          3            1          2          3
1                    25.8416    24.6416    25.5416      25.8416    24.6416    25.5416
2                    23.7416    24.0416    26.4416      23.7416    24.0416    26.4416
3                    23.6416    25.0416    26.4416      33.6416    35.0416    36.4416
4                    25.4416    23.9416    25.2416      25.4416    23.9416    25.2416

When we subject this data to the same ANOVA calculations as the errorless data, we arrive at the results shown in Table 10-8.

Table 10-8 Part A – DATA: ANOVA for the hypothetical data containing error with mean equal 0 and standard deviation (S) equal to unity

Third scenario
                     Solvent number
Catalyst number      1          2          3            Row means
1                    25.8416    24.6416    25.5416      25.3416
2                    23.7416    24.0416    26.4416      24.7416
3                    23.6416    25.0416    26.4416      25.0416
4                    25.4416    23.9416    25.2416      24.875
Col. means:          24.6666    24.4166    25.9166      25∗

Fourth scenario
                     Solvent number
Catalyst number      1          2          3            Row means
1                    25.8416    24.6416    25.5416      25.3416
2                    23.7416    24.0416    26.4416      24.7416
3                    33.6416    35.0416    36.4416      35.0416
4                    25.4416    23.9416    25.2416      24.875
Col. means:          27.1666    26.9166    28.4166      27.5∗

∗ Grand mean

Table 10-8 Part B – RESIDUALS for the hypothetical data containing error with mean equal 0 and standard deviation (S) equal to unity

Third scenario
                              Solvent number
Catalyst number               1          2          3            Row diff. from grand mean
1                              0.8333    −0.1166    −0.7166       0.3416
2                             −0.6666    −0.1166     0.7833      −0.2583
3                             −1.0666     0.5833     0.4833       0.0416
4                              0.9       −0.35      −0.55        −0.125
Col. diff. from grand mean:   −0.3333    −0.5833     0.9166

Fourth scenario
                              Solvent number
Catalyst number               1          2          3            Row diff. from grand mean
1                              0.8333    −0.1166    −0.7166      −2.1583
2                             −0.6666    −0.1166     0.7833      −2.7583
3                             −1.0666     0.5833     0.4833       7.5416
4                              0.9       −0.35      −0.55        −2.625
Col. diff. from grand mean:   −0.3333    −0.5833     0.9166

It is instructive to compare the values in these tables with the corresponding values in the ANOVA tables for the errorless data. In particular, note that in the table corresponding to Scenario 3, even though there are no underlying systematic variations in the data, both the row and the column means are perturbed by the random variations superimposed on the data. How, then, can we differentiate these differences from the ones due to real systematic variations such as are present in Scenario 4? The answer, of course, is to do a statistical hypothesis test, but as it stands, we do not seem to have enough information available for such a test. We can compute variances between rows and also between columns, in order to have the mean squares for the corresponding differences, but what are we going to compare these mean squares to? In particular, what are we going to use to represent the error, to see if the row mean squares or the column mean squares are larger than can be accounted for by the error of the data?

The answer to this question is in the residuals. While the residuals might not seem to bear any relationship to either the original data or the errors (which in this case we know because we created them and they are listed above), in fact the residuals contain the variance present in the errors of the original data. However, the value of the error sum of squares is reduced from that of the original data, because of the subtraction of some fraction of the error variation from the total when the row and column means were subtracted from the data itself. This reduction in the sum of squares can be compensated for by making a corresponding compensation in the degrees of freedom used to calculate the mean square from the sum of squares. In this data the sum of squares of the residuals is 5.24 (check it out). The number of degrees of freedom in these residuals is calculated by starting with the total (which is twelve, one from each piece of data in the experiment) and subtracting one degree of freedom for each independent quantity calculated from and subtracted from the data. What are these? Well, there is one grand mean, four row means, and three column means. Because the row differences from the grand mean must sum to zero, as must the column differences, only r − 1 of the row means and c − 1 of the column means are independent, so the number of degrees of freedom lost = 1 + (r − 1) + (c − 1) = 1 + 3 + 2 = 6. Thus there is a loss of six degrees of freedom from the twelve, leaving (r − 1)(c − 1) = (4 − 1)(3 − 1) = 6 for the residuals. The mean square for the residuals is thus 5.24/6, or 0.873, and as a check, the square root of that value, 0.934, is an estimate of the standard deviation of the error (which we know is unity).
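For readers who want to “check it out”, the decomposition described above can be reproduced in a few lines. The following is a minimal Python sketch under our own naming (`row_effects`, `col_effects` and so on are illustrative, not the authors’ notation); it uses the Scenario 3 data as listed in Table 10-7 and computes the grand mean, the row and column differences, the residuals, and the residual mean square.

```python
import math

# Scenario 3 data (Table 10-7): rows are catalysts 1-4, columns are solvents 1-3.
data = [
    [25.8416, 24.6416, 25.5416],
    [23.7416, 24.0416, 26.4416],
    [23.6416, 25.0416, 26.4416],
    [25.4416, 23.9416, 25.2416],
]
r, c = len(data), len(data[0])

grand_mean = sum(sum(row) for row in data) / (r * c)

# Differences of the row and column means from the grand mean.
row_effects = [sum(row) / c - grand_mean for row in data]
col_effects = [sum(data[i][j] for i in range(r)) / r - grand_mean for j in range(c)]

# Residual = datum - grand mean - row difference - column difference.
residuals = [
    [data[i][j] - grand_mean - row_effects[i] - col_effects[j] for j in range(c)]
    for i in range(r)
]

ss_resid = sum(e * e for row in residuals for e in row)
df_resid = (r - 1) * (c - 1)   # 12 data - 1 - (r - 1) - (c - 1)
ms_resid = ss_resid / df_resid

print(f"residual SS = {ss_resid:.2f}, df = {df_resid}, "
      f"MS = {ms_resid:.3f}, root MS = {math.sqrt(ms_resid):.3f}")
```

Replacing the third row with the Scenario 4 values changes the row differences but leaves the residuals, and hence the error estimate, unchanged, which is the point of the comparison between the two scenarios.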

11 Analytic Geometry: Part 1 – The Basics in Two and Three Dimensions

Analytic geometry is a branch of mathematics in which geometry is described through the use of algebra. René Descartes (1596–1650) is credited with conceptualizing this mathematical discipline. Recalling the basics, we can express the points of a plane as a pair of numbers with x-axis and y-axis coordinates, designated by (x, y). Note that the x-axis coordinate is termed the “abscissa”, and the y-axis coordinate the “ordinate”.

THE DISTANCE FORMULA

In two dimensions (x and y), the distance between two points $(x_1, y_1)$ and $(x_2, y_2)$ in two-dimensional space (as shown in Figure 11-1) is given by the Pythagorean theorem as

$$D^2 = (x_2 - x_1)^2 + (y_2 - y_1)^2 = |x_2 - x_1|^2 + |y_2 - y_1|^2 \qquad \text{(11-1)}$$

and

$$D = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \qquad \text{(11-2)}$$

Note: This relationship holds even when $x_1$ or $y_1$ or both are negative (also shown in Figure 11-1).

In three dimensions (x, y, z), we describe three lines at right angles to one another, designated as the x, y, z axes. Three planes are represented as xy, yz, and zx, and the distance between two points $(x_1, y_1, z_1)$ and $(x_2, y_2, z_2)$ is given by

$$D^2 = (x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2 = |x_2 - x_1|^2 + |y_2 - y_1|^2 + |z_2 - z_1|^2 \qquad \text{(11-3)}$$

and

$$D = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2} \qquad \text{(11-4)}$$

Figure 11-1 The distance between two points, $(x_1, y_1)$ and $(x_2, y_2)$, in a two-dimensional coordinate space is determined using the Pythagorean theorem.

DIRECTION NOTATION

For two-dimensional problems, given a line with respect to two axes x and y, there is a set of angles α and β that are designated as the x-direction angle and y-direction angle, respectively. Thus, as illustrated by using Figures 11-2a and 11-2b, a clearly defined line segment can be described given the angles α and β on the coordinate axes x and y. The only restriction that applies here is that both angles α and β must be ≥ 0° and ≤ 180°.

THE COSINE FUNCTION

The cosine function applied to Figures 11-2a and 11-2b is given as

$$\cos\alpha = \frac{x_2 - x_1}{d} \qquad \text{(11-5a)}$$

and

$$\cos\beta = \frac{y_2 - y_1}{d} \qquad \text{(11-5b)}$$

where

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \qquad \text{(11-6)}$$

Figure 11-2 Two illustrations, (a) and (b), of the x-direction angle (α) and y-direction angle (β) for a two-dimensional coordinate system.

Note that cos α and cos β are referred to as the direction cosines of the line segment described. To summarize in expanded notation:

$$\cos\alpha = \frac{x_2 - x_1}{\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}} \qquad \text{(11-7a)}$$

and

$$\cos\beta = \frac{y_2 - y_1}{\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}} \qquad \text{(11-7b)}$$

Example: Find the direction cosines and corresponding angles for a line segment AB, where A is (3, 5) and B is (2, 7); check your work using $\cos^2\alpha + \cos^2\beta = 1.0$, and draw a graphic of the line segment (Figure 11-3). The solution proceeds as follows:

$$x_2 - x_1 = 2 - 3 = -1 \qquad \text{(11-8a)}$$

and

$$y_2 - y_1 = 7 - 5 = 2 \qquad \text{(11-8b)}$$

Therefore, the distance (d) is given by

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \qquad \text{(11-9a)}$$

$$d = \sqrt{(-1)^2 + 2^2} = \sqrt{5} \qquad \text{(11-9b)}$$

From the formulas above, we can determine that

$$\cos\alpha = -1/\sqrt{5}$$

and the corresponding angle is given as

$$\alpha = \cos^{-1}(-1/\sqrt{5}) = 116.57^\circ$$

We also know that

$$\cos\beta = 2/\sqrt{5}$$

therefore the angle is given by

$$\beta = \cos^{-1}(2/\sqrt{5}) = 26.57^\circ$$

Checking our work using the formula $\cos^2\alpha + \cos^2\beta = 1.0$, we find that

$$\cos^2(116.57^\circ) + \cos^2(26.57^\circ) = 0.20 + 0.80 = 1.0$$

Figure 11-3 The x-direction angle and y-direction angle for a line segment, where A is (3, 5) and B is (2, 7) (see example in text).

DIRECTION IN 3-D SPACE

To continue our discussion of direction angles, we will use the same nomenclature: x, designated by α; y, designated by β; and z, newly designated by γ. We can determine the cosine of any direction angle, given the corresponding x, y, z coordinates for designated points in space, as:

$$\cos\alpha = (x_2 - x_1)/d \qquad \text{(11-10a)}$$

and

$$\cos\beta = (y_2 - y_1)/d \qquad \text{(11-10b)}$$

and

$$\cos\gamma = (z_2 - z_1)/d \qquad \text{(11-10c)}$$

where

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2} \qquad \text{(11-11)}$$

It follows algebraically that

$$\cos^2\alpha + \cos^2\beta + \cos^2\gamma = 1.0 \qquad \text{(11-12)}$$

Example: Find the direction cosines and corresponding angles for a line segment AB where A is (2, −1, 4) and B is (4, 1, 2). To solve, use

$$x_2 - x_1 = 4 - 2 = 2$$

and

$$y_2 - y_1 = 1 - (-1) = 2$$

and

$$z_2 - z_1 = 2 - 4 = -2$$

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$$

$$d = \sqrt{2^2 + 2^2 + (-2)^2} = \sqrt{12} = 3.46$$

and

$$\cos\alpha = 2/3.46 = 0.577$$

$$\cos\beta = 2/3.46 = 0.577$$

$$\cos\gamma = -2/3.46 = -0.577$$

To find the direction angles corresponding to the above we use

$$\alpha = \cos^{-1}(0.577) = 54.76^\circ$$

$$\beta = \cos^{-1}(0.577) = 54.76^\circ$$

$$\gamma = \cos^{-1}(-0.577) = 125.23^\circ$$

Checking the calculations, we use $\cos^2\alpha + \cos^2\beta + \cos^2\gamma = 1.0$, or

$$0.333 + 0.333 + 0.333 = 1.00$$
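Both worked examples (the two-dimensional segment from (3, 5) to (2, 7) and the three-dimensional segment from (2, −1, 4) to (4, 1, 2)) can be checked with one small routine. The following is a minimal Python sketch; the function name `direction_cosines` is our own choice for illustration. Because the code carries full precision, its angles can differ in the second decimal place from values obtained with cosines rounded to three figures.

```python
import math

def direction_cosines(a, b):
    """Return (cosines, angles in degrees) for the segment from point a to point b."""
    diffs = [q - p for p, q in zip(a, b)]
    d = math.sqrt(sum(t * t for t in diffs))
    if d == 0:
        raise ValueError("the two points coincide; no direction is defined")
    cosines = [t / d for t in diffs]
    angles = [math.degrees(math.acos(cv)) for cv in cosines]
    return cosines, angles

# Two-dimensional example: A = (3, 5), B = (2, 7).
cos_2d, ang_2d = direction_cosines((3, 5), (2, 7))

# Three-dimensional example: A = (2, -1, 4), B = (4, 1, 2).
cos_3d, ang_3d = direction_cosines((2, -1, 4), (4, 1, 2))

for label, cosines, angles in (("2-D", cos_2d, ang_2d), ("3-D", cos_3d, ang_3d)):
    check = sum(cv * cv for cv in cosines)  # squared cosines should sum to 1
    print(label,
          [round(cv, 3) for cv in cosines],
          [round(an, 2) for an in angles],
          round(check, 6))
```

The same function works in any number of dimensions, since the distance formula and the direction cosines extend term by term.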

DEFINING SLOPE IN TWO DIMENSIONS

The slope m of a line segment between two points is given as:

$$m = (y_2 - y_1)/(x_2 - x_1) = \tan\theta \qquad \text{(11-13)}$$

where θ is the x-direction angle and 0° ≤ θ < 360°. In other words, the slope equals the tangent of the x-direction angle for the line segment defined by the two points on the line. Thus the slope of the line given in Figure 11-4 is tan(120°) = −1.73. Just store this information away for the next several chapters as we build a pre-chemometrics view of analytic geometry.

Figure 11-4 Illustration of the slope of a line given an x-direction angle of θ = 120°.
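The equivalence between the two-point slope formula and the tangent of the x-direction angle can be verified numerically. This is a small Python sketch with names of our own choosing; it builds a second point one unit from the origin along the 120° direction and compares the two expressions.

```python
import math

def slope(p1, p2):
    """Slope of the line through p1 and p2, as in equation 11-13."""
    (x1, y1), (x2, y2) = p1, p2
    if x2 == x1:
        raise ValueError("vertical line: slope is undefined")
    return (y2 - y1) / (x2 - x1)

theta = math.radians(120.0)               # x-direction angle from Figure 11-4
p1 = (0.0, 0.0)
p2 = (math.cos(theta), math.sin(theta))   # one unit from p1 along that direction

m = slope(p1, p2)
print(f"slope from points = {m:.3f}, tan(theta) = {math.tan(theta):.3f}")
```

A vertical segment (x-direction angle of 90° or 270°) has no defined slope, which is why the sketch raises an error in that case.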

RECOMMENDED READING

We recommend a standard text on vector analytic geometry. One good example is

1. White, P.A., Vector Analytic Geometry (Dickenson, Belmont, CA, 1966).

12

Analytic Geometry: Part 2 – Geometric Representation of Vectors and Algebraic Operations

We continue with our pre-chemometrics review of analytic geometry, noting that a “vector” can in all cases be represented by a matrix of r × c dimensions, where r is the number of rows and c is the number of columns. The operations defined below will be employed in future discussions.

VECTOR MULTIPLICATION (SCALAR × VECTOR)

If M represents a vector with components (or elements) $(M_x, M_y)$, then sM (where s is a real number, also termed a “scalar”) is defined as the vector represented by $(sM_x, sM_y)$; and the length of sM is |s| times the length of M. One can relate the direction angles of M to those of sM as follows. For the case where s > 0 (s is a positive, real number),

$$\cos\alpha_{sM} = \cos\alpha_M \qquad \text{(12-1a)}$$

and

$$\cos\beta_{sM} = \cos\beta_M \qquad \text{(12-1b)}$$

So the vectors sM and M have exactly the same direction. For the case where s < 0 (where s is a negative, real number),

$$\cos\alpha_{sM} = -\cos\alpha_M \qquad \text{(12-1c)}$$

and

$$\cos\beta_{sM} = -\cos\beta_M \qquad \text{(12-1d)}$$

In this case, the vectors sM and M have exactly opposite directions. (Note: When s = 0, sM is the zero vector, for which no direction is defined.)

Example problem. If M = (1, 5), then 2M (where s = 2) = (2 × 1, 2 × 5) = (2, 10), represented in Figure 12-1 as the line segment from point (0, 0) to (2, 10). (Note: The expression −2M = (−2, −10) is represented by the line segment from point (0, 0) to (−2, −10).)

Figure 12-1 An example of scalar × vector multiplication: if M = (1, 5), then 2M = (2, 10) and −2M = (−2, −10).

VECTOR DIVISION (VECTOR ÷ SCALAR)

Vector division is represented as vector multiplication by using a fractional multiplier term. For example, with M = (1, 5), if s = 1/2, then sM = (0.5, 2.5); if s = −1/2, then sM = (−0.5, −2.5), and so forth.
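The scalar operations above are easy to verify numerically. The following is a minimal Python sketch using plain tuples; the helper names `scale`, `length` and `direction_cosines` are our own and are not taken from any library. It checks that scaling by a negative number reverses the sign of each direction cosine and that the length scales by the absolute value of s.

```python
import math

def scale(s, v):
    """Multiply every component of vector v by the scalar s."""
    return tuple(s * comp for comp in v)

def length(v):
    return math.sqrt(sum(comp * comp for comp in v))

def direction_cosines(v):
    d = length(v)
    if d == 0:
        raise ValueError("the zero vector has no defined direction")
    return tuple(comp / d for comp in v)

M = (1, 5)
two_M = scale(2, M)          # scalar x vector with s = 2
neg_two_M = scale(-2, M)     # s = -2 reverses the direction
half_M = scale(0.5, M)       # "division" by 2 is multiplication by 1/2

print(two_M, neg_two_M, half_M)
print(direction_cosines(M), direction_cosines(neg_two_M))
print(length(neg_two_M) / length(M))   # ratio of lengths is |s|
```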

VECTOR ADDITION (VECTOR + VECTOR)

Given M = $(M_x, M_y)$, where M = (1, 3); and N = $(N_x, N_y)$, where N = (3, 1), then

$$M + N = (M_x + N_x,\; M_y + N_y) \qquad \text{(12-2)}$$

The geometric representation is shown in Figure 12-2 for (1 + 3, 3 + 1) = (4, 4).

Figure 12-2 An example of vector + vector addition: If M = (1, 3) and N = (3, 1), then M + N = (4, 4).
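Equation 12-2 translates directly into component-wise code. The sketch below is a minimal Python illustration with helper names of our own (`add`, `scale`); it also shows that adding a vector scaled by −1, using the scalar multiplication defined earlier, gives the component-wise difference.

```python
def add(u, v):
    """Component-wise sum of two vectors of equal dimension (equation 12-2)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of components")
    return tuple(a + b for a, b in zip(u, v))

def scale(s, v):
    """Multiply every component of vector v by the scalar s."""
    return tuple(s * a for a in v)

M = (1, 3)
N = (3, 1)

total = add(M, N)                   # M + N
difference = add(M, scale(-1, N))   # M + (-1)N

print(total, difference)
```

Addition is commutative, so `add(M, N)` and `add(N, M)` give the same result; the difference is not, since swapping the operands reverses every sign.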


VECTOR SUBTRACTION (VECTOR − VECTOR)

Given M = $(M_x, M_y)$, where M = (1, 3), and N = $(N_x, N_y)$, where N = (3, 1), then

$$M - N = (M_x - N_x,\; M_y - N_y)$$

The geometric representation of M − N = (1 − 3, 3 − 1) = (−2, 2) is shown in Figure 12-3.

Figure 12-3 An example of vector − vector subtraction: If M = (1, 3) and N = (3, 1), then M − N = (−2, 2).

In our next chapter we will look at the problem of representing higher dimensional space with fewer dimensions; it will be a precursor to discussions of the dimensional aspects of multivariate algorithms.


13

Analytic Geometry: Part 3 – Reducing Dimensionality

For this chapter, we will reduce three-dimensional data to one-dimensional data using the techniques of projection and rotation. The (x, y, z) data will be projected onto the (x, z) plane and then rotated onto the x axis. This chapter is purely pedagogical and is intended only to demonstrate the use of projection and rotation as geometric terms.

REDUCING DIMENSIONALITY

The exercise for this chapter is to reduce a point on a vector in 3-D space to a point on a vector in 2-D space, then to further reduce the point on a vector in 2-D space to a point on a vector in 1-D space, all the while maintaining as much information as possible. So (x, y, z) is reduced to (x, z), which is further reduced to (x). This process can be represented in symbolic language as (x, y, z) → (x, z) → x.

3-D TO 2-D BY PROJECTION

Let us calculate some of the angles relative to the vector in 3-D space as shown in Figure 13-1. To calculate these angles, we refer to Chapter 11, and if we proceed with our calculations we find

$$\alpha = \cos^{-1}(0.7071) = 45^\circ \qquad \text{(13-1)}$$

and

$$\cos\beta = \frac{y_2 - y_1}{d} = \frac{2 - 0}{\sqrt{8}} = 0.7071, \qquad \beta = \cos^{-1}(0.7071) = 45^\circ \qquad \text{(13-2)}$$

where

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} = \sqrt{(2 - 0)^2 + (2 - 0)^2} = \sqrt{8}$$

Figure 13-1 A point (X, Y, Z) = (2, 2, 6) located along a vector in 3-D space. Both the angle α (the angle to the x-axis) and the angle β (the angle to the y-axis), as illustrated in the figure, are shown as a projection of the 3-D vector (2, 2, 6) onto the (x, y) plane, and the calculations for both α and β from what is then a 2-D vector are correct as given in equations 13-1 and 13-2.

Because the third dimension is represented by the z axis, we calculate the x-direction angle α on the (x, z) plane as:

$$\alpha = \cos^{-1}\left(\frac{x_2 - x_1}{\sqrt{(x_2 - x_1)^2 + (z_2 - z_1)^2}}\right) = \cos^{-1}\left(\frac{2 - 0}{\sqrt{(2 - 0)^2 + (6 - 0)^2}}\right) = \cos^{-1}(0.3162) = 71.57^\circ \tag{13-3}$$

Now look at Table 13-1, which describes the trigonometric functions of a right triangle (Figure 13-2). If we apply Table 13-1 to this problem, we can calculate the length of a vector using trigonometric functions. Figure 13-3 illustrates the geometric problem of solving for the length of the vector from A to B, that is, from the point (0, 0) to the point (2, 6) on the (x, z) plane. The angle α calculated in equation 13-3 is represented in Figures 13-3 and 13-4; the angle shown in Figure 13-1 is not discussed. The correct calculation for this x-direction angle on the (x, z) plane is given in equation 13-3. To calculate the length of the vector obtained by projecting AB onto the (x, z) plane, we can use sin θ = opp/hyp

Table 13-1 Trigonometric functions of a right triangle

sin θ = opposite / hypotenuse        csc θ = hypotenuse / opposite
cos θ = adjacent / hypotenuse        sec θ = hypotenuse / adjacent
tan θ = opposite / adjacent          cot θ = adjacent / opposite

Figure 13-2 A right triangle showing adjacent (adj.), hypotenuse (hyp.) and opposite sides relative to angle θ.

Figure 13-3 The geometric problem associated with calculating the length of a vector AB, given a point (x, z) = (2, 6) in 2-D space. Note that the angle θ is equal to 90° − 71.57° = 18.43°.

Figure 13-4 Illustration of two-dimensional reduction to one dimension by an x-directional rotation of α = 71.57° (vector length L = 6.33).

which becomes hyp = opp/sin θ = 2/sin(18.43°) = 6.33. Therefore, we can project the AB vector in 3-D space onto 2-D space by using a projection onto the (x, z) plane, resulting in a point on a vector in the 2-D (x, z) plane, the vector being 6.33 units in length and having an x-direction angle equal to 71.57° (as in Figure 13-4).


2-D INTO 1-D BY ROTATION By rotating the vector in 2-D space through 71.57° in the x-direction, we can align it to the x axis as a 1-D line 6.33 units in length (as shown in Figure 13-5).

Figure 13-5 By projecting a vector in (x, y, z) space onto a plane in (x, z) space, and by an x-directional rotation of 71.57° in the (x, z) plane, we have the reduction of a point on a vector in 3-D space to a point on a vector in 1-D space (L = 6.33).
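The reduction (x, y, z) → (x, z) → x for the point (2, 2, 6) can be traced numerically. The sketch below is ours, not the authors'; the helper names are hypothetical, and the projection is taken to mean simply dropping the y coordinate, as the chapter's figures suggest.

```python
import math


def project_to_xz(point_3d):
    """Project (x, y, z) onto the (x, z) plane by dropping y."""
    x, _, z = point_3d
    return (x, z)


def x_direction_angle_deg(point_2d):
    """Angle between the vector from the origin to (x, z) and the x axis."""
    x, z = point_2d
    return math.degrees(math.acos(x / math.hypot(x, z)))


def rotate_onto_x_axis(point_2d):
    """Rotate the vector clockwise by its own direction angle."""
    x, z = point_2d
    angle = math.atan2(z, x)
    x_rot = x * math.cos(-angle) - z * math.sin(-angle)
    z_rot = x * math.sin(-angle) + z * math.cos(-angle)
    return (x_rot, z_rot)


p2 = project_to_xz((2, 2, 6))          # (2, 6)
print(round(x_direction_angle_deg(p2), 2))   # about 71.57 degrees
x1, z1 = rotate_onto_x_axis(p2)
print(round(x1, 2), round(abs(z1), 2))       # about 6.32 and 0.0
```

The length comes out as √40 ≈ 6.32; the value 6.33 quoted in the text reflects rounding of the intermediate angle.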

In our next chapter, we will be applying the lessons reviewed over these past three chapters toward a better understanding of the geometric concepts relative to multivariate regression.

14

Analytic Geometry: Part 4 – The Geometry of Vectors and Matrices

In this chapter, we plan to use the information presented over the past three chapters to illustrate the geometry of vectors and matrices; these concepts will continue to be discussed routinely throughout this series in relation to regression vectors.

ROW VECTORS IN COLUMN SPACE Let us begin by representing a row matrix M = [1, 2, 3] in column space as shown in Figure 14-1. Note that the row vector M = [1, 2, 3] projects onto the plane defined by columns 1 and 2 as a point (1, 2) or a vector (straight line) with a C1 direction angle (α) equal to

$$\alpha = \cos^{-1}\left(\frac{C1_2 - C1_1}{d}\right) = \cos^{-1}\left(\frac{1 - 0}{\sqrt{5}}\right) = \cos^{-1}(0.4472) = 63.43^\circ \tag{14-1}$$

and a C2 direction angle (β) equal to

$$\beta = \cos^{-1}\left(\frac{C2_2 - C2_1}{d}\right) = \cos^{-1}\left(\frac{2 - 0}{\sqrt{5}}\right) = \cos^{-1}(0.8944) = 26.57^\circ \tag{14-2a}$$

where

$$d = \sqrt{(C1_2 - C1_1)^2 + (C2_2 - C2_1)^2} = \sqrt{1^2 + 2^2} = \sqrt{5} \tag{14-2b}$$
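As a quick numerical check of the two direction angles, the following sketch (ours; the function name is an illustrative assumption) projects the row vector onto the plane of Columns 1 and 2 and evaluates both arccosines.

```python
import math


def direction_angles_deg(c1, c2):
    """Direction angles (to the C1 axis, to the C2 axis) of the vector
    from the origin to the point (c1, c2), in degrees."""
    d = math.hypot(c1, c2)  # length of the projected vector
    alpha = math.degrees(math.acos(c1 / d))
    beta = math.degrees(math.acos(c2 / d))
    return alpha, beta


M = (1, 2, 3)
alpha, beta = direction_angles_deg(M[0], M[1])  # projection onto (C1, C2)
print(round(alpha, 2), round(beta, 2))
```

The two angles sum to 90°, as they must for a vector lying in a plane.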

COLUMN VECTORS IN ROW SPACE A matrix consisting of more than one row, such as

$$M = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$$

can be represented by 2-D row space as shown in Figure 14-2. Note that each column in the matrix can be represented by a column vector as shown in the figure.

Figure 14-1 A representation of a row vector M = [1, 2, 3] in column space, and the projection of this vector onto the plane represented by Columns 1 and 2 (direction angles α and β).

Figure 14-2 The representation of column vectors in row space of matrix M = [[1, 2], [3, 4]] (axes: Row 1 and Row 2).

PRINCIPAL COMPONENTS FOR REGRESSION VECTORS Figure 14-3a shows the projection of two column vectors, C1 = (1, 3) and C2 = (3, 1), onto their vector sum (or principal component (PC1)). We note that the element-by-element product is (1, 3) × (3, 1) = (1 × 3, 3 × 1) = (3, 3). The vector sum of the two column vectors passes through the point (3, 3), but the projection of each column onto PC1 gives a vector with a length equal to line segments B + C as shown in Figure 14-3b.

Figure 14-3 (a) The representation of two columns of a matrix in row space. The vector sum of the two column vectors is the first principal component (PC1). (b) A close-up view of Figure 14-3a, illustrating the line segments, direction angles, and projection of Columns 1 and 2 onto the first principal component.

To determine the geometry for Figures 14-3a and 14-3b, we begin by calculating the length of line segment E (Column 1) by using the Pythagorean theorem as

$$E^2 = \mathrm{Hyp}^2 = (3 - 0)^2 + (1 - 0)^2 = 3^2 + 1^2 = 10, \qquad \text{therefore } E = \sqrt{10} = 3.162 \tag{14-3}$$

Then the angle C can be determined using

$$\tan C = \frac{\mathrm{opp}}{\mathrm{adj}} = \frac{1}{3} \tag{14-4a}$$

and

$$\tan^{-1}\left(\frac{1}{3}\right) = 18.435^\circ \tag{14-4b}$$

So ∠C = 18.435°, ∠D = 18.435°, and ∠α + ∠β + 2 × 18.435° = 90°. Thus, ∠α and ∠β are each equal to 26.565°. It follows that the projection of the vectors represented by Columns 1 and 2 onto the vector PC1 yields a right triangle defined by the three line segments C + B, D, and E. The length of PC1 (the hypotenuse) is equal to line segments C + B and is given by

$$\cos\alpha = \frac{\mathrm{adj}}{\mathrm{hyp}} \;\Rightarrow\; \cos\alpha = \frac{E}{C + B} \;\Rightarrow\; 0.8944 = \frac{3.162}{\mathrm{hyp}} \;\Rightarrow\; \mathrm{hyp} = 3.5353 \tag{14-5}$$

So the length of the hypotenuse (segments C + B) is 3.5353. We can check our work by calculating the opposite side (D) length as

$$\tan\alpha = \frac{\mathrm{opp}}{\mathrm{adj}} \;\Rightarrow\; \tan\alpha = \frac{D}{E} \;\Rightarrow\; 0.500 = \frac{\mathrm{opp}}{3.162} \;\Rightarrow\; \mathrm{opp} = 1.5810 \tag{14-6}$$

And by using the Pythagorean theorem we can calculate the length of the hypotenuse:

$$3.162^2 + 1.5810^2 = 3.5352^2 \tag{14-7}$$
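The chain of calculations in equations 14-3 through 14-7 can be reproduced in a few lines. This is a sketch under the chapter's own construction (a right triangle with E as the adjacent side and D as the opposite side); the function and key names are ours.

```python
import math


def pc1_triangle(column=(3, 1)):
    """Recompute the quantities of equations 14-3 to 14-7 for one column."""
    e = math.hypot(column[0], column[1])                 # 14-3: segment E
    angle_c = math.degrees(math.atan(column[1] / column[0]))  # 14-4b
    alpha = (90.0 - 2.0 * angle_c) / 2.0                 # alpha = beta
    hyp = e / math.cos(math.radians(alpha))              # 14-5: C + B
    d = e * math.tan(math.radians(alpha))                # 14-6: segment D
    check = math.hypot(e, d)                             # 14-7: Pythagoras
    return {"E": e, "angle_C": angle_c, "alpha": alpha,
            "hyp": hyp, "D": d, "check": check}


for name, value in pc1_triangle().items():
    print(f"{name}: {value:.4f}")
```

Carried at full precision the hypotenuse is about 3.5355; the values 3.5353 and 3.5352 in the text result from rounding E to 3.162 and the cosine to 0.8944.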

By representing a row vector in column space, or a column vector in row space, we can illustrate the geometry of regression. These concepts combined with matrix algebra will be useful for further discussions of regression. In Chapters 15–20, we will digress from these topics and revisit experimental design concepts. Readers may wish to study additional materials related to the subject of analytical geometry and regression. We recommend two sources of such information below.

RECOMMENDED READING
1. Beebe, K.R. and Kowalski, B.R., Analytical Chemistry 59(17), 1007A–1017A (1987).
2. Fogiel, M., ed., The Geometry Problem Solver (Research and Education Association, New York, 1987).

15 Experimental Designs: Part 4 – Varying Parameters to Expand the Design

We have discussed experimental designs in previous papers [1–4], and in Chapters 8–10. In those previous chapters, the designs we discussed were, with the exception of one particularly interesting design (representing a special case of a more general type of design that we will discuss later), rather simple and plain, in the sense that the designs included only small numbers of levels of the various factors of interest, and were basically considerations of “all possible combinations” of those factors: the types of experiments that scientists have been designing “forever” without any thought or consideration that they were “statistical experimental designs”. Obviously, though, since they represent special cases of wider classes of designs, they must also come under that umbrella. So what is special about the experimental designs that we call “statistical” or “chemometric” designs? Actually, very little, until we take a look at what happens when we need to scale these designs up to larger sample numbers or more complex designs. Before we do that, let us consider the various types of experiments, and the nature of the factors that are used in those experiments. Someone doing an experiment is generally trying to learn about the effect of some phenomenon on some quantity that can be measured. While there are cases that do not fit the description we are about to present, one very common type of experiment involves changing (or allowing the change of) some parameter, and then measuring the effect of that change. If there is only one such parameter, the situation is pretty straightforward, but things start getting interesting when two or more possible parameters are involved. Intuitively, the first instinct is to measure the results that are obtained for all possible combinations of the available values of the parameters. In Chapter 8, we looked at some experiments that involved two parameters (factors), each at two levels.
In Chapter 10, we briefly looked at a three-factor, two-level design, with attention to how it could be represented geometrically. The use of the term “three factor, two level” to describe the design means that each factor was present at two levels, that is, the corresponding parameters were each permitted to assume two values. There are several ways we can expand a design such as this: we can increase the number of factors, the number of levels of each factor, or we can do both, of course. There are other differences that can be superimposed over the basic idea of the simple, all-possible-combinations design. One is whether we can control the levels of the factors (if we can, then we can do things that are not possible if we cannot). Another is whether the “levels” correspond to physical characteristics that can be evaluated, so that the values described have real physical meaning (temperature, for example, has real physical meaning, while catalyst type does not, even though different catalysts in an experiment may all have different degrees of effectiveness, and reproducibly so).


Another consideration is whether all the factors can be changed independently through their range of possible values, or whether there are limits on the possible values. The most obvious limiting situation is the case of mixtures, where all the components of a mixture must sum to 100%. Other limitations might be imposed by the physical (or chemical) behavior of the materials involved: solubility as a function of temperature, for example, or as a function of other materials present (maximum solubility of salt in water–alcohol mixtures, for example, will vary with the ratio of the two solvents). Other limits might be set by practical considerations such as safety; except for specialized work by scientists experienced in the field, few experimenters would want to work, for example, with materials at concentrations above their explosive limits.

REFERENCES
1. Mark, H. and Workman, J., Spectroscopy 9(8), 26–27 (1994).
2. Mark, H. and Workman, J., Spectroscopy 9(9), 30–32 (1994).
3. Mark, H. and Workman, J., Spectroscopy 6(1), 13–16 (1991).
4. Mark, H. and Workman, J., Spectroscopy 10(1), 17–20 (1995).

16

Experimental Designs: Part 5 – One-at-a-time Designs

In Chapter 15, which was based on reference [1], we began our discussions of factorial designs. If we expand the basic n-factor two-level experiment by increasing the number of factors, maintaining the restriction of allowing each to assume only two values, then the number of experiments required is 2^n, where n is the number of factors. Even for experiments that are easy to perform, this number quickly gets out of hand; if eight different factors are of interest, the number of experiments needed to determine the effect of all possible combinations is 256, and this number increases exponentially. The other obvious way we might want to expand the experiment is to increase the number of levels (values) that some or all of the factors take. In this case, the number of experiments required increases even faster than 2^n. So, for example, if each factor is at three levels, then the number of experiments needed is 3^n (for eight factors, corresponding to our previous calculation, this comes to 6,561 experiments!). In the general case, the number of experiments needed is ∏i ni (the product over all factors), where ni is the number of levels of the ith factor. It should be clear at this point that the problem with this scenario is the sheer number of experiments needed, which in the real world translates into time, resources, and expense. Something must be done. Several “somethings” have been done. The intuitive experimenter, expert in his particular field of science but untrained in “statistical” designs, simplifies the whole process by throwing out all the combinations, and uses what are known simply as “one-at-a-time” designs [2]. Five variations of this basic design are described, but basically these are only useful when the random noise or error is small (compared to the expected magnitude of the effects), and involve the experimenter changing one variable (factor) at a time to see which one(s) cause the greatest effect.
Sometimes those are then examined in greater detail, by varying them over larger ranges, and/or at values lying within the original range. This solves the problem of the proliferation of experiments, since the number of experiments needed is now only 1 + Σi ni instead of ∏i ni, a much smaller number. It also provides a first-order indication of the effect of each of the factors. The difficulty now is the possibility of throwing out the baby with the bathwater, so to speak, by losing all information about the actual noise level, and information about any possible synergistic or inhibitory interactions between the factors. Thus, when statisticians got into the act, they saw a need to retain the information that was not included in the one-at-a-time plans, while still keeping the total number of experiments manageable; hence the birth of “statistical experimental designs”. Several types of “statistical experimental designs” have been developed over the years, with, of course,


innumerable variations. However, they can be placed into a fairly small group of main design types:

1) Factorial
2) Fractional factorial
3) Sequential
4) a) Latin square
   b) Graeco-latin square
   c) Latin and Graeco-latin cubes
5) Model-building
6) Response surface.

By far the most statistical energy has been spent on the design and analysis of factorial designs. Books dealing with such designs (e.g., [3, 4]) spend a good part of their space discussing the variations required to accommodate such considerations as replication, blocking, how to deal with situations where the experiment itself is destructive (so that the same specimen is never available for retesting), whether the experimental conditions can be reproduced at will, and whether the experimental factors (or the desired response) can be assigned meaningful numerical values. Each of these considerations dictates the types of designs that can be considered and how they must be implemented. For our current discussions, however, we have been taking the path of discussing ways to reduce the required number of experiments, while still retaining the advantages of obtaining several types of information about the system under consideration. The simplest such type of design is the sequential design, simplest if for no other reason than that the type of design it replaces is one of the simplest designs itself. We will discuss this type of design in Chapter 17.
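The experiment counts quoted in this chapter are easy to verify. The sketch below is ours; the function names are illustrative, and the one-at-a-time count uses the expression given in the text (one plus the sum of the numbers of levels), which we take as stated rather than derive.

```python
import math


def full_factorial_runs(levels):
    """Number of runs for all possible combinations: product of n_i."""
    return math.prod(levels)


def one_at_a_time_runs(levels):
    """Run count for a one-at-a-time plan, using the text's expression
    1 + sum of n_i (an assumption taken directly from the chapter)."""
    return 1 + sum(levels)


print(full_factorial_runs([2] * 8))   # eight factors at two levels
print(full_factorial_runs([3] * 8))   # eight factors at three levels
print(one_at_a_time_runs([3] * 8))    # same factors, one at a time
```

For eight factors this gives 256 and 6,561 runs for the full designs, against 25 for the one-at-a-time plan at three levels.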

REFERENCES
1. Mark, H. and Workman, J., Spectroscopy 10(9), 21–22 (1995).
2. Daniel, C., Journal of American Statistical Association 68(342), 353–360 (1973).
3. Davies, O.L., The Design and Analysis of Industrial Experiments (Longman, New York, 1978).
4. Box, G.E.P., Hunter, W.G. and Hunter, J.S., Statistics for Experimenters (John Wiley & Sons, New York, 1978).

17 Experimental Designs: Part 6 – Sequential Designs

We begin our discussion of resource-conserving (for want of a better generic term) experimental design with a look at sequential designs. This is the first of the types of experimental designs that have as one of their goals, a reduction in the required number of experiments, while still retaining the advantages of obtaining several types of information about the system under consideration. The simplest such type of design is the sequential design, simplest if for no other reason than that the type of design it replaces is one of the simplest designs itself. This design is the simple test for comparison of means, using the Z-test or the t-test as the test statistic; we have discussed these in our previous column series and book: “Statistics in Spectroscopy” (now in its second edition [1]). The standard t-test (or Z-test) specifies a predefined number of measurements to be made, either for a single condition or for a pair of conditions (i.e., sample-versus “control”). The difference between the two states is compared to the experimental error evidenced in the data, and a decision made based on whether the difference between the states is “large enough”, compared to the noise (or error). For a sequential test, the number of experiments is not predefined. Rather, experiments are performed sequentially (surprise!), and the series terminated as soon as enough data is available that a decision can be made as to whether the difference is “large enough”. True, it is theoretically possible for such a sequence of experiments to be indefinitely long; in practice, however, it is far more common for the situation to become decidable after fewer experiments than are required for the case of a fixed number of experiments. So how does this “magic” experimental design work? The best available discussion we know of is in reference [2]. The standard concept behind this experimental design is illustrated in Figure 17-1. 
As this figure shows, the “universe” is divided into three regions: the region (A) is the region of acceptance of the null hypothesis; region C is the region of acceptance of the alternative hypothesis. The middle region, B, is the region of continuation: as long as values fall into this region, we must continue with the experiments, since there is not enough information to make a decision. Figure 17-2 shows how this works for two typical cases. First a single experiment is performed, and the results noted. If these results put it into the region of continuing the project (virtually inevitable after only one experiment), then a second experiment is performed, and so forth. Figure 17-2 shows typical results for two possible sequences of experiments: the one indicated by the crosses enters the region of acceptance of the alternative hypothesis after seven experiments, the one indicated by the circles enters the region of acceptance of the null hypothesis after nine experiments. Obviously, the actual number of experiments required will depend on both the nature of the experiments and the definition of the two regions of acceptance. The x-axis represents, clearly, the number of experiments that have been carried out. The y-axis represents a function of

Figure 17-1 Standard concept behind sequential experimental design: regions A, B and C, with f(α, β) plotted against the number of experiments (see text for definition of function f(α, β)).

Figure 17-2 Typical results for two possible experimental sequences (labeled 1 and 2), with f(α, β) plotted against the number of experiments.

the results of the experiments. Important to note at this point is the fact that the quantity plotted along the y-axis is a function, not of the result of a single experiment, but, in one way or another, of the cumulative results of all the experiments done up to that point. The key point, then, is how the lines separating the different regions are defined. The total answer will depend, of course, on which statistic is being plotted and on the details of the nature of the hypothesis test being done (e.g., two-tailed versus one-tailed, etc.). For an illustration we consider the sequential test of the hypothesis of the mean of a sample being the same as that of a given population, with the standard deviation known. In the case of fixed sample size, this would be done using a statistical hypothesis test with the Z statistic as the test statistic, and the probability level set simply to α. For a sequential test, both the theory and the computations are a bit more complicated. In the case at hand, the defining limits are constructed as shown in Figure 17-3. The expected value of any given measurement is, of course, μ0, the population mean. Then the expected value of the sum of n readings, which we label T, equals nμ0 for each value

Figure 17-3 The relationship between the expected value of the statistic and the lines separating the regions of acceptance and rejection from the region indicating continuation of the experiment (the intercept h0 is marked on the f(α, β) axis).

of n, and plotting these sums as a function of n gives the central straight line shown in Figure 17-3; this line represents the expected value of the sum, and has a slope equal to μ0. As can be seen, data that agrees with the null hypothesis will follow this line and eventually move into region A, the region of acceptance of the null hypothesis. The lines separating the regions are defined by their slope and intercepts. If we let δ represent the minimum difference from μ0 we wish to detect, then the slope of the lines (which is common to the two lines: they are parallel) equals μ0 + δ/2. The y-intercepts, which we designate h, are

$$h_0 = -\ln\left(\frac{1 - \alpha}{\beta}\right)\frac{\sigma^2}{\delta}, \qquad h_1 = \ln\left(\frac{1 - \beta}{\alpha}\right)\frac{\sigma^2}{\delta}$$

We note several interesting points about these expressions. First, the positions of the lines of demarcation depend, as we would expect, on both the minimum expected departure from μ0 we wish to detect and α. They also depend upon a quantity that is a logarithm, and the logarithm of the quantity β no less, that we have always previously dismissed. While a discussion of β properly belongs in the realm of elementary statistics, at this point it is worthwhile to go back to some of those discussions, to examine how this impacts our current interests. We will proceed along with this digression in our next chapter.
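To make the stopping rule concrete, here is a minimal sketch of a sequential test on the running sum T. It assumes the intercepts take the standard form for a sequential test of a mean with known standard deviation, h0 = −(σ²/δ) ln[(1 − α)/β] and h1 = (σ²/δ) ln[(1 − β)/α], and that the null hypothesis is accepted when T falls on or below the lower line. The function names, the default probability levels and the sample readings are all hypothetical.

```python
import math


def sequential_limits(mu0, delta, sigma, alpha, beta):
    """Common slope and the two intercepts of the decision lines."""
    slope = mu0 + delta / 2.0
    h0 = -(sigma ** 2 / delta) * math.log((1.0 - alpha) / beta)
    h1 = (sigma ** 2 / delta) * math.log((1.0 - beta) / alpha)
    return slope, h0, h1


def run_sequential_test(readings, mu0, delta, sigma, alpha=0.05, beta=0.05):
    """Add readings one at a time; stop as soon as the running sum T
    leaves the continuation region. Returns (decision, n_used)."""
    slope, h0, h1 = sequential_limits(mu0, delta, sigma, alpha, beta)
    total = 0.0
    for n, x in enumerate(readings, start=1):
        total += x
        if total <= h0 + slope * n:
            return "accept H0", n
        if total >= h1 + slope * n:
            return "accept H1", n
    return "continue", len(readings)


# Hypothetical data: readings sitting exactly at mu0 + delta
print(run_sequential_test([1.0] * 20, mu0=0.0, delta=1.0, sigma=1.0))
```

With α = β = 0.05 and σ = δ = 1, the intercepts are ±ln 19 ≈ ±2.94, so a run of readings sitting exactly on either hypothesis is decided after six experiments, while readings midway between the two never leave the continuation region.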

REFERENCES
1. Mark, H. and Workman, J., Statistics in Spectroscopy, 1st ed. (Academic Press, New York, 1991).
2. Davies, O.L., The Design and Analysis of Industrial Experiments (Longman, New York, 1978).


18

Experimental Designs: Part 7 – β, the Power of a Test

In Chapter 17 and reference [1], we started discussing the way a series of experiments could be designed so that the decision to perform another experiment could be based on the outcomes of the experiments already done. We saw there that we needed to be able to tell if we could stop because the result had become statistically significant; and we also saw that we needed a way to tell if we could stop because we had reached the statistically significant conclusion that there is no real difference between the sample and the (hypothetical) reference population. This is necessary, indeed crucial, otherwise we could continue experimenting endlessly, waiting for a statistically significant result when there was no real difference to detect so that none would be expected. The first stopping criterion is straightforward, it is simply the standard hypothesis test, based on probabilities that we have previously discussed of a sample coming from the hypothesized population P0 [2]. The second stopping criterion, however, seems to fly in the face of our previous discussions on the topic, where we said that you could not prove two populations the same. However, the reason for the second statement is that the difficulty in proving that a sample came from a given population is easier to see if we reword the statement of it by making it a double negative, and ask whether we can prove that it did not come from a different population? Now the nature of the difficulty becomes clearer: we have no information about the nature of the “different” population that we want to test against. Now that we can see the problem, we can find a point of attack against it. We can hypothesize a population Pa with any given characteristics we want, and then consider the consequences of dealing with that alternate population. 
In particular, we consider the probabilities of either accepting or rejecting our original null hypothesis (based on P0) if, in fact, our sample came from the alternate population Pa. The probability of coming to the incorrect conclusion that the sample came from P0 when it really came from Pa is called the β probability (compare with the α probability, which is the probability of drawing the incorrect conclusion that a sample did not come from P0 when it really did). This probability determines what is known in statistical parlance as the “power” of the statistical test. Thus, in performing a statistical hypothesis test, we would normally consider only the ordinary tests against the alpha error as a means of determining statistical significance. However, as we have seen, that leaves completely open the number of samples needed. The power of a test gives us a criterion which will allow determining the number of samples. To redefine the term: the power of a statistical test is the probability of obtaining a statistically significant result given that in fact the null hypothesis is false. Ordinarily, to show a non-significant result is easy: just use few enough samples. To show that you have obtained a non-significant result when there is a high probability of obtaining a significant result for a false hypothesis is convincing indeed, and also gives us the basis for determining the number of samples needed. On the other hand, we do not want to go overboard and use so many samples that we get statistically significant results for


tiny, unimportant differences. As we will see below, the power of the test does allow us to specify the minimum number of samples required, but this number can quickly get out of hand, and show up tiny differences, if we are not careful about how we specify the requirements. The problem with defining criteria for such a test is that it depends on the β probability, which is difficult to determine (although we could arbitrarily specify a value, such as 95%). It also depends on the smallest difference you need to detect, the number of samples, the variability of the data (which at least can be determined from the data, the same way it is done for determining α), and the probability of detecting the given difference at a specified alpha significance level. Thus what we do is to work backwards, so to speak. Since we want to find the number of samples corresponding to different probabilities for α, β and D (the difference between the data and μ0), we first find the difference corresponding to given values of the other quantities. This can be seen more easily in Figure 18-1. To summarize Figure 18-1 in words, the top curve represents the characteristics of a population P0 with mean μ0. Also indicated in Figure 18-1 is the upper critical limit, marking the 95% point for a standard hypothesis test H0 that the mean of a given sample is consistent with μ0. A measured value above the critical value indicates that it would be “too unlikely” to have come from population P0, so we would conclude that such a reading came from a different population. Two such possible different, or alternate, populations are also shown in Figure 18-1, and labeled P1 and P2. Now, if in fact a random sample was taken from one of these alternate populations, there is a given probability, whose value depends on which population it came from, that it would fall above (or below) the upper critical limit indicated for H0.
The shaded areas in Figure 18-1 indicate the probabilities for a random sample falling below the critical value for H0 , when one of those alternate populations is in fact the correct population from which the sample was taken. As can be seen, these probabilities are 50% for population P1 and roughly 5% for population P2 . These probabilities are

Figure 18-1 Characteristics of population P0 with mean μ0 (and its upper critical limit) and alternate populations P1 and P2 (note that the X-axes have been offset for clarity).

the probabilities of (incorrectly) concluding that the data is consistent with H0 , for the two cases. This same topic is continued in our next chapter.
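The two shaded areas can be reproduced with the Normal distribution from the Python standard library. In this sketch (ours, not the authors'), the upper critical limit is taken as the one-sided 95% point of P0, and the alternate populations are assumed to share the standard deviation of P0; the alternate means are hypothetical values chosen so that the results match the 50% and roughly 5% quoted for P1 and P2.

```python
from statistics import NormalDist


def upper_critical_value(mu0, sigma, confidence=0.95):
    """One-sided upper critical limit for the hypothesized population P0."""
    return mu0 + NormalDist().inv_cdf(confidence) * sigma


def beta_probability(mu_alt, mu0, sigma, confidence=0.95):
    """Probability that a reading from the alternate population falls
    below the critical limit, i.e. is wrongly judged consistent with H0."""
    crit = upper_critical_value(mu0, sigma, confidence)
    return NormalDist(mu_alt, sigma).cdf(crit)


mu0, sigma = 0.0, 1.0                      # hypothetical P0
crit = upper_critical_value(mu0, sigma)    # about 1.645
print(beta_probability(crit, mu0, sigma))       # P1 centered on the limit
print(beta_probability(2 * crit, mu0, sigma))   # P2 farther out
```

An alternate population centered exactly on the critical limit gives β = 0.50; moving its mean out by a further 1.645 standard deviations brings β down to 0.05.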

REFERENCES
1. Mark, H. and Workman, J., Spectroscopy 11(2), 43 (1996).
2. Mark, H. and Workman, J., Statistics in Spectroscopy, 1st ed. (Academic Press, New York, 1991).


19

Experimental Designs: Part 8 – β, the Power of a Test (Continued)

Continuing from our previous discussion in Chapter 18 from reference [1], analogous to making what we have called (and is the standard statistical terminology) the α error when the data is above the critical value but is really from P0, this new error is called the β error, and the corresponding probability is called the β probability. As a caveat, we must note that the correct value of β can be obtained only subject to the usual considerations of all statistical calculations: errors are random and independent, and so on. In addition, since we do not really know the characteristics of the alternate population, we must make additional assumptions. One of these assumptions is that the standard deviation of the alternate population Pa is the same as that of the hypothesized population P0, regardless of the value of its mean. The existence of the β probability provides us with the tool for determining what is called the power of the test, which is just 1 − β, the probability of coming to the correct conclusion when in fact the data did not come from the hypothesized population P0. This is the answer to our earlier question: once we have defined the alternate population Pa, we can determine the probability of a sample having come from Pa, just as we can determine the probability of that sample having come from P0. So how does this help us determine n? As we know from our previous discussion of the Central Limit Theorem [2], the standard deviation of the mean of a sample from a population decreases from the population standard deviation as n increases. Thus, we can fix μ0 and μa and adjust the α and β probabilities by adjusting n and the critical value. Normally, it is convenient to adjust the critical value to be equidistant from μ0 and μa, and then adjust n so that that critical value represents the desired probability levels for α and β.
As an example, we can set the alpha and beta levels to the same value, which makes for a simple computation of the number of samples needed, at least for the simple case we have been considering: the comparison of means. If we use the 95% value for both (a very stringent test), which corresponds to a Z-value of 1.96 (as we know), then if we let D represent the difference in means between the two values (sample data and population mean), and S is the precision of the data, we find that

$$D \ge 3.92\,\frac{S}{\sqrt{n}} \tag{19-1}$$

so that

$$n = \left(\frac{3.92\,S}{D}\right)^2 \approx 15\left(\frac{S}{D}\right)^2 \tag{19-2}$$

In words, we would need 15 samples for 95% confidence on both alpha and beta, to distinguish a difference of the means equal to the precision of the measurement, and the number increases as the square of any decrease in difference we want to detect.
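The arithmetic behind equations 19-1 and 19-2 can be sketched in a few lines of MATLAB. This is only an illustration under the same assumptions as the text (Normal errors, equal standard deviations for P0 and Pa, the same Z-value used for α and β); the values of S and D and the helper name zfun are hypothetical choices, not taken from the original columns.

```matlab
% Sketch: number of samples needed to detect a difference D in means,
% given precision S and chosen alpha and beta levels (equation 19-2).
alpha = 0.05;   % alpha level; gives Z = 1.96 as used in the text
beta  = 0.05;   % beta level, treated the same way as alpha
S     = 1.0;    % precision (standard deviation); hypothetical value
D     = 1.0;    % difference in means to detect; hypothetical value

% Inverse Normal CDF built from erfcinv, so no toolbox is required
zfun = @(p) -sqrt(2) * erfcinv(2 * p);

zAlpha = zfun(1 - alpha/2);          % about 1.96
zBeta  = zfun(1 - beta/2);           % about 1.96

n = ((zAlpha + zBeta) * S / D)^2;    % about 15.4 when D equals S
nSamples = ceil(n);                  % round up to whole samples

fprintf('n = %.2f, so use %d samples\n', n, nSamples);
```

Setting D to half of S in this sketch roughly quadruples n, which is the "square of any decrease" behavior described above.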


To compute the power for a hypothesis test based on standard deviation, we would have to read off the corresponding probability points from a chi-square table; for 95% confidence on both α and β, the square root of the ratio of χ²(0.95, ν) to χ²(0.05, ν) (ν = the degrees of freedom, close enough to n for now) is the ratio of standard deviations that can be distinguished at that level of power. Similarly to the case of the means, ν would also be related to the square of that ratio, but χ² would still have to be read from tables (or computed numerically). As an example, for 35 samples, the precision of the instrument could not be tested to be better than

$$ \sqrt{48.6 / 21.6} = \sqrt{2.25} = 1.5 \tag{19-3} $$

or 1.5 times the precision of the reference method with that amount of power, and as before, n will increase as the square of any improvement we want to demonstrate. The ratio of χ²(0.95, ν) to χ²(0.05, ν) does decrease as ν increases, but not nearly as fast as the square increases: it is a losing fight.

Thus, the use of the concept of the Power of a Test allows specification of the number of samples (although it may turn out to be very high), and by virtue of that forms the basis for performing experiments as a sequential series.
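For readers who prefer to compute the chi-square points rather than read them from a table, the calculation behind equation 19-3 can be sketched as follows. The sketch assumes ν = n − 1 = 34 for 35 samples (the text only says ν is "close enough to n"), and the helper name chi2q is hypothetical; it uses the base MATLAB function gammaincinv instead of a statistics toolbox.

```matlab
% Sketch: ratio of standard deviations distinguishable at 95% on both
% alpha and beta, following the reasoning behind equation 19-3.
nu = 34;    % degrees of freedom; assumed to be n - 1 for 35 samples

% Chi-square quantile from the inverse incomplete gamma function:
% if X ~ chi-square(nu), then P(X <= x) = gammainc(x/2, nu/2)
chi2q = @(p, v) 2 * gammaincinv(p, v/2);

upperPoint = chi2q(0.95, nu);             % upper 95% point
lowerPoint = chi2q(0.05, nu);             % lower 5% point
ratio = sqrt(upperPoint / lowerPoint);    % close to 1.5 for nu = 34

fprintf('chi2(0.95) = %.1f, chi2(0.05) = %.1f, ratio = %.2f\n', ...
        upperPoint, lowerPoint, ratio);
```

Repeating the calculation for larger values of nu shows how slowly the ratio approaches 1, which is the "losing fight" mentioned in the text.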

REFERENCES
1. Mark, H. and Workman, J., Spectroscopy 11(6), 30–31 (1996).
2. Mark, H. and Workman, J., Spectroscopy 3(1), 44–48 (1988).

20 Experimental Designs: Part 9 – Sequential Designs Concluded

Our previous two chapters, based on references [1, 2], describe how the use of the power concept for a hypothesis test allows us to determine a value for n at which we can state with both α-% and β-% certainty that the given data either is or is not consistent with the stated null hypothesis H0. To recap those results briefly, as a lead-in for returning to our main topic [3], we showed that the concept of the power of a statistical hypothesis test allowed us to determine both the α and the β probabilities, and that these two known values allowed us to then determine, for every n, what was otherwise a “floating” quantity, D.

At this point it should be starting to become clear what is going on. If a given set of α, β, and D allows us to determine n, then similarly, a corresponding set of α, β, and n allows us to determine D. Thus for a given α and β, n and D are functions of each other, and it then becomes a simple matter (at least in principle; in practice the math involved is extremely hairy) to determine the functionality.

In fact the actual situation is considerably more complicated to determine mathematically. In our previous discussions, we have made a number of simplifying assumptions which cannot be used if we wish to calculate correct values for our expressions, and for which the actual situation must be incorporated into the math. The first of these assumptions is the use of the Normal distribution. When we perform an experiment using a sequential design, we are implicitly using the experimentally determined value of s, the sample standard deviation, against which to compare the difference between the data and the hypothesis. As we have discussed previously, the use of the experimental value of s for the standard deviation, rather than the population value of σ, means that we must use the t-distribution as the basis of our comparisons, rather than the Normal distribution. This, of course, causes a change in the critical value we must consider, especially at small values of n (which is where we want to be working, after all).

The other key assumption that we more or less implied was that the standard deviation used for the comparison is constant. Of course we know that as n changes, the comparison value changes as the square root of n. This is on top of and in addition to the changes caused by the use of the t rather than the Normal (Z) distribution.

So how is this related to the nature of the graph used for the sequential experimental design? We forgo the detailed math here, in deference to trying to impart an intuitive grasp of the topic, and we have already presented the equations involved [3]. The limits of the allowable values around the hypothesized value close in on it as n increases. This behavior is shown in Figure 20-1. If, in fact, we were to plot the mean of the population as a function of n, it would be a horizontal line, just as shown. The mean of the actual data would vary around this horizontal line (assuming the null hypothesis was correct), at smaller and smaller distances, as n increased.

[Figure: horizontal axis n; curves labeled Upper critical limit, Mean (µ0), Lower critical limit]

Figure 20-1 The limits of the allowable values around the hypothesized value close in on it as n increases.
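The qualitative shape of Figure 20-1 can be reproduced with a short sketch. It is not the calculation from reference [3]; it simply plots μ0 ± t·s/√n, combining the two effects named in the text (the t-distribution critical value and the square root of n). The values of mu0, s, and alpha are hypothetical, and s is held fixed for simplicity, whereas in a real sequential design it would be re-estimated as each reading arrives.

```matlab
% Sketch: critical limits around mu0 as a function of n (compare Figure 20-1).
mu0   = 10;      % hypothesized population mean; hypothetical value
s     = 1;       % sample standard deviation, held fixed here for simplicity
alpha = 0.05;    % two-sided alpha level

n  = 2:30;       % the t-distribution needs at least one degree of freedom
nu = n - 1;      % degrees of freedom

% Two-sided t critical value without a toolbox: for T ~ t(nu),
% P(|T| > t) = betainc(nu/(nu + t^2), nu/2, 1/2), so invert for t.
x     = betaincinv(alpha, nu/2, 0.5);
tcrit = sqrt(nu .* (1 - x) ./ x);

halfWidth = tcrit .* s ./ sqrt(n);   % shrinks through both tcrit and 1/sqrt(n)
upper = mu0 + halfWidth;
lower = mu0 - halfWidth;

plot(n, upper, 'k-', n, lower, 'k-', n, mu0*ones(size(n)), 'k--');
xlabel('n');  ylabel('Mean');
legend('Upper critical limit', 'Lower critical limit', 'Mean (\mu_0)');
```

The limits narrow very quickly over the first few values of n, where the t critical value is changing most, and more slowly thereafter.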

If the null hypothesis was wrong, then the data would vary around a line offset from the line representing μ0, and get closer and closer to that offset line instead. Eventually, at some value of n, this line would cross the converging lines representing the critical limits around μ0, indicating the result. This is the basic picture, shown in Figure 20-2. For a sequential experimental plan, the sequence is terminated at the first significant experiment, as shown.

The details differ, however. By convention, instead of plotting the mean, μ0, as a function of n, the sum of the data, which has a theoretical value of n × μ0, is used. Clearly this line will slope upward with a slope of μ0, instead of being horizontal, as will the data plot. The rest of the conceptual picture is the same, however. As we saw previously in reference [3], the slope of the line represented by n × μ0 is paralleled by the confidence limits for the sum of the data, as represented by the equations in that

[Figure: horizontal axis n; curves labeled Upper critical limit, Mean (x), Mean (µ0), Lower critical limit; the first significant reading is marked]

Figure 20-2 If the null hypothesis was wrong, then the data would vary around a line offset from the line representing μ0 and get closer and closer to that line.

[Figure: horizontal axis n; lines labeled n × (x), n × (µ0), Upper critical limit, Lower critical limit; the first significant point is marked]

Figure 20-3 The approach of the upper line, representing the α probability, corresponds to the approach of the curved lines to the n × μ0 line (representing the null hypothesis).

column; thus, at the point where the line representing the successive mean values from the experimental design crosses the confidence limit in Figure 20-2, so does the line representing the successive sums eventually cross the line specified by the equations in reference [3], and illustrated in Figure 20-3 here.

According to the derived equations, as we saw previously, the actual confidence limits representing the α and β probabilities are straight lines parallel to each other but not parallel to the line representing n × μ0. The approach of the upper line, representing the α probability, corresponds to the approach of the curved lines, shown in Figure 20-3, to the n × μ0 line (representing the null hypothesis) there. The line representing β, however, being parallel to the α line, departs from the null hypothesis. This can be interpreted as stating, as we have previously implied, that it is always harder to “prove” the null hypothesis than to disprove it.

REFERENCES
1. Mark, H. and Workman, J., Spectroscopy 11(6), 30–31 (1996).
2. Mark, H. and Workman, J., Spectroscopy 11(8), 34 (1996).
3. Mark, H. and Workman, J., Spectroscopy 11(4), 32–33 (1996).


21 Calculating the Solution for Regression Techniques: Part 1 – Multivariate Regression Made Simple

For the next several chapters we will illustrate the straightforward calculations used for multiple linear regression (MLR), principal components regression (PCR), partial least squares regression (PLS), and singular value decomposition (SVD). In all cases we will use the same notation and perform all mathematical operations using MATLAB (Matrix Laboratory) software [1, 2]. We have already discussed and shown many of the manual methods for calculating the matrix algebra in references [3–6].

Let us begin by identifying a simple data matrix denoted by A. A is used to represent a set of absorbances for three samples as a rows × columns matrix. For our example each row represents a different sample spectrum, and each column a different data channel, absorbance or frequency. We arbitrarily designate A for our example as

$$ A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \\ A_{31} & A_{32} \end{bmatrix} = A_{I \times K} = \begin{bmatrix} 1 & 7 \\ 4 & 10 \\ 6 & 14 \end{bmatrix} \tag{21-1} $$

Thus, the integers 1 and 7 represent the instrument signal for two data channels (frequencies 1 and 2) for sample Spectrum #1, 4 and 10 represent the same data channel signals (e.g., frequencies 1 and 2) for sample Spectrum #2, and so on. If we arbitrarily set our concentration c vector representing a single component to be a single column of numbers as

$$ c_{r \times c} = \begin{bmatrix} c_{11} \\ c_{21} \\ c_{31} \end{bmatrix} = c_{I \times 1} = \begin{bmatrix} 4 \\ 8 \\ 11 \end{bmatrix} \tag{21-2} $$

we now have the data necessary to calculate the matrix of regression coefficients b, which is given by

$$ b = \begin{bmatrix} b_{11} \\ b_{21} \end{bmatrix} = (A'A)^{-1} A' c = A^{+} c = \hat{p} \tag{21-3} $$

This b (also known as p̂, the prediction vector) is often referred to as the regression vector or set of regression coefficients. Note that $(A'A)^{-1}A'$ is referred to as the pseudoinverse of A, designated as $A^{+}$. Note also that there is one regression coefficient for each frequency (or data channel). The vector of predicted values is easily obtained as Matrix A (the data matrix) × Vector b (the regression coefficients) = Vector ĉ (the predicted values of c). This is shown in matrix notation as

$$ A \times b = \hat{c} \tag{21-4} $$


Now using the MATLAB command line software, we can easily demonstrate this solution (for the multivariate problem we have identified) using a series of simple matrix operations as shown in Table 21-1 below:

Table 21-1 Matrix operations in MATLAB to compute equations 21-1–21-4

Command line                   Comments
>> A = [1 7; 4 10; 6 14]       Enter the A matrix
A =                            Display the A matrix
     1     7
     4    10
     6    14
>> c = [4; 8; 11]              Enter the concentration vector c
c =                            Display the concentration vector c
     4
     8
    11
>> b = inv(A'*A)*A'*c          Calculate the regression vector
                               [Note: The inverse applies only to (A'*A)]
b =                            Display the regression vector b
    0.7722
    0.4662
>> A*b                         Predict the concentrations
ans =                          [Note: A residual (or difference) exists
    4.0356                     between the predicted and the actual
    7.7509                     concentrations 4, 8, and 11]
   11.1601
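As a cross-check on Table 21-1, the same regression vector can be obtained in MATLAB without forming an explicit inverse. The sketch below is an addition for illustration, not part of the original table; the variable names bNormal, bPinv, bSlash, cHat, and residual are hypothetical. The backslash operator and pinv are standard MATLAB alternatives that are generally better behaved numerically than inv(A'*A) when the columns of A are nearly collinear.

```matlab
% Sketch: three equivalent routes to the regression vector of equation 21-3.
A = [1 7; 4 10; 6 14];          % data matrix from equation 21-1
c = [4; 8; 11];                 % concentration vector from equation 21-2

bNormal = inv(A'*A) * A' * c;   % normal equations, as in Table 21-1
bPinv   = pinv(A) * c;          % explicit pseudoinverse A+
bSlash  = A \ c;                % least-squares solve, no explicit inverse

cHat     = A * bSlash;          % predicted concentrations (equation 21-4)
residual = c - cHat;            % what the two-channel model cannot fit

disp([bNormal bPinv bSlash]);   % the three columns should agree
disp(residual');                % small but nonzero residuals
```

For this well-conditioned example all three columns agree to the displayed precision; the differences between the routes only become visible with poorly behaved matrices.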

REFERENCES
1. MATLAB software for Windows from The MathWorks, Inc., 24 Prime Park Way, Natick, Mass. 01760-1500.
2. O’Haver, T.C., Chemometrics and Intelligent Laboratory Systems 6, 95 (1989).
3. Workman, J. and Mark, H., Spectroscopy 8(9), 16 (1993).
4. Workman, J. and Mark, H., Spectroscopy 9(1), 16 (1994).
5. Workman, J. and Mark, H., Spectroscopy 9(4), 18 (1994).
6. Mark, H. and Workman, J., Spectroscopy 9(5), 22 (1994).

22 Calculating the Solution for Regression Techniques: Part 2 – Principal Component(s) Regression Made Simple

For the next several chapters in this book we will illustrate the straightforward calculations used for multivariate regression. In each case we continue to perform all mathematical operations using MATLAB software [1, 2]. We have already discussed and shown the manual methods for calculating most of the matrix algebra used here in references [3–6]. You may wish to program these operations yourselves or use other software to routinely make these calculations.

As in Chapter 21, we begin by identifying a simple data matrix denoted by A. A is used to represent a set of absorbances for three samples as a rows × columns matrix. For our example each row represents a different sample spectrum, and each column a different data channel, absorbance or frequency. We arbitrarily designate A for our example as

$$ A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \\ A_{31} & A_{32} \end{bmatrix} = A_{I \times K} = \begin{bmatrix} 1 & 7 \\ 4 & 10 \\ 6 & 14 \end{bmatrix} \tag{22-1} $$

Thus, 1 and 7 represent the instrument signal for two data channels (frequencies 1 and 2) for sample spectrum #1; 4 and 10 represent the same data channel signals (e.g., frequencies 1 and 2) for sample spectrum #2, and so on.

We now have the data necessary to calculate the singular value decomposition (SVD) for matrix A. The operation performed in SVD is sometimes referred to as eigenanalysis, principal components analysis, or factor analysis. If we perform SVD on the A matrix, the result is three matrices, termed the left singular vectors (LSV) matrix or the U matrix; the singular values matrix (SVM) or the S matrix; and the right singular vectors (RSV) matrix or the V matrix.

We now have enough information to find our Scores matrix and Loadings matrix. First of all, the Loadings matrix is simply the right singular vectors matrix or the V matrix; this matrix is referred to as the P matrix in principal components analysis terminology. The Scores matrix is calculated as

The data matrix A × the Loadings matrix V = Scores matrix T    (22-2)

Note: the Scores matrix is referred to as the T matrix in principal components analysis terminology. Let us look at what we have completed so far by showing the SVD calculations in MATLAB as illustrated in Table 22-1.


Table 22-1 Matrix operations in MATLAB to compute the SVD of data matrix A

Command line                         Comments
>> A = [1 7; 4 10; 6 14]             Enter the A matrix
A =                                  Display the A matrix
     1     7
     4    10
     6    14
>> [U,S,V] = svd(A);                 Perform SVD on the A matrix
>> U                                 Display the U matrix or the left
U =                                  singular vectors (LSV) matrix
    0.3468    0.9303    0.1193
    0.5417   -0.0949   -0.8352
    0.7656   -0.3543    0.5369
>> S                                 Display the S matrix or the
S =                                  singular values (SV) matrix
   19.8785         0
         0    1.6865
         0         0
>> V                                 Display the V matrix or the right
V =                                  singular vectors (RSV) matrix (Note: this
    0.3576   -0.9339                 is also known as the P matrix or
    0.9339    0.3576                 Loadings matrix)
>> T = A*V                           Calculate the Scores matrix or the
T =                                  T matrix
    6.8948    1.5690
   10.7691   -0.1600
   15.2198   -0.5976
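Two relationships implicit in Table 22-1 can be checked with a short sketch: the scores T = A × V are the same as U × S, and the squared singular values are the eigenvalues of A'A (which is why SVD is sometimes described as eigenanalysis). The variable names scoreGap, eigGap, and explained are hypothetical, and the economy-size option 'econ' is used here only to drop the unneeded third column of U; the signs of individual columns of U and V may differ from the table depending on the MATLAB version.

```matlab
% Sketch: consistency checks on the SVD of the example data matrix.
A = [1 7; 4 10; 6 14];

[U, S, V] = svd(A, 'econ');   % economy size: U is 3x2, S is 2x2, V is 2x2
T = A * V;                    % scores, equation 22-2

% Because A = U*S*V' and V'*V = I, the scores also equal U*S
scoreGap = norm(T - U*S);

% Squared singular values are the eigenvalues of A'*A
eigGap = norm(sort(diag(S).^2) - sort(eig(A'*A)));

% Fraction of the total sum of squares carried by each component
explained = diag(S).^2 / sum(diag(S).^2);

fprintf('score gap %.2e, eigenvalue gap %.2e\n', scoreGap, eigGap);
disp(explained');             % the first component carries about 99%
```

Both gaps should be at the level of rounding error, and the explained fractions show that the second component carries under 1% of the sum of squares in this small example.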

If we arbitrarily set our concentration c vector representing a single component to be a single column of numbers as

$$ c_{r \times c} = \begin{bmatrix} c_{11} \\ c_{21} \\ c_{31} \end{bmatrix} = c_{I \times 1} = \begin{bmatrix} 4 \\ 8 \\ 11 \end{bmatrix} \tag{22-3} $$

we can now use S, V, and T to calculate the following. A reconstruction of the original data matrix A is computed by using the preselected number of principal components (i.e., columns in our T and V matrices) as

$$ A_{\text{estimated}} = T \times V' \tag{22-4} $$

The set of regression coefficients (i.e., the regression vector) is calculated as

$$ b = V \times S^{-1} \times U' \times c \tag{22-5} $$


Table 22-2 Matrix operations in MATLAB to compute equations 22-4–22-6

Command line                                    Comments
>> Aest = T*V'                                  Estimate the A data matrix
Aest =                                          Display the estimate for A
    1.0000    7.0000
    4.0000   10.0000
    6.0000   14.0000
>> b = V(:,1:2)*inv(S(1:2,1:2))*U(:,1:2)'*c     Calculate the regression vector
                                                [Note: The inverse operation refers
                                                only to the singular values matrix S.
                                                The calculation to determine b can
                                                only be performed using two columns
                                                in each of the V, S, and U matrices;
                                                this number is equivalent to the
                                                number of latent variables (or
                                                principal components) used.]
b =                                             Display the regression vector
    0.7722
    0.4662
>> cest = (T*V')*b                              Predict the concentrations [Note:
                                                This computation is equivalent to
                                                (Aest × b)]
cest =                                          Display the concentration vector
    4.0356                                      [Note: For this example of PCR a
    7.7509                                      residual (or difference) exists
   11.1601                                      between the predicted and the actual
                                                concentrations 4, 8, and 11]

The predicted or estimated values of c are computed as

$$ c_{\text{estimated}} = T \times V' \times b \tag{22-6} $$

Now using the MATLAB command line software, we can easily demonstrate this solution (for the multivariate problem we have identified) using a series of matrix operations as shown in Table 22-2.
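Table 22-2 uses both principal components, in which case PCR reproduces the MLR result of Chapter 21. To illustrate what "the preselected number of principal components" means in equations 22-4 to 22-6, the following sketch repeats the calculation while keeping k = 1 and then k = 2 components. It is an illustration only; the loop and the names Uk, Sk, Vk, and Tk are hypothetical and do not come from the original tables.

```matlab
% Sketch: PCR with k retained principal components (equations 22-4 to 22-6).
A = [1 7; 4 10; 6 14];
c = [4; 8; 11];

[U, S, V] = svd(A, 'econ');

for k = 1:2                                  % number of components kept
    Uk = U(:, 1:k);  Sk = S(1:k, 1:k);  Vk = V(:, 1:k);

    Tk   = A * Vk;                           % scores for the kept components
    Aest = Tk * Vk';                         % equation 22-4
    b    = Vk * (Sk \ (Uk' * c));            % equation 22-5, without inv(S)
    cest = Aest * b;                         % equation 22-6

    fprintf('k = %d: b = [%s], residual norm = %.4f\n', ...
            k, num2str(b', '%8.4f'), norm(c - cest));
end
```

With k = 2 the regression vector matches Table 22-2; with k = 1 the data matrix is only approximately reconstructed and the regression vector changes accordingly.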

REFERENCES
1. MATLAB software from The MathWorks, Inc., 24 Prime Park Way, Natick, Mass. 01760-1500.
2. O’Haver, T.C., Chemometrics and Intelligent Laboratory Systems 6, 95 (1989).
3. Workman, J. and Mark, H., Spectroscopy 8(9), 16 (1993).
4. Workman, J. and Mark, H., Spectroscopy 9(1), 16 (1994).
5. Workman, J. and Mark, H., Spectroscopy 9(4), 18 (1994).
6. Mark, H. and Workman, J., Spectroscopy 9(5), 22 (1994).


23 Calculating the Solution for Regression Techniques: Part 3 – Partial Least Squares Regression Made Simple

For the past three chapters we have described the most basic calculations for MLR, PCR, and PLS. Our intent is to show basic computations for these regression methods while avoiding unnecessary complexity which could confuse rather than instruct. There are of course a number of difficulties in taking this simplistic approach; namely, the assumptions made for our simple cases do not always hold, and poorly behaved matrices are the rule rather than the exception. We have not yet discussed the concepts of rank, collinearity, scaling, or data conditioning. Issues of graphical representation, details of computational methods, and assessment of model performance are forthcoming. We ask that you bear with us over the next several chapters as we intend to delve much more deeply into the details and problems associated with regression methods.

For this chapter we will illustrate the straightforward calculations used for PLS regression utilizing singular value decomposition. For PLS a special case of SVD is used. You will notice that the PLS form of SVD includes the use of the concentration vector c as well as the data matrix A. The reader will note that the scores and loadings are determined using the concentration values for PLS SVD, whereas only the data matrix A is used to perform SVD for principal components analysis. SVD and PLS SVD will be the subject of several future chapters, so we will only introduce their use here and not their derivation. All mathematical operations are completed using MATLAB software [1, 2]. As previously discussed, the manual methods for calculating the matrix algebra used within these chapters are found in references [3–7]. You may wish to program these operations yourselves or use other software to routinely make the calculations.

As in the previous chapter, we begin by identifying a simple data matrix denoted by A. A is used to represent a set of absorbances for three samples and three data channels, as a rows × columns matrix. For our example each row represents a different sample spectrum, and each column a different data channel, absorbance or frequency. We arbitrarily designate A for our example as

$$ A_{r \times c} = \begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix} = A_{I \times K} = \begin{bmatrix} 1 & 7 & 9 \\ 4 & 10 & 12 \\ 6 & 14 & 16 \end{bmatrix} \tag{23-1} $$

Thus, 1, 7, and 9 represent the instrument signal for three data channels (frequencies 1, 2, and 3) for sample spectrum #1; 4, 10, and 12 represent the same data channel signals (e.g., frequencies 1, 2, and 3) for sample spectrum #2, and so on.


If we arbitrarily set our concentration c vector representing a single component to be a single column of numbers as

$$ c_{r \times c} = \begin{bmatrix} c_{11} \\ c_{21} \\ c_{31} \end{bmatrix} = c_{I \times 1} = \begin{bmatrix} 4 \\ 8 \\ 11 \end{bmatrix} \tag{23-2} $$

we now have both the data matrix A and the concentration vector c required to calculate PLS SVD. Both A and c are necessary to calculate the special case of PLS singular value decomposition (PLS SVD). The operation performed in PLS SVD is sometimes referred to as the PLS form of eigenanalysis, or factor analysis. If we perform PLS SVD on the A matrix and the c vector, the result is three matrices, termed the left singular vectors (LSV) matrix or the U matrix; the singular values matrix (SVM) or the S matrix; and the right singular vectors (RSV) matrix or the V matrix.

We now have enough information to find our PLS Scores matrix and PLS Loadings matrix. First of all, the PLS Loadings matrix is simply the right singular vectors matrix or the V matrix; this matrix is referred to as the P matrix in principal components analysis and partial least squares terminology. The PLS Scores matrix is calculated as

The data matrix A × the PLS Loadings matrix V = PLS Scores matrix T    (23-3)

Note: the PLS Scores matrix is referred to as the T matrix in principal components analysis and partial least squares terminology. Let us look at what we have completed so far by showing the PLS SVD calculations in MATLAB as illustrated in Table 23-1.

We can now use S, V, and T to calculate the following. A reconstruction of the original data matrix A is computed by using the preselected number of factors (i.e., columns in our T and V matrices) as

$$ A_{\text{estimated}} = T \times V' \tag{23-4} $$

The set of regression coefficients (i.e., the regression vector) is calculated as

$$ b = V \times S^{-1} \times U' \times c \tag{23-5} $$

The predicted or estimated values of c are computed as

$$ c_{\text{estimated}} = T \times V' \times b \tag{23-6} $$

This expression is equivalent to

$$ c_{\text{estimated}} = A_{\text{estimated}} \times b = A \times b \tag{23-7} $$

or can be used to predict the concentration for a single sample spectrum a using the expression

$$ c_{\text{estimated}} = a_{\text{estimated}} \times b = a \times b \tag{23-8} $$

Now using the MATLAB command line software, we can easily demonstrate this solution (for the multivariate problem we have identified) using a series of matrix operations as shown in Table 23-2.


Table 23-1 Matrix operations in MATLAB to compute the PLS SVD calculations of data matrix A (see equations 23-1–23-3)

Command line                             Comments
>> A = [1 7 9; 4 10 12; 6 14 16]         Enter the A matrix
A =                                      Display the A matrix
     1     7     9
     4    10    12
     6    14    16
>> c = [4; 8; 11]                        Enter the c vector
c =                                      Display the c vector
     4
     8
    11
>> [U,S,V] = SVDPLS(A,c,3);              Perform PLS SVD on the A matrix. This is
                                         a CPAC [7] version of the PLS SVD
                                         algorithm.
>> U                                     Display the U matrix or the left
U =                                      singular vectors (LSV) matrix
    0.3817   -0.9067   -0.1797
    0.5451    0.0638    0.8359
    0.7465    0.4170   -0.5186
>> S                                     Display the S matrix or the
S =                                      singular values (SV) matrix
   29.5796   -0.2076    0.0000
    0.0000    1.9904   -0.0367
    0.0000    0.0000    0.2038
>> V                                     Display the PLS V matrix or the right
V =                                      singular vectors (RSV) matrix (Note: this
    0.2446    0.9345    0.2588           is also known as the P matrix or PLS
    0.6283    0.0506   -0.7764           Loadings matrix)
    0.7386   -0.3525    0.5747
>> T = A*V                               Calculate the PLS Scores matrix or the
T =                                      T matrix
   11.2894   -1.8839   -0.0034
   16.1236    0.0138    0.1680
   22.0801    0.6750   -0.1210
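SVDPLS is a CPAC routine [7] and is not reproduced in this chapter, so readers without it cannot run the command above. As a stand-in, the sketch below builds matrices of the same form (orthonormal U and V, and an upper-bidiagonal S whose first loading vector is proportional to A'c) using a Golub–Kahan style bidiagonalization started from A'c. This is an assumption about how such matrices can be produced, not a statement of how the CPAC code works, and the signs of individual columns and off-diagonal elements may differ from Table 23-1.

```matlab
% Sketch: bidiagonalization of A started from A'*c, giving A*V = U*S with
% S upper bidiagonal. A stand-in for SVDPLS, not the CPAC algorithm itself.
A = [1 7 9; 4 10 12; 6 14 16];
c = [4; 8; 11];
k = 3;                                   % number of PLS factors

[m, n] = size(A);
U = zeros(m, k);  V = zeros(n, k);  S = zeros(k, k);

v = A' * c;   v = v / norm(v);           % first loading vector
u = A * v;    S(1,1) = norm(u);  u = u / S(1,1);
V(:,1) = v;   U(:,1) = u;

for j = 2:k
    v = A' * U(:,j-1) - S(j-1,j-1) * V(:,j-1);
    v = v - V(:,1:j-1) * (V(:,1:j-1)' * v);   % re-orthogonalize
    S(j-1,j) = norm(v);   v = v / S(j-1,j);   % off-diagonal element

    u = A * v - S(j-1,j) * U(:,j-1);
    u = u - U(:,1:j-1) * (U(:,1:j-1)' * u);   % re-orthogonalize
    S(j,j) = norm(u);     u = u / S(j,j);     % diagonal element

    V(:,j) = v;   U(:,j) = u;
end

T = A * V;                               % PLS scores, equation 23-3
b = V * (S \ (U' * c));                  % regression vector, equation 23-5
fprintf('reconstruction error: %.2e\n', norm(A - U*S*V'));
```

The magnitudes of the first columns of V and T and of the diagonal of S obtained this way can be compared against Table 23-1; the loop makes no provision for a zero norm, which does not arise for this example.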


Table 23-2 Matrix operations in MATLAB to compute equations 23-4–23-8

Command line                        Comments
>> Aest = T*V'                      Estimate the A data matrix
Aest =                              Display the estimate for A
    1.0000    7.0000    9.0000
    4.0000   10.0000   12.0000
    6.0000   14.0000   16.0000
>> b = V*inv(S)*U'*c                Calculate the PLS regression vector [Note:
                                    The inverse operation refers only to the
                                    singular values matrix S. The calculation
                                    to determine b is performed using three
                                    columns in each of the V, S, and U
                                    matrices; this number is equivalent to the
                                    number of latent variables (or PLS
                                    factors) used.]
b =                                 Display the regression vector
    1.1667
   -0.6667
    0.8333
>> cest = T*V'*b                    Predict the concentrations [Note: This
                                    computation is equivalent to (Aest × b)]
cest =                              Display the concentration vector [Note:
    4.0000                          For this simple example of PLS no residual
    8.0000                          (or difference) exists between the
   11.0000                          predicted and the actual concentrations
                                    4, 8, and 11]
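The absence of a residual in Table 23-2 follows from the example itself: A is square and nonsingular, and all three factors are retained, so the regression vector is simply the exact solution of A × b = c. The sketch below checks this directly and then applies equation 23-8 to one new spectrum. The spectrum aNew is a hypothetical value invented for illustration and is not part of the original example.

```matlab
% Sketch: with all three factors kept, b is the exact solution of A*b = c.
A = [1 7 9; 4 10 12; 6 14 16];
c = [4; 8; 11];

b    = A \ c;              % compare with the regression vector in Table 23-2
cest = A * b;              % equation 23-7; reproduces c to rounding error

aNew = [2 8 10];           % hypothetical new sample spectrum (one row)
cNew = aNew * b;           % equation 23-8: predicted concentration

disp(b');
fprintf('max |c - cest| = %.2e, cNew = %.4f\n', max(abs(c - cest)), cNew);
```

With fewer factors than data channels, or with more samples than channels, this exact fit no longer holds and residuals appear as they did in Chapters 21 and 22.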

REFERENCES
1. MATLAB software Version 4.2 for Windows from The MathWorks, Inc., 24 Prime Park Way, Natick, Mass. 01760-1500.
2. O’Haver, T.C., Chemometrics and Intelligent Laboratory Systems 6, 95 (1989).
3. Workman, J. and Mark, H., Spectroscopy 8(9), 16 (1993).
4. Workman, J. and Mark, H., Spectroscopy 9(1), 16 (1994).
5. Workman, J. and Mark, H., Spectroscopy 9(4), 18 (1994).
6. Mark, H. and Workman, J., Spectroscopy 9(5), 22 (1994).
7. Center for Process Analytical Chemistry, University of Washington, Seattle, WA, m-script library, 1993 (Contact Mel Koch or Dave Veltkamp for current versions).

24 Looking Behind and Ahead: Interlude

We depart from discussion of our usual topics in this chapter. Over the years since we began writing on this topic, there has been a spate of telephone calls where the callers, after introducing themselves, said something that could generically be rendered as: “By chance I came across a copy of one of your articles, and am interested in reading more about this subject. Are there any more articles like this, and what are they, and how can I get them?” After discussing this between ourselves, we decided that we have reached a point where it is worthwhile to present our readers with a complete set of the chemometrics writings published to date.

Those of you who have been reading our work for a long time will recall that the column series “Chemometrics in Spectroscopy” is a continuation of our previous column series, “Statistics in Spectroscopy”. Statistics in Spectroscopy was published from 1986 to 1992, with some preliminary articles in 1985. The columns from the earlier series have been collected and published in their entirety as a book of the same name (with minor editorial changes appropriate to the change in format from a series of columns to a book), now in its second edition.

So much for the past; what about what lies ahead? The last few chapters have been presenting the “nuts and bolts” of some of the more common chemometric techniques for performing quantitative chemometric/spectroscopic calibration, even getting down to the level of a “cookbook” of actual code (written for the MATLAB matrix algebra and multivariate analysis software). The following chapters will deal first with completing a discussion of the various chemometric techniques in current use, and will then go “under the hood” with them to emphasize the underlying mathematical and theoretical framework that these methods rest upon. One upcoming topic will be a description of the so-called “statistical design of experiments” methodologies, emphasizing those techniques that tend to be obscure but are more useful than their treatment in mainstream Chemometric discussions would suggest.


25 A Simple Question: The Meaning of Chemometrics Pondered

In a 1997 paper, Steve Brown and Barry Lavine state, “Chemometrics is not a subfield of Statistics. Although statistical methods are employed in Chemometrics, they are not the primary vehicles for data analysis” [1]. Parenthetically, we recommend this article as a very nice nonmathematical introduction for the average chemist as to what Chemometrics is, and how it can be used. As far as the quote is concerned, we have to both agree and disagree.

On the one hand, we have to recognize the de facto truth that many users of Chemometric techniques are not aware of the statistical background of the techniques, and indeed, we sometimes suspect that even the developers of those techniques may not be aware of the statistical considerations, or at least may not give them their proper weight. Having said that, we will issue some disclaimers a little further on, because there are some legitimate and justifiable reasons for the existence of this situation. However, ignoring the existence of this situation means that nobody is paying the attention that would eventually lead to the condition being corrected, which would result in a better theoretical understanding of the techniques themselves, with a concomitant improvement in their reliability and in the definition of their range of applicability.

This leads us to the other hand, which, it should be obvious, is that we feel that Chemometrics should be considered a subfield of Statistics, for the reasons given above. Questions currently plaguing us, such as “How many MLR/PCA/PLS factors should I use in my model?” and “Can I transfer my calibration model?” (or more importantly and fundamentally: “How can I tell if I can transfer my calibration model?”), may never be answered in a completely rigorous and satisfactory fashion, but certainly improvements in the current state of knowledge should be attainable, with attendant improvements in the answers to such questions. New questions may arise which only fundamental statistical/probabilistic considerations may answer; one that has recently come to our attention is, “What is the best way to create a qualitative (i.e., identification) model, if there may be errors in the classifications of the samples used for training the algorithm?”

Part of the problem, of course, is that the statistical questions involved are very difficult, and have not yet been solved completely and rigorously even by statisticians. Another part of the problem is that very few first-class statisticians are interested in, or perhaps even aware of, the existence of our subdiscipline or its problems. Thus of necessity we push on and muddle through in the face of not always having a completely firm, mathematically rigorous foundation on which to base our use of the techniques we deal with (here comes our disclaimer). So we use these techniques anyway because otherwise we would have nothing: if we waited for complete rigor before we did anything, we would likely be waiting a long, long time, maybe indefinitely, for a solution that might never appear, and in the meanwhile be helpless in the face of the real (and real-world) problems that confront us.


But that does not mean that we should not fight the good fight while we are trying to solve current problems, or let that effort distract us. This means two things. The first is to do as we have been doing, and use our imperfect tools and our imperfect understanding of them, to continue to solve problems as best we can. But the second thing we need to do is what we have not been doing, which is to improve our understanding of the tools we use. In this endeavor, more widespread and better understanding and application of the fundamental statistical/probabilistic basis of our chemometric algorithms is crucial. Maybe one of the things we need to accomplish this is to recruit more first-class statisticians into our ranks, so that they can pay proper attention to the fundamentals, and explain them to the rest of us. Also each of us should pay attention and put some effort into learning more about these fundamentals ourselves. Then we could ourselves better understand the phenomena we see occurring in our data and analyses thereof, and then maybe eventually learn how to deal with them properly.

In order to appreciate how understanding new statistical concepts can help us, let us look at an example of where we can better apply known statistical concepts, to understand phenomena currently afflicting us. To this end, let us pose the seemingly innocuous question: “When doing quantitative calibration, why is it that we use the formulation of the problem that makes the constituent values the dependent (i.e., the Y) variable, and make the spectroscopic data the X (or independent) variable, called the Inverse Beer’s Law formulation (sometimes called the P-matrix formulation)?” (For that matter, why is the formulation that we most commonly use called “Inverse Beer’s Law” instead of the direct “Beer’s Law”?)

Now, we are sure that everybody reading this chapter thinks they know the answer. If you are among those readers, then you are wrong already, because there are multiple answers to this question, all of them correct, and each of them incomplete.

Let us dispose of the most common answer first. This answer is the one given in most of the discussions about the relative merits of the two formulations, e.g. [2], and is essentially a practical one: we use the Inverse Beer’s Law formulation because by doing so, we need only determine the concentration(s) of the analyte(s) of interest. In the Beer’s law formulation, you must determine the concentrations of all components in a mixture, whether they are of interest or not. Of course, there is benefit to that also; as Malinowski points out, you can determine the number of components in a mixture and their spectra, as well as their concentrations, by proper application of the techniques of factor analysis in such a case [3].

The second answer is similar, but even more simplistic. Figure 25-1 shows a graphical depiction of a two-wavelength calibration situation: the values on the two wavelength axes determine the point on the calibration plane from which to strike a line to the concentration axis. The situation, however, is symmetric; so why don’t we consider the possibility of using the value along one of the wavelength axes along with the concentration value to determine the value along the other wavelength axis? In theory this could be done, but the reason we do not do it is the same as the answer to the main question above: we do not care; this case is of no interest to us. As chemists, we are interested in determining quantities of chemical interest, and we use the spectroscopic values as a means of attaining this goal; the reverse calculation is of no interest to us as chemists.

None of these answers deal with fundamentals. So finally we get to the substantive part of the discussion, the one that connects with our original diatribe concerning the goal

A Simple Question: The Meaning of Chemometrics Pondered

121

Calibration plane CONC

+

WL 2

WL 1

Figure 25-1 Symbolic graphical depiction of a two-wavelength calibration.

and role of Statistics in Chemometric calculations, the one that will give us an answer to our original question that is based on fundamental considerations, and therefore the one that is the purpose of this whole discussion. To fully appreciate the point we have to go back a bit and look at the historical development of spectroscopic quantitative analysis. Back when we were in school and taking academic courses in Analytical Chemistry, spectroscopy was only one of many techniques presented (and one of the “minor” ones, at that). Now, we can not really compare our experiences with what is being done currently because we are somewhat out of touch with academia, but back then what we now call the Beer’s Law formulation (i.e., making the constituent concentration the X-variable) was the one presented and taught, and we were required to use it. Of course, as an academic exercise the system was simplified: there was only one analyte in a pure solvent, so in principle it would seem that we could have put either variable on the X-axis. Nowadays, standard practice would impel us to put the analyte concentration on the Y -axis even in this simplified situation (whether it belonged there or not). What has changed between then and now? Well in fact considerable has changed, in both the nature of the situation surrounding the analysis and the instruments we use to do the measurements. Back in the days of our academic exercises, spectrometers were based on vacuum-tube technology (remember them? – or are we dating ourselves?), were noisy, drifted terribly, and were full of all manner of error sources. The samples we used to calibrate the instrument, on the other hand, were made synthetically, by weighing the analyte on an analytical balance and dis solving it in the fixed volume of a volumetric flask. Both of these items were considered to be the highest-precision, highest-accuracy measuring devices available. 
Therefore, in those days, the accuracy of the spectroscopic measurements was considered to be far inferior to the accuracy of the training samples. In those days, Statistics was more highly regarded than it is now, and the analytical chemists then knew the fundamental requirements of doing calibration work. There are several; we need not go into all of them now, but the one that is pertinent to our current discussion is the one that states that, while the Y-variable may contain error, the X-variable must be known without error. Now, in the real world this is never true, since all quantities are the result of some measurement, which will therefore have error associated with it. In practice, however, it is sometimes possible to reduce the error to a sufficiently small value that it approximates zero well enough for the calibration calculations to work.

What happens if we do not manage to keep the X error “sufficiently small”? Let us examine a situation which is just complicated enough to show the effects; three sets of data are presented in Table 25-1, which we will use along with some of the statistics associated with calibration calculations based on those data. Graphical representations of the three data sets are displayed in Figures 25-2A through 25-2C, so that the respective models can be compared to the data.

Table 25-1 Three sets of data illustrating the effect of errors in X and in Y on the results obtained by calibration

(A) No error
Sample #     X      Y
1            0      0
2            0      0
3           10     10
4           10     10
Intercept = 0; Slope = 1; Correlation coeff = 1; SEE = 0; PRESS = 0

(B) Error in Y
Sample #     X      Y
1            0     −1
2            0      1
3           10      9
4           10     11
Intercept = 0; Slope = 1; Correlation coeff = 0.98058; SEE = 1.4142; PRESS = 2.000

(C) Error in X
Sample #     X      Y
1           −1      0
2            1      0
3            9     10
4           11     10
Intercept = 0.19231; Slope = 0.96154; Correlation coeff = 0.98058; SEE = 1.38675; PRESS = 1.92018

Figure 25-2 Graphical representation of three regression situations: (a) no error; (b) error in Y only; (c) error in X only. Panels (a) and (b) show the correct model; panel (c) shows both the correct model and the calculated model. See text for discussion.

We present univariate data, since that shows the effects we wish to illustrate, and is the simplest example that will do so. The biggest advantage to a scenario like this is that we know the “right” answer, because we can make it whatever we want it to be. In this case, the right answer is that the intercept is zero and the slope is 1 (unity). Table 25-1A represents this condition with four samples whose data follow that model without error. The data in Table 25-1A are the prototype data upon which we will build data containing error, and investigate the effects of errors in Y and in X. We use four data points, in coincident pairs, so that when we introduce error, we can retain certain important properties that will result in the same model being the correct one for the data. Along with the data, we show the results of doing the calibration calculations on the data. For Table 25-1A, the slope and the intercept are as we described, the error (which we measure as both the Standard Error of Estimate [SEE] and using cross-validation [the PRESS statistic, using the leave-one-out algorithm]) is zero (naturally), and the correlation coefficient is unity – a necessary concomitant of having zero error.


Now in Table 25-1B, we introduce error into the Y variable. We do so by adding +1 to one each of the high and low values, and −1 to each of the other high and low values. This maintains symmetry and keeps the average position of each pair of points the same, which guarantees that the correct model for the data does not change. This is in accordance with theory and is borne out when the calibration calculations are performed: the model is identical, even though the error (SEE) is no longer zero and the correlation coefficient is no longer unity. Go ahead: redo the calculations and check this out for yourself.

Now, the purists and the sharper-eyed among us may argue that another requirement of regression theory is that the errors follow a Normal (i.e., Gaussian) distribution and that these errors are not distributed properly. We counter this argument by pointing out that there is not enough data to tell the difference; there is no significance test that can be used to demonstrate that the data either do or do not follow any predetermined distribution.

Finally, and of most interest, are the data in Table 25-1C. Here we have taken the same errors as in Table 25-1B and applied them to the X variable rather than the Y variable. By symmetry arguments, we might expect that we should find the same results as in Table 25-1B. In fact, however, the results are different, in several notable ways. In the first place, we arrive at the wrong model. We know that this model is not correct because we know what the right model is, since we predetermined it. This is the first place that what the statisticians have told us about the results is seen. In statistical parlance, the presence of error in the X variable “biases the coefficient toward zero”, and so we find: the slope is decreased (always decreased) from the correct value (of unity, with this data) to 0.96+. So the first problem is that we obtain the wrong model.
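For readers who take up the invitation to redo the calculations, the following is a minimal sketch in Python (our choice of language; the chapter does not say what software was used). The function and variable names are ours. One point is an assumption inferred from the numbers rather than stated in the text: the tabulated PRESS values are reproduced if PRESS is reported as the root mean square of the leave-one-out prediction errors.

```python
import math

def calibration_stats(x, y):
    """Fit y = intercept + slope * x by least squares and return summary statistics."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # SEE with n - 2 degrees of freedom (two fitted coefficients).
    see = math.sqrt(sum(e * e for e in residuals) / (n - 2))
    corr = sxy / math.sqrt(sxx * syy)
    # Leave-one-out residuals via the leverage identity e_i / (1 - h_i),
    # which gives the same result as refitting the line without sample i.
    loo = [e / (1.0 - (1.0 / n + (xi - mean_x) ** 2 / sxx))
           for e, xi in zip(residuals, x)]
    # ASSUMPTION: PRESS is reported as the root mean square of those residuals.
    press = math.sqrt(sum(e * e for e in loo) / n)
    return {"intercept": intercept, "slope": slope, "corr": corr,
            "see": see, "press": press}

# The three data sets of Table 25-1 as (X values, Y values).
TABLE_25_1 = {
    "A (no error)":   ([0, 0, 10, 10], [0, 0, 10, 10]),
    "B (error in Y)": ([0, 0, 10, 10], [-1, 1, 9, 11]),
    "C (error in X)": ([-1, 1, 9, 11], [0, 0, 10, 10]),
}

for label, (x, y) in TABLE_25_1.items():
    stats = calibration_stats(x, y)
    print(label, {name: round(value, 5) for name, value in stats.items()})
```

Under the stated assumption about PRESS, the printed values should agree with the statistics listed under each part of Table 25-1 to the number of digits shown there.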
The next item we will look at is the correlation coefficient. The correlation coefficient for Table 25-1C is identical to that in Table 25-1B. There is nothing particularly noteworthy about this, except that the correlation coefficient is useless as a means of distinguishing between the two cases: obviously, since we obtain the same result in both situations, we cannot tell from the value of the correlation coefficient which situation we are dealing with.

Now we come to the Standard Error of Estimate and the PRESS statistic, which show interesting behavior indeed. Compare the values of these statistics in Tables 25-1B and 25-1C. Note that the value in Table 25-1C is lower than the value in Table 25-1B. Thus, using either of these as a guide, an analyst would prefer the model of Table 25-1C to that of Table 25-1B. But we know a priori that the model in Table 25-1C is the wrong model. Therefore we come to the inescapable conclusion that in the presence of error in the X variable, the use of SEE, or even cross-validation as an indicator, is worse than useless, since it is actively misleading us as to the correct model to use to describe the data.

This is for univariate data; what happens in the case of multivariate (multiwavelength) spectroscopic analysis? The same thing, only worse. To calculate the effects rigorously and quantitatively is an extremely difficult exercise for the multivariate case, because not only are the errors themselves involved, but in addition the correlation structure of the data exacerbates the effects. Qualitatively we can note that, just as in the univariate case, the presence of error in the absorbance data will “bias the coefficient(s) toward zero”, to use the formal statistical description. In the multivariate case, however, each coefficient will be biased by different amounts, reflecting the different amounts of noise (or error, more generally) affecting the data at different wavelengths. As mentioned above, these effects will be exacerbated by intercorrelation between the data at different wavelengths. The difficulty comes when you realize that it is not simply the correlations between pairs of wavelengths that are operative in this regard, but also the intercorrelation effects of the data when the wavelengths are taken 3, 4, ..., n at a time. This is what has made the problem so intractable.

Now, we are sure that there are some readers who will read this and say something along the lines of “well, all you need do is do a PCA/PLS analysis and get rid of all those effects”. Actually, there might be a germ of truth to that – if you can always do all your calibration modeling using only the first two or three PCA or PLS factors. Beyond that you will run into what we might almost call the Law of Conservation of Error (except for the fact that, as we all know, error is much easier to create than destroy!). In special cases, however, such as PCA and PLS, the total error really is constant, so that we quickly get into territory where the noise that you pushed out of the first couple of factors reappears, and affects the higher factors even more than the original noise affected the original data.

So in the long-gone days of our academic lives, the chemical measurements, being based on high-accuracy gravimetric and volumetric techniques, were indeed the proper ones to put on the X-axis. Contrast that with the current state of technology: instruments have improved enormously, and rather than making up training samples by simple gravimetric dilutions, we often obtain our training, or reference, values through complicated analytical methodologies, which are themselves fraught with so much error that even in favorable cases, the error can be 5–10% of the analytical value. In our current practice, therefore, the error in the reference lab values really is greater than the error in the absorbance data.
For this reason it is now appropriate to reverse the positions of the concentration and absorbance values relative to their place in the calculation schema. So it is the changing nature of the world and the types of analyses we do that dictate how we go about organizing the calculations we use to do them. This comes from fundamental considerations of the behavior of the modeling process, which the science of Statistics can tell us about.
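A short worked equation may make the phrase “biases the coefficient toward zero” concrete for the univariate case of Table 25-1C. The notation is ours, not the chapter's: ξ denotes the error-free X values, δ the errors added to them, and S a centered sum of squares or cross-products. The sketch assumes, as in Table 25-1C, that Y itself is error-free.

```latex
% Observed X carries error; Y follows the true line exactly.
x_i = \xi_i + \delta_i , \qquad y_i = \beta_0 + \beta_1 \xi_i

% Least-squares slope computed from the observed data:
b_1 = \frac{S_{xy}}{S_{xx}}
    = \beta_1 \,
      \frac{S_{\xi\xi} + S_{\xi\delta}}
           {S_{\xi\xi} + 2 S_{\xi\delta} + S_{\delta\delta}}

% Table 25-1C: \xi = (0, 0, 10, 10), \delta = (-1, +1, -1, +1), \beta_1 = 1, so
% S_{\xi\xi} = 100, \quad S_{\xi\delta} = 0, \quad S_{\delta\delta} = 4, and
b_1 = \frac{100}{100 + 4} = 0.96154
```

When the errors are uncorrelated with the true values, the error sum of squares appears only in the denominator, which is why the calculated slope can only shrink in magnitude.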

REFERENCES
1. Lavine, B.K. and Brown, S., Today’s Chemist at Work 6(9), 29–37 (1997).
2. Brown, C.W., Spectroscopy 1(4), 32–37 (1986).
3. Malinowski, E.R., Factor Analysis in Chemistry, 2nd ed. (John Wiley & Sons, New York, 1991).


26

Calculating the Solution for Regression Techniques:

Part 4 – Singular Value Decomposition

In Chapters 21–23 and in this chapter, we have described the most basic calculations for MLR, PCR, and PLS. To reiterate, our intention is to demonstrate these basic computations for each mathematical method presently, and then to delve into greater detail as the chapters progress; consider these articles linear algebra bytes. For this chapter we will illustrate the basic calculation and mathematical relationships of different matrices for the calculations of Singular Value Decomposition or SVD. You will note from previous chapters that SVD is used for modern computations of principal components regression (PCR) and partial least squares regression (PLSR), although slightly different forms of SVD are used for each set of computations. Recall that for PCR we simply used SVD, and for PLS a special case of SVD that we called PLS SVD was used. You will also recall that the PLS form of SVD includes the use of the concentration vector c as well as the data matrix A. The reader will note that the scores (T) and loadings (V) are determined using the concentration values for PLS SVD, whereas only the data matrix A is used to perform SVD for principal components analysis.

All mathematical operations used for this chapter are completed using MATLAB software for Windows [1]. As previously discussed, the manual methods for calculating the matrix algebra used within these chapters are found in references [2–5]. You may wish to program these operations yourselves or use other software to routinely make the calculations.

As in previous installments we begin by identifying a simple data matrix denoted by A. A is used to represent a set of absorbances for three samples and three data channels, as a rows × columns matrix. For our example each row represents a different sample spectrum, and each column a different data channel, absorbance or frequency.
We arbitrarily designate A for our example as

$$A_{r \times c} = \begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix} = A_{I \times K} = \begin{bmatrix} 1 & 7 & 9 \\ 4 & 10 & 12 \\ 6 & 14 & 16 \end{bmatrix} \tag{26-1}$$

Thus, 1, 7, and 9 represent the instrument signal for three data channels (frequencies 1, 2, and 3) for sample spectrum #1; 4, 10, and 12 represent the same data channel signals (e.g., frequencies 1, 2, and 3) for sample spectrum #2, and so on. Given any data matrix A of arbitrary size (as rows × columns) the matrix A can be written or defined using the computation of Singular Value Decomposition [6–8] as

$$A = USV' = U \times S \times V' \tag{26-2}$$

where the prime denotes the transpose, U is the left singular values matrix, V is the loadings matrix, and S is the diagonal matrix containing information on the variance described by each principal component (as the S matrix columns). It is important to note when reviewing the use of SVD in the literature that many references define the scores matrix (T) as U × S. Keep in mind that the scores can be calculated as

$$U \times S = A \times V = T \tag{26-3}$$

and it holds that the original data matrix A can be reconstructed as

$$U \times S \times V' = T \times V' = A \times V \times V' = A \times I = A \tag{26-4}$$

We can demonstrate the interrelationships between the different matrices resulting from the SVD calculations by the use of MATLAB as shown in Table 26-1. By studying the relationships between the various matrices resulting from the computation of SVD, one can observe that there are several ways to compute the same final results, making it somewhat difficult to follow the literature. However, knowing these inner mathematical relationships can help clarify our understanding of the different nomenclature. We will compare and contrast some of the literature and the use of different terms in later installments; right now just tuck this information away for future reference.

Table 26-1 Simple SVD performed on matrix A using MATLAB; other matrix relationships are also shown (see equations 26-1 through 26-4)

Command line:
    A = [1 7 9;4 10 12;6 14 16]
Comments: Enter the A matrix

Command line:
    A =
         1     7     9
         4    10    12
         6    14    16
Comments: Display the A matrix

Command line:
    [U,S,V] = svd(A)
Comments: Calculate the SVD of A

Command line:
    U =
        0.3821    0.9061   -0.1814
        0.5451   -0.0624    0.8361
        0.7463   -0.4183   -0.5178
Comments: Display the U matrix, also known as the left singular values matrix, and rarely referred to as the scores matrix. The scores matrix is most often denoted as U × S or A × V which as it turns out are exactly the same.

Command line:
    S =
       29.5803         0         0
             0    1.9907         0
             0         0    0.2038
Comments: Display the S matrix or the singular values matrix. This diagonal matrix contains the variance described by each principal component. Note: the squares of the singular values are termed the eigenvalues.

Command line:
    V =
        0.2380   -0.9312    0.2762
        0.6279   -0.0694   -0.7752
        0.7410    0.3579    0.5681
Comments: Display the V matrix or the right singular values matrix; this is also known as the loadings matrix. Note: this matrix is the eigenvectors corresponding to the positive eigenvalues.

Command line:
    U*S*V'
    ans =
        1.0000    7.0000    9.0000
        4.0000   10.0000   12.0000
        6.0000   14.0000   16.0000
Comments: U*S*V' is equivalent to the original data matrix A derived using the SVD computation

Command line:
    T = A*V
    T =
       11.3024    1.8038   -0.0370
       16.1231   -0.1243    0.1704
       22.0748   -0.8328   -0.1055
Comments: The scores matrix (often designated as T) can be calculated as A × V

Command line:
    U*S
    ans =
       11.3024    1.8038   -0.0370
       16.1231   -0.1243    0.1704
       22.0748   -0.8328   -0.1055
Comments: As mentioned in the text of the article, the scores matrix T can also be calculated as U × S.

Command line:
    T*V'
    ans =
        1.0000    7.0000    9.0000
        4.0000   10.0000   12.0000
        6.0000   14.0000   16.0000
Comments: As we have stated, the original data matrix A can be estimated as the scores matrix (T) × the transpose of the loadings matrix (V') as shown.

Command line:
    A*V*V'
    ans =
        1.0000    7.0000    9.0000
        4.0000   10.0000   12.0000
        6.0000   14.0000   16.0000
Comments: Just another way to estimate the original data matrix A. In this case, V times the transpose of V (itself) is a diagonal matrix with a value of ones along the diagonal, such as shown below. Note: this matrix of ones along the diagonal is called an identity matrix or (I).
        1.0000    0.0000    0.0000
        0.0000    1.0000    0.0000
        0.0000    0.0000    1.0000
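For readers who prefer to program these operations themselves, here is a minimal sketch in plain Python (our choice of language). It does not use MATLAB's svd routine. As a stand-in it uses a one-sided Jacobi (Hestenes) iteration, which rotates pairs of columns of A until they are mutually orthogonal; all function and variable names are ours. The rotated matrix is then the scores matrix T = A × V = U × S, the column lengths are the singular values, and the accumulated rotations form V.

```python
import math

def one_sided_jacobi(A, sweeps=60, tol=1e-12):
    """Return (T, s, V): scores T = A*V, singular values s (descending), loadings V."""
    m, n = len(A), len(A[0])
    W = [[float(v) for v in row] for row in A]      # working copy, becomes T
    V = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(W[i][p] ** 2 for i in range(m))
                beta = sum(W[i][q] ** 2 for i in range(m))
                gamma = sum(W[i][p] * W[i][q] for i in range(m))
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue                        # columns p and q already orthogonal
                rotated = True
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                for M in (W, V):                    # apply the same rotation to both
                    for row in M:
                        rp, rq = row[p], row[q]
                        row[p] = c * rp - s * rq
                        row[q] = s * rp + c * rq
        if not rotated:
            break
    norms = [math.sqrt(sum(W[i][j] ** 2 for i in range(m))) for j in range(n)]
    order = sorted(range(n), key=lambda j: -norms[j])   # largest singular value first
    T = [[W[i][j] for j in order] for i in range(m)]
    V_sorted = [[V[i][j] for j in order] for i in range(n)]
    return T, [norms[j] for j in order], V_sorted

def times_transpose(T, V):
    """Compute T * V' (equation 26-4)."""
    return [[sum(t * v for t, v in zip(t_row, v_row)) for v_row in V] for t_row in T]

A = [[1, 7, 9], [4, 10, 12], [6, 14, 16]]
T, singular_values, V = one_sided_jacobi(A)
A_back = times_transpose(T, V)
print("singular values:", [round(s, 4) for s in singular_values])
print("T * V' =", [[round(v, 4) for v in row] for row in A_back])
```

The signs of individual columns of T and V may differ from those in Table 26-1, since each column pair is determined only up to a common sign; the singular values and the reconstruction T × V' do not depend on that choice.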

REFERENCES
1. MatLab software for Windows from The MathWorks, Inc., 24 Prime Park Way, Natick, Mass. 01760-1500. Internet: [email protected].
2. Workman, J. and Mark, H., Spectroscopy 8(9), 16 (1993).
3. Workman, J. and Mark, H., Spectroscopy 9(1), 16 (1994).
4. Workman, J. and Mark, H., Spectroscopy 9(4), 18 (1994).
5. Mark, H. and Workman, J., Spectroscopy 9(5), 22 (1994).
6. Mandel, J., American Statistician 36, 15 (1982).
7. Golub, G.H. and Van Loan, C.F., Matrix Computations, 2nd ed. (The Johns Hopkins University Press, Baltimore, MD, 1989), pp. 427, 431.
8. Searle, S.R., Matrix Algebra Useful for Statistics (John Wiley & Sons, New York, 1982), p. 316.


27 Linearity in Calibration

Those who know us know that we have always been proponents of the approach to calibration that uses a small number of selected wavelengths. The reasons for this are partly historical, since we became involved in Chemometrics through our involvement in near-infrared spectroscopy, back when wavelength-based calibration techniques were essentially the only ones available, and these methods did yeoman’s service for many years. When full-spectrum methods came on the scene (PCR, PLS) and became popular, we adopted them as another set of tools in our chemometric armamentarium, but always kept in mind our roots, and used wavelength-based techniques when necessary and appropriate, and we always knew that they could sometimes perform better than the full spectrum techniques under the proper conditions, despite all the hype of the proponents of the full-spectrum methods. Lately, various other workers have also noticed that eliminating “extra” wavelengths could improve the results, but nobody (including ourselves) could predict when this would happen, or explain or define the conditions that make it possible.

The advantages of the full-spectrum methods are obvious, and are promoted by the proponents of full-spectrum methods at every opportunity: the ability to reduce noise by averaging data over both wavelengths and spectra, noise rejection by rejecting the higher factors, into which the noise is preferentially placed, the advantages inherent in the use of orthogonal variables, and the avoidance of the time-consuming step of performing the wavelength selection process. The main problem was to define the conditions where wavelength selection was superior; we could never quite put our finger on what characteristics of spectra would allow the wavelength-based techniques to perform better than full-spectrum methods. Until recently.
What sparked our realization of (at least one of) the key characteristics was an on-line discussion of the NIR discussion group [1] dealing with a similar question, whereupon the ideas floating around in our heads congealed. At the time, the concept was proposed simply as a thought experiment, but afterward, the realization dawned that it was a relatively simple matter to convert the thought experiment into a computer simulation of the situation, and check it out in reality (or at least as near to reality as a simulation permits). The advantage of this approach is that simulation allows the experimenter to separate the effect under study from all other effects and investigate its behavior in isolation, something which cannot be done in the real world, especially when the subject is something as complicated as the calibration process based on real spectroscopic data.

The basic situation is illustrated in Figure 27-1. What we have here is a simulation of an ideal case: a transmission measurement using a perfectly noise-free spectrometer through a clear, non-absorbing solvent, with a single, completely soluble analyte dissolved in it. The X-axis represents the wavelength index, the Y-axis represents the measured absorbance. In our simulation there are six evenly spaced concentrations of analyte, with simulated “concentrations” ranging from 1 to 6 units, and a maximum simulated absorbance for the highest concentration sample of 1.5 absorbance units. Theoretically, this situation should be describable, and modeled, by a single wavelength, or a single factor. Therefore in our simulation we use only one wavelength (or factor) to study.

Figure 27-1 Six samples’ worth of spectra with two bands, without (left) and with (right) stray light. (see Color Plate 1) (Axes: wavelength index, 1 to 301, versus absorbance, −0.2 to 1.6.)

For the purpose of our simulation, the solute is assumed to have two equal bands, both of which perfectly follow Beer’s law. What we want to study is the effect of nonlinearities on the calibration. Any nonlinearity would do, but in the interest of retaining some resemblance to reality, we created the nonlinearity by simulating the effect of stray light in the instrument, such that the spectra are measured with an instrument that exhibits 5% stray light at the higher wavelengths. Now, 5% might be considered an excessive amount of stray light, and certainly, most actual instruments can easily exhibit more than an order of magnitude better performance. However, this whole exercise is being done for pedagogical purposes, and for that reason, it is preferable for the effects to be large enough to be visible to the eye; 5% is about right for that purpose. Thus, the band at the lower wavelengths exhibits perfect linearity, but the one at the higher wavelengths does not. Therefore, even though the underlying spectra follow Beer’s law, the measured spectra not only show nonlinearity, they do so differently at different wavelengths. This is clearly shown in Figure 27-2, where absorbance versus concentration is plotted for the two peaks.

Now, what is interesting about this situation is that ordinary regression theory and the theory of PCA and PLS specify that the model generated must be linear in the coefficients. Nothing is specified about the nature of the data (except that it be noise-free, as our simulated data is); the data may be non-linear to any degree. Ordinarily this is not a problem because any data transform may be used to linearize the data, if that is desirable.
In this case, however, one band is linearly related to the concentrations and one is not; a transformation, blindly applied, that linearized the absorbance of the higher-wavelength band would cause the other band to become non-linear. So now, what is the effect of this all on the calibration results that would be obtained? Clearly, in a wavelength-based approach, a single wavelength (which would be theoretically correct), at the peak of the lower-wavelength band, would give a perfect fit to the absorbance data. On the other hand, a single wavelength at the higher-wavelength band would give errors due to the nonlinearity of the absorbance. The key question then becomes, how would a full-wavelength (factor-based) approach behave in this situation?

Figure 27-2 Absorbance versus concentration, without (upper) and with (lower) stray light. (Axes: concentration, 1 to 6, versus absorbance, 0 to 1.6.)

In the discussion group, it was conjectured that a single factor would split the difference; the factor would take on some character of both absorbance bands, and would adjust itself to give less error than the non-linear band alone, but still not be as good as using the linear band. Figure 27-3 shows the factor obtained from the PCA of this data. It seems to be essentially Gaussian in the region of the lower-wavelength band, and somewhat flattened in the region of the higher-wavelength band, conforming to the nature of the underlying absorbances in the two spectral regions.

Because of the way the data was created, we can rely on the calibration statistics as an indicator of performance. There is no need to use a validation set of data here. Validation sets are required mainly to assess the effects of noise and intercorrelation. Our simulated data contains no noise. Furthermore, since we are using only one wavelength or one factor, intercorrelation effects are not operative, and can be ignored. Therefore the final test lies in the values obtained from the sets of calibration results, which are presented in Table 27-1.

Those results seem to bear out our conjecture. The different calibration statistics all show the same effects: the full-wavelength approach does seem to sort of “split the difference” and accommodate some, but not all, of the non-linearities; the algorithm uses the data from the linear region to improve the model over what could be achieved from the non-linear region alone. On the other hand, it could not do so completely; it could not ignore the effect of the nonlinearity entirely to give the best model that this data was capable of achieving. Only the single-wavelength model using only the linear region of the spectrum was capable of that.

Figure 27-3 First principal component from concentration spectra. (Axes: wavelength index, 1 to 157; vertical scale 0 to 0.2.)

Table 27-1 Calibration statistics obtained from the three calibration models discussed in the text

                 Linear wavelength    Non-linear wavelength    Principal component
SEE              0                    0.237                    0.0575
Corr. Coeff.     1                    0.9935                   0.9996
F                –                    305                      5294

So we seem to have identified a key characteristic of chemometric modeling that influences the capabilities of the models that can be achieved: not nonlinearity per se, because simple nonlinearity could be accommodated by a suitable transformation of the data, but differential nonlinearity, which cannot be fixed that way. In those cases where this type of differential, or non-uniform, nonlinearity is an important characteristic of the data, then selecting those wavelengths and only those wavelengths where the data are most nearly linear will provide better models than the full-spectrum methods, which are forced to include the non-linear regions as well, are capable of.

Now, the following discussion does not really constitute a proof of this condition (in the mathematical sense), but this line of reasoning is fairly convincing that this must be so. If, in fact, a full-spectrum method is splitting the difference between spectral regions with different types and degrees of nonlinearity, then those regions, at different wavelengths, themselves must have different amounts of nonlinearity, so that some regions must be less nonlinear than others. Furthermore, since the full-spectrum method (e.g., PCR) has a nonlinearity that is, in some sense, between that of the lowest and highest, then the wavelengths of least nonlinearity must be more linear than the full-spectrum method and therefore give a more accurate model than the full-spectrum algorithm. All that is needed in such a case, then, is to find and use those wavelengths.
Thus, when this condition of differential nonlinearity exists in the data, modeling techniques based on searching through and selecting the “best” wavelengths (essentially we’re saying MLR) are capable of creating more accurate models than full-wavelength methods, since almost by definition this approach will find the wavelength(s) where the effects of nonlinearity are minimal, which the full-spectrum methods (PCA, PLS) cannot do.
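The two single-wavelength cases of this simulation are simple enough to sketch in a few lines of Python (our choice of language). Several details below are assumptions on our part, since the chapter does not give its simulation code: the stray light is modeled as a constant fraction of the reference beam energy added to both the sample and reference signals, only the band peaks are simulated (absorbance 0.25 per concentration unit, giving 1.5 at six units), and concentration is regressed on absorbance. The principal-component column of Table 27-1 is not attempted, because it depends on band shapes that are not specified in the text.

```python
import math

def with_stray_light(true_absorbance, stray_fraction):
    """Apparent absorbance when stray light adds to sample and reference signals.

    ASSUMPTION: this stray-light model is ours; the text only states '5% stray light'.
    """
    transmittance = 10.0 ** (-true_absorbance)
    return -math.log10((transmittance + stray_fraction) / (1.0 + stray_fraction))

def inverse_calibration(absorbance, concentration):
    """Regress concentration on a single absorbance; return (SEE, corr, F)."""
    n = len(absorbance)
    mean_a = sum(absorbance) / n
    mean_c = sum(concentration) / n
    saa = sum((a - mean_a) ** 2 for a in absorbance)
    scc = sum((c - mean_c) ** 2 for c in concentration)
    sac = sum((a - mean_a) * (c - mean_c) for a, c in zip(absorbance, concentration))
    slope = sac / saa
    intercept = mean_c - slope * mean_a
    sse = sum((c - (intercept + slope * a)) ** 2
              for a, c in zip(absorbance, concentration))
    see = math.sqrt(sse / (n - 2))
    corr = sac / math.sqrt(saa * scc)
    # F = regression mean square / residual mean square (1 and n - 2 d.f.);
    # reported as infinite when the fit is exact.
    f_ratio = (scc - sse) / (sse / (n - 2)) if sse > 1e-20 else math.inf
    return see, corr, f_ratio

concentrations = [1, 2, 3, 4, 5, 6]
linear_band = [0.25 * c for c in concentrations]            # follows Beer's law
nonlinear_band = [with_stray_light(a, 0.05) for a in linear_band]

for label, band in (("linear", linear_band), ("non-linear", nonlinear_band)):
    see, corr, f_ratio = inverse_calibration(band, concentrations)
    print(label, round(see, 4), round(corr, 4), f_ratio)
```

Under these assumptions the non-linear band gives statistics close to the corresponding column of Table 27-1, and the linear band gives an exact fit. Adding a constant to every absorbance does not change SEE, the correlation coefficient, or F, so a model that omits the normalization by (1 + stray fraction) would give the same statistics.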

REFERENCE 1. The moderator of this discussion group was Bruce Campbell. He can be reached for information, or to join the discussion group by sending a message to: [email protected]. New members are welcome.

28

Challenges: Unsolved Problems in Chemometrics

We term the issues we plan to discuss in this chapter as “unsolved” problems, but that may be incorrect. It may be, perhaps, more accurate to call them “Unaddressed Problems in Chemometrics”. Calling them “unsolved” implies that attempts have been made to solve them, but those attempts were unsuccessful, possibly because these problems are too difficult, or possibly because maybe we are not smart enough. Calling them “unaddressed” on the other hand, really gets to the heart of the matter: a number of problems have come to our attention that nobody seems to be paying any heed to. It may very well turn out that some of these problems are too difficult to solve at the current state of the art in Chemometrics, and maybe we are really not smart enough, but at this point we do not know, and we will never know if nobody tries. Our attention was drawn to these problems via various routes. Some arose from our own work on various projects. Some arose from discussions in the on-line discussion group. Some have been floating around in the backs of our minds for what seems like forever, but only recently crystallized into something concrete enough to write down in a coherent manner so that it could be explained to somebody else. Answers – we have none, only questions. We bring up these points to stir up some discussion, and maybe even a little controversy, and certainly with the hope that we can prod some of our compatriots “out there” to tackle some of these. Conspicuous by its absence is the question of calibration transfer, even though we consider it unsolved in the general sense, in that there is no single “recipe” or algorithm that is pretty much guaranteed to work in all (or at least a majority) of cases. Nevertheless, not only are many people working on the problem (so that it is hardly “unaddressed”), but there have been many specific solutions developed over the years, albeit for particular calibration models on particular instruments. 
So we do not need to beat up on this one by ourselves.
So what are these problems?
1) The first one we mention is the question of the validity of a test set. We all know and agree (at least, we hope that we all do) that the best way to test a calibration model, whether it is a quantitative or a qualitative model, is to have some samples in reserve, that are not included among the ones on which the calibration calculations are based, and use those samples as “validation samples” (sometimes called “test samples” or “prediction samples” or “known” samples). The question is, how can we define a proper validation set? Alternatively, what criteria can we use to ascertain whether a given set of samples constitutes an adequate set for testing the calibration model at hand? A very limited version of this question does, in fact, sometimes appear, when the question arises of how many samples from a given calibration set to keep in reserve for
the validation process. Answers range from one (at a time, in the PRESS algorithm) to half the set, and there is no objective, scientific criterion given for any of the choices that indicates whether that amount is optimum. Each one is justified by a different heuristic criterion, and there is never any discussion of the failings of any particular approach. For example, while the PRESS algorithm is appealing, it does not even test the calibration model: if anything, for n samples it tests n different models, none of which is the one to be used, and so forth. Another shortcoming of PRESS is that if each sample was read multiple times, then a computer program that simply removes one reading at a time does not remove the effect of that sample from the data. Even so, at best any of these answers treats only one aspect of the larger question, which includes not only how many samples, but which ones? A properly taken random sample is indeed representative of the population from which it comes. So one subquestion here is, how should we properly sample? The answer is “randomly” but how many workers select their validation samples in a verifiably random manner? How can someone then tell if their test set is valid, and against what criteria? Some of this goes back to the original question of obtaining a proper and valid set of calibration samples in the first place, but that is a different, although related problem. We can turn that question around in the same way: what are the criteria for telling if a calibration sample set is a valid set? Maybe both problems have the same solution, but we do not know because nobody is working on either one. But to pose the question more directly: how can we tell if any set of samples constitutes a valid test set? Even if they were chosen in a proper random manner, are there any independent tests for their validity? What characteristics should the criteria for deciding be based on, and what are the criteria to use?
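The shortcoming of PRESS noted above, for samples that were each read multiple times, can be illustrated with a small simulation. What follows is only a sketch under stated assumptions: the data are synthetic, the calibration is a one-variable straight-line fit, and the function name `press` and all numerical values are hypothetical choices made for illustration. It compares removing one reading at a time with removing all readings of one sample at a time.

```python
import numpy as np

def press(x, y, groups):
    """PRESS for a straight-line fit, leaving out one group of readings at a time."""
    x, y, groups = map(np.asarray, (x, y, groups))
    total = 0.0
    for g in np.unique(groups):
        held_out = groups == g
        # Fit y = slope*x + intercept on everything except the held-out group.
        slope, intercept = np.polyfit(x[~held_out], y[~held_out], 1)
        residuals = y[held_out] - (slope * x[held_out] + intercept)
        total += float(np.sum(residuals ** 2))
    return total

# Synthetic data (an assumption, not taken from the text): 8 samples, 3 readings each.
rng = np.random.default_rng(0)
n_samples, n_readings = 8, 3
conc = np.linspace(1.0, 8.0, n_samples)            # reference value of each sample
sample_effect = rng.normal(0.0, 0.3, n_samples)    # error shared by all readings of a sample
sample_id = np.repeat(np.arange(n_samples), n_readings)
x = np.repeat(conc + sample_effect, n_readings) + rng.normal(0.0, 0.02, sample_id.size)
y = np.repeat(conc, n_readings)

press_by_reading = press(x, y, np.arange(sample_id.size))  # one reading removed at a time
press_by_sample = press(x, y, sample_id)                   # all readings of a sample removed
print(press_by_reading, press_by_sample)
```

When only one reading is removed, the other readings of the same sample remain in the calibration and carry that sample's error with them, so the reading-wise figure would typically be expected to look more favorable than the sample-wise one. With a single reading per sample the two calculations coincide.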
2) The next problem we bring up for discussion is the definition of “validation”. Now, we are sure there are some who will complain that we are arguing terminology rather than substance. However, we think that agreement on what terms mean has substantive consequences, especially in modern times when standards-setting organizations (e.g., ASTM) and government agencies are taking an interest in what we do. As we will see below, there is the question of the time required to validate. On the one hand, we recognize that verifying the accuracy of a given model at the time that model is created may or may not be a sufficient test of its long-term behavior, and we may need to include long-term testing procedures. On the other hand, if government agencies create regulations for how models are to be validated, which presumably they are likely to do on the basis of what we ourselves decide is required, do we want to be constrained to not being able to declare that we have created a model until months or years have passed? Such questions involve much more than terminology, especially if the government decides that “validation” is, in fact, whatever we claim it is.
As we hinted above, the most common use of the term “validation” involves simply retaining some samples separately from the main set of calibration samples and using those as a more-or-less independent test of the accuracy of the calibration model obtained. However, this definition is not universally agreed to. When the subject came up in the on-line discussion group, the following comment was made by Richard Kramer [1]:
The issue Howard raises is an important one. However, I disagree with his characterization of validation and with the resulting conclusion. It all depends upon
what one means by the concept of validation. If validation means the ongoing validation of a plurality of alternative models (my preferred meaning), it DOES become the means of selecting one model over others. And importantly, it permits selection of models which exhibit the best performance with respect to time-related properties such as robustness. It is not uncommon to observe that the model which initially appears to be optimum is the one whose performance degrades most rapidly as time passes. Validation over time also provides a means of gaining insight into which portions of the data might contain more confusion than information and would be best discarded. In particular, it can be interesting to look at the data residuals over time. It is not uncommon to find that the residuals in some parts of the data space increase more rapidly, over time, than the residuals in other parts of the data space. Generally excluding (or de-weighting) the former from the model can improve the model’s performance, short term and long term. Certainly Richard raises valid points, and you can hardly fault his prescription for monitoring and improving the results. However, is that considered, or should that be considered a requirement for validation, or even a necessary part of the validation process? The response comment to Richard at the time was as follows: I think Rich & I agree more than we disagree. If you use his definition of validation then what he says follows. However, that definition is not the one in common use – the MUCH more common definition is simply the one that tells you to separate your calibration samples & keep some out of the calibration calculations, then use those to validate. Once you’ve gone to the trouble to collect data over time then your options expand greatly. Not only can you use that data for ongoing validation, you can also include those new readings in the calibration calculations. 
There are at least two ways to do this:
1) As Richard implies, one way is to gradually replace the older data with the new as it becomes available. This has been standard practice for a long time, for example in the agricultural industry, where old samples will never be seen again. A grain elevator, e.g., will never again have to measure another sample from the 1989 crop year.
2) The other obvious extension, which is more useful for the case where you may still have to measure samples with the same characteristics as the old ones, is to simply keep adding to and expanding the calibration set as new samples become available. The new samples then not only allow you to test for robustness, but inclusion of such samples will actually make the calibration more robust. I think we all know this intuitively, but I have also been able to prove this mathematically.
So validation may not only involve the time frame required to perform it, it may also involve questions of the models (or at least the number of models) being tested. So there we have it: what exactly is “validation”?
3) The next unsolved problem we bring up is the question of error in the classification of training samples when calibrating an instrument to do identification. We mentioned
this briefly in a recent column, but it is worth some more discussion. The problem appears to arise primarily in medical applications, so as a non-proprietary example, let us imagine we are interested in identifying the degree of burn of a burn victim: that is whether the subject has a 1st, 2nd or 3rd degree burn. The distinctions are medically important, and furthermore there are qualitative differences between them despite the fact that they arise out of the quantitative difference in the amount of heat involved. In these respects this typifies other medical situations. We could take spectra of the burned areas from subjects who have been burned, but there is a certain amount of subjectivity in assigning the degree of burn in a given case, and occasionally two physicians will disagree on the designation of the degree of burn in some cases. Clearly, if they disagree, they both cannot be correct, so if we use one or the other’s diagnosis, the training classification will also occasionally be in error. While there is certainly a progression in the intensity and severity of the burn as we go from 1st to 3rd degree burns, we cannot simply use a quantitative scale, for a number of reasons: a quantitative scale of that sort is not agreed to by all physicians, it would be, at best, highly nonlinear, and most importantly, there are real qualitative differences between tissue subjected to the different extents of damage, besides the potential quantitative ones. Because of this, a straightforward quantitative approach would not suffice, even if one could be developed. We need methods to deal with the existence of errors in the training classifications when training instruments to do automated identification. 4) The final problem we bring up is based on the question of modeling based on individual wavelengths versus full-spectrum methods and the modern variations on those themes. Basically the question can be put: “How far should we go in eliminating wavelengths?”. 
As we discussed in a recent column, as well as in times past, our backgrounds are from the days of pre-PCA/PLS/PCR/NN calibration modeling, and there we learned the value of wavelength-based models (principally MLR, or P-matrix as it’s sometimes called), which we only recently crystallized into something concrete enough to write down in a coherent manner so that it could be explained to somebody else. (Does that sound familiar?) The full-spectrum methods (PLS, PCR, K-matrix, etc.) have their advantages and, as we recently discussed, so do the individual-wavelength methods. The users of the full-spectrum approaches have in recent years taken an empirical, ad hoc approach to the question of wavelength elimination, finding that there was benefit to it, even if there were no explanations of the reasons for that benefit. Our initial reaction was something on the order of: why not go the whole way and eliminate all the wavelengths except those few that are needed to do the analysis (i.e., go to the limit of wavelength elimination, which essentially brings it back to MLR)? However, now that we know what the benefit of MLR-type modeling is, it is clear that eliminating all those wavelengths is counterproductive, because it throws the baby out with the bathwater, so to speak. Ideally, we should like to devise criteria for determining how many wavelengths, and which wavelengths, to keep and which to eliminate, to obtain the optimum balance between the noise-reduction capabilities of the full-spectrum methods and the linearity-maximization capabilities of the individual-wavelength approaches.

Well, there we have it: our list of current unsolved/unaddressed problems. Hop to it, readers!!!

REFERENCE
1. Chemometrics discussion group moderated by Bruce Campbell. He can be reached for information, or to join the discussion group, by sending a message to: [email protected]. New members are welcome.

29 Linearity in Calibration: Act II Scene I

When we first published our chapter “Linearity in Calibration” as an article in Spectroscopy magazine [1] we did not quite realize what a firestorm we were going to ignite, although, truth be told, we did not expect everybody to agree with us, either. But if so many actually took the trouble to send their criticisms to us, then there must also be a large “silent majority” out there that are upset, perhaps angry, and almost certainly misunderstanding what we said. We prepared responses to these criticisms, but they became so lengthy that we could not print them all in a single published column, and thus the topic is included in several smaller chapters.
At this point in our discussion, let us raise the question of the linearity of spectroscopic data as a general topic. There are a number of causes of nonlinearity that most chemists and spectroscopists are familiar with. Let us define our terms. When speaking of “linearity” the meaning of the term depends on your point of view, and your interests. An engineer is concerned, perhaps, with the linearity of detector response as a function of incident radiant energy. To a chemist or spectroscopist, the interest is in the linearity of an instrument’s readings as a function of the concentration of an analyte in a set of samples. In practice, this is generally interpreted to mean that when measuring a transparent, non-scattering sample, the response of the instrument can be calculated as some constant times the concentration of the analyte (or at least some function of the instrument response can be calculated as a constant times some other function of the concentration). In spectroscopic usage, that is normally interpreted as meaning the condition described theoretically by Beer’s Law, that is, the instrument response function is the negative exponential of the concentration:

$$I = k I_0 e^{-bC} \tag{29-1}$$

where
$I$ = the radiation passing through the sample
$k$ = the multiplying constant
$I_0$ = the radiation incident on the sample
$b$ = the product of the pathlength and absorptivity
$C$ = the concentration of the analyte.

When other types of samples are measured, the resulting data is usually known to be nonlinear (except possibly in a few special cases), so those measurements are of no interest to us here. Thus, in practice, the invocation of “linearity” implies the assumption that Beer’s Law holds, therefore discussions of nonlinearity are essentially about those phenomena that cause departures from Beer’s law.
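The sense in which Beer’s Law counts as “linear” (a function of the instrument response equals a constant times the concentration) can be checked numerically. The following is a minimal sketch, assuming hypothetical values for b, k and the incident intensity; the names `B`, `K`, `I0` and `transmitted_intensity` are ours and are used only for illustration. It evaluates equation 29-1 and fits straight lines to both the raw response and the negative logarithm of the normalized response.

```python
import numpy as np

# Hypothetical constants, chosen only for illustration.
B, K, I0 = 1.5, 0.9, 100.0

def transmitted_intensity(conc):
    """Equation 29-1: I = k * I0 * exp(-b * C)."""
    return K * I0 * np.exp(-B * np.asarray(conc, dtype=float))

conc = np.linspace(0.0, 2.0, 5)
intensity = transmitted_intensity(conc)

# A straight line fitted to the raw intensity leaves large residuals ...
raw_fit = np.polyfit(conc, intensity, 1)
raw_residual = intensity - np.polyval(raw_fit, conc)

# ... whereas -ln(I / (k * I0)) is a constant (b) times the concentration.
log_response = -np.log(intensity / (K * I0))
slope, intercept = np.polyfit(conc, log_response, 1)
log_residual = log_response - (slope * conc + intercept)

print(np.abs(raw_residual).max(), np.abs(log_residual).max(), slope)
```

The fitted slope of the log-transformed response recovers b, which is the behavior that “linearity” refers to in the remainder of this chapter.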

These include

1) Chemical causes
   a) Hydrogen bonding
   b) Self-polymerization or condensation
   c) Interaction with solvent
   d) Self-interaction
2) Instrumental causes
   a) Nonlinear detector
   b) Nonlinear electronics
   c) Instrument bandwidth broad compared to absorbance band
   d) Stray light
   e) Noncollimated radiation
   f) Excessive signal levels (saturation).

Most chemists and spectroscopists expect that in the absence of these distinct phenomena causing nonlinearity, Beer’s Law provides an exact description of the relationship between the absorbance and the analyte concentration. Unfortunately the world is not so simple, and Beer’s Law never holds exactly, EVEN IN PRINCIPLE. The reason for this arises from thermodynamics.
Optical designers and specialists in heat transfer calculations in the chemical engineering and mechanical engineering sciences are familiar with the mathematical construct known as The Equation of Radiative Transfer, although most chemists and spectroscopists are not. The Equation of Radiative Transfer states that, disregarding absorbance and scattering, in a lossless optical system

$$dE = I(\lambda)\, d\lambda\, d\omega\, da\, dt \tag{29-2}$$

where
$dE$ = the differential energy transferred in differential time $dt$
$I(\lambda)$ = the optical intensity as a function of wavelength (i.e., the “spectrum”)
$d\lambda$ = the differential wavelength increment
$d\omega$ = the differential optical solid angle the beam encompasses
$da$ = the differential area occupied by the beam.

For a static (i.e., unvarying with time) system, we can recast equation 29-2 as:

$$dE/dt = I(\lambda)\, d\lambda\, d\omega\, da \tag{29-3}$$

where $dE/dt$ is the power in the beam. The application of these equations to heat transfer problems is obvious, since by knowing the radiation characteristics of a source and the geometry of the system, these equations allow an engineer, by integrating over the differential terms of equation 29-2 or equation 29-3, to calculate the amount of energy transferred by electromagnetic radiation from one place to another.
Furthermore, the first law of thermodynamics assures us that $dE/dt$ will be constant anywhere along the optical beam, since any change would require that the energy in the beam be either increased or decreased, which would require that energy would be either created or destroyed, respectively.
Less obviously, perhaps, the second law of thermodynamics assures us that the intensity, $I$, is also constant along the beam, for if this were not the case, then it would be possible to focus all the radiation from a hot body onto a part of itself, increasing the radiation flux onto that portion and raising the temperature of that portion without doing work – a violation of the second law.
The constancy of beam energy and intensity has other consequences, some of which are familiar to most of us. If we solve equation 29-3 for the product ($d\omega\, da$) we get:

$$d\omega\, da = \frac{dE/dt}{I(\lambda)\, d\lambda} \tag{29-4}$$

All the terms on the right-hand side of equation 29-4 are constants, therefore for any given wavelength and source characteristics, the product ($d\omega\, da$) is a constant, and in an optical system one can be traded off for the other. We are all familiar with this characteristic of optical systems, in the magnification and demagnification of images described by geometric optics. Whenever light is brought to a small focus (i.e., $da$ becomes small) the light converges on the focal point through a large range of angles (i.e., $d\omega$ becomes large) and vice versa. This trade-off of parameters is more obvious to us when seen through the paradigm of geometric optics, but now we see that this is a manifestation of the thermodynamics underlying it all.
We are also familiar with this effect in another context: in the fact that we cannot focus light to an arbitrarily small focal point, but are limited to what we usually call the “diffraction limit” of the radiation in the beam. This effect also comes out of equation 29-4, since there is a physical (or perhaps a geometrical) limit to $d\omega$: $d\omega$ cannot become arbitrarily large, therefore $da$ cannot become arbitrarily small. Again, we are familiar with this effect by coming across it in another context, but we see that it is another manifestation of the underlying thermodynamic reality.
Getting back to our main line of discussion, we can see from equation 29-2 (or equation 29-3) that the differential terms must all have finite values. If any of the terms $d\lambda$, $d\omega$, or $da$ were zero, then zero energy would pass through the system and we could not make any measurements. One thing this tells us, of interest to us as spectroscopists, is that we can never build an instrument with perfect resolution. The mechanistic fundamentals (quantum broadening, Doppler broadening, etc.) have been extensively discussed by one of our colleagues [2].
This effect also manifests itself in the fact that every technology has an “instrument function” that is convolved with the sample spectrum, and each instrument function is explained by the paradigms of the associated technology, but since “perfect” resolution means that $d\lambda = 0$, we see again that this is another result of the same underlying thermodynamics.
More to the point of our discussion regarding nonlinearity, however, is the fact that $d\omega$ cannot be zero. $d\omega$ is related to the concept of “collimation”: for a “perfectly collimated” beam, $d\omega = 0$. But as we have just seen, such a beam can transfer zero energy; so just as with $d\lambda$ and $da$, a perfectly collimated beam has no energy.

Figure 29-1 Diagram showing the pathlength in a sample for a ray going straight through (to $I_1$) and those going at an angle (to $I_2$).

Beer’s law, on the other hand, is based on the assumption that there is a single pathlength (normally represented by the variable b in the equation A = abc) for all rays through the sample. In a real, physical, measurement system, this assumption is always false, because of the fact that $d\omega$ cannot be zero. As Figure 29-1 shows, the actual rays have pathlengths that range from $b$ (for those rays that travel “straight through”, i.e., normal to the sample surfaces) to $b/\cos(\theta_{max})$ (for the rays at the most extreme angles). We noted this effect above as item 2e in our list of sources of nonlinearity, and here we see the reason that there is a fundamental limitation. Mechanistically, the nonlinearity is caused by the fact that the absorbance for the rays traveling normally = abc, while for the extreme rays it is $abc/\cos(\theta_{max})$. Thus the non-normal rays suffer higher absorbance than the normal ones do, and the discrepancy (which equals $abc\,(1 - 1/\cos\theta)$) increases in magnitude with increasing concentration.
When the medium is completely nonabsorbing, then the difference in pathlength does not affect the measurement. When the sample has absorbance, however, it is clear that ray $I_2$ will have its intensity reduced more than ray $I_1$, due to the longer pathlength. Thus not all rays are reduced by the same amount and this leads to the nonlinearity of the measurement. Mathematically, this can be expressed by noting that the intensity measured when a beam with a finite range of angles passes through a sample is

$$I = I_0 \int_0^{\theta_{max}} e^{-bC/\cos\theta}\, d\theta \tag{29-5}$$

rather than the simpler form shown in equation 29-1 (which, we remind the reader, only holds true for “perfectly collimated” beams, which have zero energy). In practice, of course, this effect is very small, normally much smaller than any of the other sources of nonlinear behavior, and we are ordinarily safe in ignoring it, and calling Beer’s law behavior “linear” in the absence of any of the other known sources of nonlinear behavior. However, the point here is that this completes the demonstration of our statement above, that Beer’s law never exactly holds IN PRINCIPLE and that as spectroscopists we never ever really work with perfectly linear data.
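The size of this effect can be explored numerically. The sketch below is an illustration under assumptions of our own, not a calculation from the original article: it takes the exponent for a ray at angle θ to be −bC/cos θ (the exponent of equation 29-1 with the path lengthened by 1/cos θ), weights all angles between 0 and θmax equally, normalizes by the angular range so that a nonabsorbing sample gives zero absorbance, and uses hypothetical values for b and θmax. The names `B`, `THETA_MAX` and `apparent_absorbance` are likewise hypothetical.

```python
import numpy as np

# Hypothetical values, chosen only for illustration.
B = 1.0           # pathlength-absorptivity product for the normal ray
THETA_MAX = 0.3   # largest ray angle in radians (a deliberately wide beam)

def apparent_absorbance(conc, b=B, theta_max=THETA_MAX, n_angles=2001):
    """Absorbance seen when rays at angles 0..theta_max are averaged together."""
    theta = np.linspace(0.0, theta_max, n_angles)
    conc = np.atleast_1d(np.asarray(conc, dtype=float))
    # Transmittance of each ray: rows are concentrations, columns are angles.
    transmittance = np.exp(-b * conc[:, None] / np.cos(theta)[None, :])
    # Equal-weight average over the evenly spaced angles.
    return -np.log(transmittance.mean(axis=1))

conc = np.array([0.5, 1.0, 2.0, 4.0])
absorbance = apparent_absorbance(conc)
ratio = absorbance / conc   # would be the constant b for a single pathlength
print(ratio)
```

For a perfectly collimated beam the ratio of absorbance to concentration would equal b at every concentration. With a finite range of angles the ratio lies between b and b/cos(θmax) and drifts downward as the concentration rises, because the longer-path rays contribute progressively less of the transmitted light. Even for the wide beam assumed here the drift is small, in keeping with the remark that this effect is ordinarily safe to ignore.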

REFERENCES 1. Mark, H. and Workman, J., Spectroscopy 13(6), 19–21 (1998). 2. Ball, D.W., Spectroscopy 11(1), 29–30 (1996).

30 Linearity in Calibration: Act II Scene II – Reader’s Comments ...

Some time ago we wrote an article entitled “Linearity in Calibration” [1], in which we presented some unexpected results when comparing a calibration model using MLR with the model found using PCR. That column generated an active response, so we are discussing the subject in some detail, spread over several columns. The first part of these discussions has been published [2]; this chapter is the continuation of that one. In this chapter we now present the responses we received to the original published article [1] in order of receipt, following which we will comment about them in subsequent chapters. Here are the comments:
The first set of comments we received were from Richard Kramer:
[Howard & Jerry], I’m afraid that this month’s Spectroscopy Column is badly off the mark (pun intended (with apologies)). The errors are two-fold with the most serious error so significant that the other error is moot.
1) If I understand the column correctly, a 1-factor model was used. Well, a single linear factor can never be sufficient to properly model a non-linear system. A minimum of 2 factors are required. The synthetic data did NOT demonstrate the advantage of a single linear wavelength over a multiple wavelength model, it merely illustrated the fact that a single linear factor is not sufficient to model non-linear data. We could stop here, but, for the sake of completeness ...
2) The second problem is that we never have the luxury of working with noise-free data. Thus, the column did not ask the right question(s). The proper question to ask is “In what ways and under which circumstances do the signal averaging advantages of multiple-wavelength models outperform or underperform with respect to a single (or n wavelength, where n is a small integer) wavelength calibration when noise is present?” The answer will depend upon the levels of noise and non-linearity and the number of wavelengths in each model.
Regards, Richard We went back and forth a couple of times, but rather than list each of our conversations individually, we will reserve comments until we have looked at all the comments, and then we will summarize our responses to all four respondents together, since several of these response comments say the same things, to some extent.

Second, we received comments from Patrick Wiegand:
Gents, I have always looked forward to reading your articles on Chemometrics in Spectroscopy. They are truly a valuable resource – I usually cut them out and save them for future reference. However, I think your article “Linearity in Calibration” in the June 1998 issue of Spectroscopy leads the reader to an erroneous conclusion. This conclusion results largely because of the assumptions you make about the application of PLS and PCR. I know of no experienced practitioner of chemometrics who would blindly use the “full spectrum” when applying PLS or PCR. In the book “Chemometrics” by Beebe, Pell and Seasholtz, the first step they suggest is to “examine the data.” Likewise, Kramer in his new book has two essential conditions: The data must have information content and the information in the data must have some relationship with the property or properties which we are trying to predict. Likewise, in the course I teach at Union Carbide, I begin by saying that “no modeling technique, no matter how complex, can produce good predictions from bad data.”
In your article, you appear to be creating an artificial set of circumstances:
1) You start with a “perfectly noise-free spectrum”
2) You create an excessively high degree of non-linearity which would never be tolerated by an experienced spectroscopist.
3) You assume the spectroscopist will use the entire spectrum blindly when applying PLS or PCR, even though some parts of the spectrum clearly have no information and other parts are clearly nonlinear.
4) You limit the number of factors for PLS/PCR to 1, even though the number of latent variables must be greater, due to the nonlinearity.
In regards to number 1, by using a perfectly noise-free spectrum, you have eliminated the main advantage of PLS/PCR. That is, the whole point of using these techniques is that they have better ability to reject noise than MLR.
To come to an adequate conclusion as to the best performer, you should at least add an amount of random noise an order of magnitude greater than normal, since the amount of nonlinearity you use is an order of magnitude greater than normal.
Number 2 – I understand that you wanted to use a high degree of nonlinearity so that the absorbance vs. concentration plot will be nonlinear to the naked eye, but you can’t really expect to use this degree of nonlinearity to make a judgmental comparison between two techniques if it is not realistic that it will ever occur in real life.
Number 3 – There are many well-established techniques for choosing which wavelength regions to use when modeling with PLS/PCR. First, I advise people to make sure that the pure component spectrum actually has a band in the location being modeled. If this is not possible, at least only include regions that look like
valid bands – no sense in trying to include low s/n baseline regions. Plots of a linear correlation coefficient vs. wavelength for the property of interest are also useful in choosing the right regions to include in the model. Finally, if the initial model is built using the full-spectrum, an examination of factor plots would reveal areas in which there is no activity. Number 4 – In cases where there is no choice but to deal with nonlinearity in the spectra, then it will be necessary to use more factors than the number of chemical species in the system. Once again, an experienced practitioner will use other ways of choosing the right number of factors, like a PRESS plot, etc. Thus your conclusion – that MLR is more capable of producing accurate models than PLS/PCR – is based on a contrived set of circumstances that would not occur in reality, especially when the chemometrician/spectroscopist is experienced. It would be very interesting also, since the performance of the models presented are so similar, to see how the performance would be affected by noise, drift, etc. which are always present in actuality. I would not be surprised if PLS/PCR outperformed MLR under those circumstances. All of the above would seem to indicate that I am totally against using MLR. This is not the case. In my practice, I always try the simplest approach first. This means first trying MLR. If that does not work, then I use PLS. If that does not work – well, some people may use neural networks, but I have not yet found a need to do so. I think you are right in saying that there has been a lot of hype over PLS (although not as much as there has been over neural nets!) In many cases MLR works great, and I will continue to use it. To paraphrase Einstein, “Always use the simplest approach that works – but no simpler.” The third set of comments we received were from Fred Cahn: I read your article in Spectroscopy (13(6), June 1998) with interest. 
However, I don’t agree with the conclusions and the way your simulation was carried out and/or presented. While I am no longer working in this field, and cannot easily do simulations, I think that a 2 factor PCR or PLS model would fully model the simulated spectra. At any wavelength in your simulation, a second degree power series applies, which is linear in coefficients, and the coefficients of a 2 factor PCR or PLS model will be a linear function of the coefficients of the power series. (This assumes an adequate number of calibration spectra, that is, at least as many spectra as factors and a sufficient number of wavelength, which the full spectrum method assures.) The PCR or PLS regression should find the linear combination of these PCR/PLS coefficients that is linear in concentration. See my publication: Cahn, F. and S. Compton, “Multivariate Calibration of Infrared Spectra for Quantitative Analysis Using Designed Experiments”, Applied Spectroscopy, 42:865–872 (July, 1988).
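The argument that a second-degree power series calls for two factors can be tried out on a toy data set. The sketch below is not the synthetic data from the original article; the band shapes, concentrations and the helper name `pcr_fit_rmse` are hypothetical, introduced only for illustration. Each noise-free spectrum is built as a term linear in concentration plus a term quadratic in concentration, and principal component regression (via a singular value decomposition of the mean-centered spectra) is then fitted with one and with two factors.

```python
import numpy as np

def pcr_fit_rmse(spectra, conc, n_factors):
    """RMS calibration error of principal component regression with n_factors."""
    xc = spectra - spectra.mean(axis=0)           # mean-center the spectra
    yc = conc - conc.mean()                       # mean-center the concentrations
    u, s, _ = np.linalg.svd(xc, full_matrices=False)
    scores = u[:, :n_factors] * s[:n_factors]     # principal component scores
    coef, *_ = np.linalg.lstsq(scores, yc, rcond=None)
    return float(np.sqrt(np.mean((yc - scores @ coef) ** 2)))

# Hypothetical noise-free spectra: absorbance = c * band1 + c**2 * band2.
wavelengths = np.linspace(0.0, 1.0, 50)
linear_band = np.exp(-((wavelengths - 0.3) / 0.05) ** 2)
curvature_band = -0.3 * np.exp(-((wavelengths - 0.7) / 0.05) ** 2)
conc = np.linspace(0.1, 1.0, 10)
spectra = np.outer(conc, linear_band) + np.outer(conc ** 2, curvature_band)

rmse_1 = pcr_fit_rmse(spectra, conc, 1)
rmse_2 = pcr_fit_rmse(spectra, conc, 2)
print(rmse_1, rmse_2)
```

Because the centered spectra have rank two, the two-factor scores span the concentration exactly and the fit error falls to rounding level, while the single-factor model retains a residual. This holds for noise-free data only; it says nothing about how either model behaves once noise is added.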

Fred supplied a copy of the cited paper, and we read it. Again, the comments about it will be included among the general comments. And finally, the fourth set of comments we received were from Paul Chabot: Hello, I recently read your column in the Spectroscopy issue of June 1998, which was dealing with “Linearity in Calibration”. First, I have to tell you that I really like your monthly column. You do a good job at explaining the basics and more of many topics related to chemometrics, and “demistify” the subjects. As an avid user of PLS, I was concerned when you were comparing MLR to PLS and PCR on your synthetic data set. Even though I agree with you that in some cases, MLR is a much better approach than PLS or PCR, sometimes the use of a full spectrum technique is essential. In this particular case, I do not doubt your results showing that MLR outperforms the full spectrum techniques because the data set was designed to do so. But out of the full spectrum techniques, I would expect PLS to outperform PCR, and the loading of the first principal component to be mostly located around the lower wavelength peak for PLS. Did you notice any difference between PCR and PLS on this data set? I would appreciate it if you could let me know if you tried both approaches and the results you obtained so I don’t have to regenerate the data. Thank you very much, and keep up the good work, Paul Chabot To summarize the comments (including ones presented during subsequent discussions, and therefore not included above): 1) Richard Kramer, Patrick Wiegand, and Fred Cahn felt that we should have tried two factors. 2) Richard Kramer and Patrick Wiegand thought we should have added simulated noise to the data. 3) All four responders indicated that we should have tried PLS. 4) Richard Kramer, Patrick Wiegand, and Paul Chabot indicated that one PLS factor might do as well as one wavelength. 5) Richard Kramer and Patrick Wiegand thought that our conclusion was that MLR is better than PCA. 
As stated in the introduction to this chapter, we present our responses in chapters to follow.

REFERENCES 1. Mark, H. and Workman, J., Spectroscopy 13(6), 19–21 (1998). 2. Mark, H. and Workman, J., Spectroscopy 13(11), 18–21 (1998).

31 Linearity in Calibration: Act II Scene III

In Chapter 27, we discussed a previously published paper entitled “Linearity in Calibration” [1]. In the chapter and original paper we presented some unexpected results when comparing a calibration model using MLR with the model found using PCR. That chapter, when first published as an article, generated a rather active response, so we are discussing the subject and responding to the comments received in some detail, spread over several chapters. The first two parts of our response were included as Chapters 29 and 30, which refer to the papers published as [2, 3]; this Chapter 31 is the continuation of those.

We ended Chapter 30 with a summary of the comments received regarding a previous “Linearity in Calibration” paper. We therefore pick up where we left off by starting this chapter with that same summary (naturally, anyone who wishes to read the full text of the comments will have to go back and reread Chapter 30 derived from reference [3]):

1) Richard Kramer, Patrick Wiegand, and Fred Cahn felt that we should have tried two factors.
2) Richard Kramer and Patrick Wiegand thought we should have added simulated noise to the data.
3) All four responders indicated that we should have tried PLS.
4) Richard Kramer, Patrick Wiegand, and Paul Chabot indicated that one PLS factor might do as well as one wavelength.
5) Richard Kramer and Patrick Wiegand thought that our conclusion was that MLR is better than PCA.

In addition, each of the responders had some of their own individual comments; we discuss all these below.

We now continue with our responses, and discussion of these comments:

It may surprise some to hear this, especially in light of some of the comments we make below, but we agree with the responders more than we disagree. We also believe, for example, in pre-screening the data, at least as strongly as Patrick Wiegand does, and we believe his comments regarding the way all (or at least, let’s hope all) experienced chemometricians approach a problem.
Indeed, fully half the book that one of us authored [4] was spent on just that point: how to “look at the data”. However, our experience in the “real world” (as some like to call it) of instrument manufacturers has given us a somewhat different slant on the reality of what actually happens when users get hold of a new super-whiz-bang package of calculation. In many years of experience in the NIR applications department at Technicon Instruments, there was about an hour and a half available to teach both theory and practice of calibration to each group of new users; the rest of the training time was spent teaching the students how to set the instrument up, prepare samples, take reproducible readings, and learn the rest of the mechanics needed to run the instrument, take readings, and collect the data. How much attention do you think could be paid to the finer points? This seems to be typical of what happens in the majority of cases involving novice users, and it is rare that there is anyone “back at the plant” who can pick up the ball and take them any further.

Even experienced practitioners can be misled, however. As was pointed out, real data contains various types and amounts of variations in both the X and Y variables. Furthermore, in the usual case, neither the constituent values nor the optical readings are spaced at nice, even, uniform intervals. Under such circumstances, it is extremely difficult to pick out the various effects that are operative at the different wavelengths, and even when the data analyst does examine the data, it may not always be clear which phenomena are affecting the spectra at each particular wavelength.

Now we will respond to the various comments, and make some more observations of our own. We will re-quote the pertinent parts of the communications from the responders, collecting together those on a similar topic and commenting on them collectively. Note that some of these quotes were from later messages than those quoted in our previous column, because they were generated during subsequent discussions, and so may not have appeared previously.

We hope nobody takes our reply comments personally. Both some of the comments and some of our responses are energetic, because we seem to have touched on a subject that turned out to be somewhat controversial. So we do not take the responders’ comments personally, but we do enter with zest and gusto into what looks like something turning into a rather lively debate, and we sincerely hope that everybody can take our own comments in that same spirit.
The format of this column is as follows: each numbered section starts with the comments from the various responders dealing with a given aspect of the subject, followed by our response to them collectively. So now let us consider the various points raised, starting with the use of noise-free data:

1) “You start with a ‘perfectly noise-free spectrum’ ” (Patrick Wiegand)

“In regards to number 1, by using a perfectly noise-free spectrum, you have eliminated the main advantage of PLS/PCR. That is, the whole point of using these techniques is that they have better ability to reject noise than MLR. To come to an adequate conclusion as to the best performer, you should at least add an amount of random noise an order of magnitude greater than normal, since the amount of nonlinearity you use is an order of magnitude greater than normal.” (Patrick Wiegand)

“The second problem is that we never have the luxury of working with noise-free data. Thus, the column did not ask the right question(s). The proper question to ask is ‘In what ways and under which circumstances do the signal averaging advantages of multiple-wavelength models outperform or underperform with respect to a single (or n wavelength, where n is a small integer) wavelength calibration when noise is present?’ The answer will depend upon the levels of noise and nonlinearity and the number of wavelengths in each model.” (Richard Kramer)

“It isn’t a case of ‘extreme difficulty’. It is a situation where, in one case you use a factor which happens to be based upon an explicit model (i.e. linearity) which is correct for the data while stacking the deck against the second case by denying any opportunity to be correct.” (Richard Kramer)

Response: Of course we used noise-free data. Otherwise we could not be sure that the effects we see are due to the characteristics we impose on the data, rather than the random effects of the noise. When anyone does an actual, physical experiment and takes real readings, the noise level or the signal-to-noise ratio is a consideration of paramount importance, and any experimenter normally takes great pains to reduce the noise as much as possible, for just that reason. Why shouldn’t we do the same in a computer experiment?

On the other hand, PCA and PLS are both known to perform better than MLR when the data is noisy because of the inherent averaging that they include. In this we agree fully; indeed, we also mentioned this characteristic in Chapter 27, as well as in the original column. Richard Kramer hit the nail on the head with his question “In what ways …?” The important question, then, that needs to be asked (and answered) is, at what point does one phenomenon or the other become dominant, so as to control or determine which algorithm will provide a better model? The next important question is, how can we tell which phenomenon is dominant in any particular case?

Rich Kramer also had the insight to go to the next step, and realized that the only way to determine whether the nonlinearity is “small” or “large” is by having something to compare to, and the natural characteristic to compare it to is the noise. On this score we also agree with Richard and Patrick fully, and this is one place where much research is needed (there are others; and we will get to them in due course): How do you compare the systematic behavior of nonlinearity with the random behavior of noise?
The standard application of the science of Statistics provides us with tools to detect systematic effects, but how do we go to the next step and ascertain their relative effects on calibration models? These are among the fundamental behavioral properties of calibrations that are not being investigated, but need to be.

There are important theoretical reasons to reduce the spectral noise when doing calibrations. Nevertheless, if the main advantage of PLS is its behavior in the presence of noisy data (as Patrick Wiegand states), that is poor praise indeed. Noise levels of modern instruments are far below those of the past. In some cases, and NIR instruments come to mind here, the noise levels are so low that they are tantamount to having “zero noise” to start with. This improvement in instrumentation is a good thing, and we sincerely doubt that anybody would recommend using a noisy instrument for the sole purpose of justifying a more sophisticated algorithm.

In any case, even if all the above statements are 100% true, it does not affect our discussion because they are beside the point. The behavior of calibration algorithms in the face of noisy data is an important topic and perhaps should be studied in depth, but it was not at issue in the “Linearity in Calibration” column.

2) “You create an excessively high degree of nonlinearity which would never be tolerated by an experienced spectroscopist.” (Patrick Wiegand)

Response: In the absence of random variation, ANY amount of nonlinearity would give the same results, and if we used less, any differences from the results we presented would be only of degree, not of kind. Any amount of nonlinearity is infinitely greater than zero. As we explained in the original column, we deliberately chose an unrealistically large amount of nonlinearity for pedagogical purposes; what would be the point of comparing different calibration lines that the naked eye saw as equally straight? The fact that it is “unrealistically” large is immaterial.

3) “You assume the spectroscopist will use the entire spectrum blindly when applying PLS or PCR, even though some parts of the spectrum clearly have no information and other parts are clearly nonlinear.” (Patrick Wiegand)

Response: Above, I described the situation as we see it, regarding the traps that both experienced and novice users of these very sophisticated algorithms can fall into. Keep in mind the pedagogy involved as well as the chemometrics: by suitable choice of values for the “constituent”, the peaks at the nonlinear wavelengths could have been made to appear equally spaced, and the linear wavelengths appear stretched out at the higher values. The “clarity” of the nonlinearity is due to the presentation, not to any fundamental property of the data, and this clarity does not normally exist in real data. How is someone to detect this, especially if not looking for it? Attempts to address this issue have been made in the past (see [5]) with results that in our opinion are mixed, at best. And that simulated data was also noise-free.

With real data, a more scientifically valid approach would be to correct the nonlinearity from physical theory. In the current case, for example, a scientifically valid approach would be to convert the data to transmission mode, subtract the stray light and reconvert to absorbance: the nonlinear wavelengths would have become linear again.
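As a minimal sketch of the correction just described, the following Python fragment simulates one absorbance band distorted by stray light and then undoes the distortion by converting to transmittance, subtracting the stray light, and reconverting. The additive stray-light model (measured transmittance equals true transmittance plus a stray fraction), the 5% stray fraction, the band strength of 1.5, and all function names are illustrative assumptions, not values taken from the original simulation.

```python
import numpy as np

def apparent_absorbance(true_absorbance, stray_fraction):
    """Absorbance reported by an instrument with additive stray light (assumed model)."""
    transmittance = 10.0 ** (-true_absorbance)
    return -np.log10(transmittance + stray_fraction)

def correct_stray_light(measured_absorbance, stray_fraction):
    """Convert to transmittance, subtract the stray light, reconvert to absorbance."""
    measured_transmittance = 10.0 ** (-measured_absorbance)
    return -np.log10(measured_transmittance - stray_fraction)

def max_deviation_from_line(x, y):
    """Largest absolute residual from the least-squares straight line through (x, y)."""
    slope, intercept = np.polyfit(x, y, 1)
    return float(np.max(np.abs(y - (slope * x + intercept))))

# Hypothetical "constituent" values and band strength (assumptions for illustration).
concentration = np.linspace(0.1, 1.0, 10)
true_absorbance = 1.5 * concentration
stray = 0.05

measured = apparent_absorbance(true_absorbance, stray)
recovered = correct_stray_light(measured, stray)

print("max deviation from a line, measured :", max_deviation_from_line(concentration, measured))
print("max deviation from a line, corrected:", max_deviation_from_line(concentration, recovered))
```

Note that the correction only works here because the stray fraction is supplied as a known quantity; with real data that value would have to come from somewhere.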
There are, of course, several things wrong with this procedure, all of them stemming from the fact that this data was created in a specific way for a specific purpose, not necessarily to be representative of real data:

a) You would have to know a priori that only certain wavelengths (and which ones) were subject to the “stray light” or whatever source of nonlinearity was present.

b) One of the problems of current chemometric practice is the “numbers game” aspect. No matter how soundly based in physical theory a procedure is, if the numbers it produces are not as good (whatever that might mean in a specific case) as a different, more empirical, procedure, the second procedure will be used, no matter how empirical its basis. The counter-argument to that, of course, is something on the order of “Well, we have to get as good results as we can for the user” and there is a certain amount of legitimacy to this statement. However, we know of no other field of scientific study where a situation of this sort is tolerated. Certainly, every field has areas of unknown effects where not all the fundamental physical theory is available, but in all fields other than chemometrics, there are workers investigating these dark areas, to try to fill in the missing knowledge. In chemometrics, on the other hand, for at least the 22 years we have been involved with the field, all we have seen the workers in the field doing are building bigger and higher and more fanciful mathematical superstructures on foundations that few, if any of them, seem to be aware of. We will have more to say about this below.

c) The simple fact that sometimes the nature of the correct physical theory to use is unknown.

d) Finally, the real reason we presented these results the way we did was that the whole purpose of the exercise was to study the effect of this type of variation of the data, so that simply removing it would not only be trivial, it would also be a counterproductive procedure.

4) “If I understand the column correctly, a 1-factor model was used. Well, a single linear factor can never be sufficient to properly model a non-linear system. A minimum of 2 factors are required.” (Richard Kramer)

“PLS should have, in principle, rejected a portion of the non-linear variance resulting in a better, although not completely exact, fit to the data with just 1 factor. The PLS does tend to reject (exclude) those portions of the x-data which do not correlate linearly to the y-block.” (Richard Kramer)

“You limit the number of factors for PLS/PCR to 1, even though the number of latent variables must be greater, due to the nonlinearity.” (Patrick Wiegand)

“In principle, in the absence of noise, the PLS factor should completely reject the non-linear data by rotating the first factor into orthogonality with the dimensions of the x-data space which are ‘spawned’ by the nonlinearity. The PLS algorithm is supposed to find the (first) factor which maximizes the linear relationship between the x-block scores and the y-block scores.
So clearly, in the absence of noise, a good implementation of PLS should completely reject all of the nonlinearity and return a factor which is exactly linearly related to the y-block variances.” (Richard Kramer)

“While I am no longer working in this field, and cannot easily do simulations, I think that a 2 factor PCR or PLS model would fully model the simulated spectra.” (Fred Cahn)

“My “objection” is that you did not seem to look at the 2nd factor, which I think is needed to accurately model the spectra after the background is added.” (Fred Cahn)

“I would expect PLS to outperform PCR, and the loading of the first principal component to be mostly located around the lower wavelength peak for PLS.” (Paul Chabot)

Response: Yes, but: The point being that, as our conclusions indicate, this is one case where the use of latent variables is not the best approach. The fact remains that with data such as this, one wavelength can model the constituent concentration exactly, with zero error – precisely because it can avoid the regions of nonlinearity, which the PCA/PLS methods cannot do. It is not possible to model the “constituent” better than that, and even if PLS could model it just as well (a point we are not yet convinced of since it has not yet been tried – it should work for a polynomial nonlinearity but this nonlinearity is logarithmic) with one or even two factors, you still wind up with a more complicated model, something that there is no benefit to.

Richard Kramer suggested that we use two wavelengths (with the MLR approach) to see what happens. Well, here’s what happens: if the second wavelength is also on the linear absorbance band, you get a “divide by zero” error upon performing the matrix inversion due to the perfect collinearity between the data at the two wavelengths.
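Both two-wavelength cases in this part of the response (the collinear one above and the nonlinear-band one taken up next) can be reproduced on a small scale. The sketch below is not the authors' original data: the constituent values, the band strengths, the 5% additive stray-light model used to create the nonlinear wavelength, and the helper name `design` are all assumptions chosen for illustration. Rather than attempting the matrix inversion, it reports the rank of the design matrix, which shows the same collinearity without depending on how a particular routine reacts to a singular matrix.

```python
import numpy as np

# Hypothetical noise-free "constituent" values (an assumption for illustration).
conc = np.linspace(0.1, 1.0, 10)
stray = 0.05  # assumed stray-light fraction for the nonlinear band

linear_a = 1.0 * conc  # first wavelength on the linear band
linear_b = 0.5 * conc  # second wavelength on the same linear band
nonlinear = -np.log10(10.0 ** (-1.5 * conc) + stray)  # wavelength on the distorted band

def design(*columns):
    """Design matrix with an intercept column followed by the given wavelengths."""
    return np.column_stack([np.ones_like(conc), *columns])

# Case 1: both wavelengths on the linear band. The two columns are exactly
# proportional, so the three-column design matrix has rank 2, not 3, and the
# normal equations cannot be inverted.
collinear = design(linear_a, linear_b)
rank_collinear = np.linalg.matrix_rank(collinear)

# Case 2: one linear and one nonlinear wavelength. The linear wavelength already
# fits the constituent exactly, so least squares leaves nothing for the second one.
mixed = design(linear_a, nonlinear)
coef, *_ = np.linalg.lstsq(mixed, conc, rcond=None)
intercept, b_linear, b_nonlinear = coef

print("rank with two collinear wavelengths:", rank_collinear)
print("coefficients (intercept, linear, nonlinear):", intercept, b_linear, b_nonlinear)
```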
If the second wavelength is on the nonlinear band, the regression coefficient calculated for it is exactly zero (at least to 16 digits, where the computer truncation error becomes important), since it plays exactly no role in the modeling. In other words, not only is it unnecessary to add a second wavelength to the model, it is impossible to do so if you try; when the model is perfectly correct you can’t force a second wavelength into that model even if you want to.

Richard Kramer, Patrick Wiegand, and Paul Chabot suggested that a one-factor PLS model should reject the data from the nonlinear wavelength and therefore also provide a perfect fit to the “constituent”. I offered to provide the data as an EXCEL spreadsheet to these responders; Paul accepted the offer, and I e-mailed the data to him. We will see the results at an appropriate stage.

5) “There are many well-established techniques for choosing which wavelength regions to use when modeling with PLS/PCR. First, I advise people to make sure that the pure component spectrum actually has a band in the location being modeled.” (Patrick Wiegand)

Response: That indeed is a good procedure when you can do it (keeping in mind our earlier discussion regarding users’ reactions to the case of a conflict between theoretical correctness and the experimental “numbers game”), and we also make the same recommendation when appropriate. If anything, proper wavelength choice is even more important when using MLR than either PCA or PLS. But what do you do when the “constituent” is a physical property, with no distinct absorbance band? This consideration becomes particularly pernicious when that property is not itself being calibrated for, but is a variation superimposed on the data, and needs a factor (or wavelength) to compensate for, yet has no absorbance band of its own. The prototype example of this is the “repack” effect found when the measurements are made by diffuse reflectance: “Repack” does not have an absorbance band.

Other situations arise where that approach fails: when the chemistry is unknown or too complicated (octane rating in gasoline, for example).
Here again, even though a fair amount is known about the chemistry behind octane rating, there is no absorbance band for “octane value”. Another case is where the chemistry is known, but the spectroscopy is unknown, because the pure material is not available. Protein, for example, cannot be extracted from wheat (or at least not and still remain protein), so the spectrum of “pure” protein as it exists in wheat is unknown. Even simpler molecules are subject to this effect: we can measure the spectrum of pure water easily enough, for example, but that is not the same spectrum as water has when it is present as an intimate mixture in a natural product – the changes in the hydrogen bonding completely change the nature of the spectrum. And these examples are ones we know about!

6) “Finally, the calibration statistics presented in Table 27-1 show a correlation coefficient of 0.9996 for PCR, even when an obviously nonlinear region is used! I am not sure if this is significantly different from the one shown for MLR using only the linear region. To me either model would be acceptable at the stage of method development where the article ended. Besides, it is unlikely that someone would be able to know a priori that the linear region was the better region to use for MLR.” (Patrick Wiegand)

Response: As a purely practical matter, we agree with that interpretation. However, we hope that by now we have convinced you that we are trying to do more than that – we are trying to find out what really goes on inside the “black boxes” of chemometric calculations. The fact that the value of the PCR correlation coefficient differs significantly from unity becomes clear when you look at the other term of the ANOVA equation: in the MLR case the sum-squared error is zero, in the PCR case it is “infinitely” greater than that. Don’t forget that “significance”, at least in the statistical sense, is defined only when dealing with random variables. This also relates to the earlier comment regarding how to find ways to compare the relative effects of noise and nonlinearity on calibration models.

7) “It would be very interesting also, since the performance of the models presented are so similar, to see how the performance would be affected by noise, drift, etc. which are always present in actuality. I would not be surprised if PLS/PCR outperformed MLR under those circumstances.” (Patrick Wiegand)

Response: Yes, it certainly would be most interesting to investigate this question. This is closely related to the previous discussion concerning the relationship between noise and nonlinearity, so I would modify the statement of the problem to “At what point does one or another effect dominate the behavior of the calibration?” that is, where is the crossover point? Investigating questions of this sort is called “research”, and a more fundamental question arises: why isn’t anybody doing such investigations? Other, related, questions are also important: Having determined this in isolation, how does the data analyst determine this in real data, where unknown amounts of several effects may be present? There is a similarity here to Richard’s earlier point regarding the relationship between the amount of noise and the amount of nonlinearity. Here are more fertile areas for research into the behavior of calibration models.
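One way such an investigation might be set up is sketched below. It is only an illustrative starting point, not a study carried out by the authors: the two Gaussian bands, their positions and widths, the additive 5% stray-light distortion on the second band, the noise levels, and the function names are all assumptions. The script compares the calibration (fit) error of a single wavelength on the linear band with that of a one-factor PCR model over the full spectrum as noise is added, and simply prints the numbers; it uses calibration error rather than validation error and makes no claim about where any crossover lies.

```python
import numpy as np

def simulate_spectra(conc, noise_sd, rng, stray=0.05):
    """Synthetic spectra: one linear band, one band distorted by stray light (assumed model)."""
    wavelength = np.arange(100)
    linear_band = np.exp(-0.5 * ((wavelength - 30) / 5.0) ** 2)
    distorted_band = np.exp(-0.5 * ((wavelength - 70) / 5.0) ** 2)
    true_abs = 1.5 * np.outer(conc, distorted_band)
    distorted = -np.log10(10.0 ** (-true_abs) + stray)
    clean = np.outer(conc, linear_band) + distorted
    return clean + noise_sd * rng.standard_normal(clean.shape)

def calibration_rmse(x, conc):
    """RMS residual of a least-squares fit of conc on one predictor plus intercept."""
    design = np.column_stack([np.ones_like(conc), x])
    coef, *_ = np.linalg.lstsq(design, conc, rcond=None)
    return float(np.sqrt(np.mean((design @ coef - conc) ** 2)))

def compare(noise_sd, seed=0):
    """Return (single-wavelength RMSE, one-factor PCR RMSE) at a given noise level."""
    rng = np.random.default_rng(seed)
    conc = np.linspace(0.1, 1.0, 25)
    spectra = simulate_spectra(conc, noise_sd, rng)
    single = spectra[:, 30]  # peak of the linear band
    centered = spectra - spectra.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[0]  # scores on the first principal component
    return calibration_rmse(single, conc), calibration_rmse(scores, conc)

for noise_sd in (0.0, 1e-4, 1e-3, 1e-2, 1e-1):
    mlr_rmse, pcr_rmse = compare(noise_sd)
    print(f"noise sd {noise_sd:g}: one wavelength {mlr_rmse:.3g}, one-factor PCR {pcr_rmse:.3g}")
```

Varying the stray fraction alongside the noise level would turn this into a two-dimensional map of which effect dominates, which is closer to the question posed in the text.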
8) “At any wavelength in your simulation, a second degree power series applies, which is linear in coefficients, and the coefficients of a 2 factor PCR or PLS model will be a linear function of the coefficients of the power series. (This assumes an adequate number of calibration spectra, that is, at least as many spectra as factors and a sufficient number of wavelength, which the full spectrum method assures.) The PCR or PLS regression should find the linear combination of these PCR/PLS coefficients that is linear in concentration.” (Fred Cahn)

Response: We have read the indicated section of that paper [6], and scanned the rest of it. We agree with much of what it says, both in the paper and in Fred Cahn’s messages, but we are not sure we see the relevance to the column. Certainly, nonlinearities in real data can have several possible causes, both chemical (e.g., interactions that make the true concentrations of any given species different than expected or might be calculated solely from what was introduced into a sample, and interaction can change the underlying absorbance bands, to boot) and physical (such as the stray light, that we simulated). Approximating these nonlinearities with a Taylor expansion is a risky procedure unless you know a priori what the error bound of the approximation is, but in any case it remains an approximation, not an exact solution. In the case of our simulated data, the nonlinearity was logarithmic, thus even a second-order Taylor expansion would be of limited accuracy.

Alternative methods, such as correcting the nonlinearity through the application of an appropriate physical theory as we described above, may do as well or even better than a Taylor series approximation, but a rigorous theory is not always available. Even in cases where a theory exists, often the physical conditions for which the theory is valid cannot be achieved; we demonstrated this in the discussion in Chapters 29 and 30 of the fundamental impossibility of truly achieving “Beer’s Law linearity”. Thus we are left with a situation where even in the best cases we can achieve, there can be residual non-linearities in the data. The purpose of our column was to investigate the behavior of different modeling methods in the face of nonlinearity.

9) “Thus, my interest in 2 or more factor chemometric models of your simulation is in line with this view of chemometrics. I agree with the need for better physical understanding of instrument responses as well as of the spectra themselves. I would not choose PCR/PLS or MLR to construct such physical models, however.” (Fred Cahn)

Response: We were not trying to use the chemometric techniques to create a physical model in the column. We also agree that physical models should be created in the traditional manner, based on the study of the physical considerations of a situation. Ideally you would start from a fundamental physical law and derive, through logic and mathematics, the behavior of a particular system: this is how all other fields of science work. A chemometric technique then would be used only to ascertain the value (from a series of physical measurements) of an unknown parameter that the mathematical derivation created.

What we were trying to do in the column was to ascertain the behavior of a mathematical (not physical!) system in the face of a certain type of (simulated) physical behavior. There is nothing wrong with trying to come up with empirical methods for improving the practical performance of chemometric calibration, but one of the philosophical problems with the current state of chemometrics is that nobody is trying to do anything else, that is to determine the fundamental behavior of these mathematical systems.
10) “The synthetic data did NOT demonstrate the advantage of a single linear wavelength over a multiple wavelength [sic] model …” (Richard Kramer)

“… in one case you use a factor which happens to be based upon an explicit model (i.e. linearity) which is correct for the data while stacking the deck against the second case by denying any opportunity to be correct.” (Richard Kramer)

“In your article, you appear to be creating an artificial set of circumstances: …” (Patrick Wiegand)

“Thus your conclusion – that MLR is more capable of producing accurate models than PLS/PCR – is based on a contrived set of circumstances that would not occur in reality, especially when the chemometrician/spectroscopist is experienced.” (Patrick Wiegand)

Response: Artificial? Contrived? Only insofar as any experimental study is based on a “contrived” set of circumstances – contrived to enable the experimenter to separate the phenomenon of interest and study its effects, with “everything else the same”. But that is a minor matter. Richard and Patrick (and how many others, who didn’t respond?) believe that we concluded that “MLR is better than PCA/PLS”. The really critical point here is that that is NOT our conclusion, and anyone who thinks this has misunderstood us. We put the fault for this on ourselves, since the one thing that is clear is that we did not explain ourselves sufficiently.

Therefore let us clarify the point here and now: we are not fighting a “holy war” against PCA/PLS etc. The purpose of the exercise was NOT to “prove that MLR with wavelength selection is better”, but to investigate and explain conditions that cause that to be so, when it happens (which it does, sometimes). As we discussed in the original column, more and more discussions about calibration processes, both oral and in the literature, describe situations where wavelength selection improved the results (in PCR and PLS as well as MLR), but there has previously been no explanation for this phenomenon. Therefore we decided to investigate nonlinearity since we suspected that to be a major consideration, and so it turned out to be. We continue our discussion in the following chapters.

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 13(6), 19–21 (1998).
2. Mark, H. and Workman, J., Spectroscopy 13(11), 18–21 (1998).
3. Mark, H. and Workman, J., Spectroscopy 14(1), 16–17 (1999).
4. Mark, H., Principles and Practice of Spectroscopic Calibration (John Wiley & Sons, New York, 1991).
5. Mark, H., Applied Spectroscopy 42(5), 832–844 (1988).
6. Cahn, F. and Compton, S., Applied Spectroscopy 42, 865–872 (1988).

32 Linearity in Calibration: Act II Scene IV

This chapter continues our discussion started by the responses received to our Chapter 27 when it was first published as a paper entitled “Linearity in Calibration” [1]. So far our discussion has extended over three previous chapters (29 through 31) whose original published citations are given in references [2–4]. In Chapter 31, originally referenced as [4], we stated, “we are not fighting a ‘holy war’ against PCA/PLS etc.” and then went on to discuss what our original column was really about. However, if there is a “holy war” being fought at all, then from our point of view it is against the practice of simply accepting the results of the computer’s cogitations without attempting to understand the underlying phenomena that affect the behavior of the calibration models, regardless of the algorithm used. This has been our “fight” since the beginning – which can be verified by going back and rereading our very first column ever [5].

The authors do not always agree, but we do agree on the following: it is incomprehensible how a person calling himself a scientist can fail to wonder WHY calibration models behave the way they do, and try to relate their behavior to the properties of the data giving rise to them. There are reasons for everything that happens, whether we know what those reasons are or not, and the goal of science is to determine what those underlying reasons or principles are. At least that is the goal of every other field of scientific endeavor that we are aware of – why is Chemometrics exempt?

Real data, as we have seen, is far too complicated to work with to try to obtain fundamental understanding, just as the physical world is often too complicated to study directly in toto.
Therefore work such as was presented in the “Linearity in Calibration” chapter is needed, creating a simplified system where the characteristic of interest can be isolated and studied – just as physical experiments often work with a simplified portion of the physical world for the same reason. This might be categorized as “Experimental Chemometrics”, controlling the nature of the data in a way that allows us to relate the properties of the data to the behavior of the model. Does this mimic the “real world”? No, but it does provide a window into the inner workings of the calibration calculations, and we need as many such windows as we can get. We will go so far as to make an analogy with Chemistry itself. The alchemists of old had an enormous empirical knowledge base, and from that could do all manner of useful things. But we do not consider alchemy a science, and it did not become a science until the underlying principles and phenomena were discovered and codified in a way that all could use. The current state of Chemometrics is more nearly akin to alchemy than Chemistry: we can do all manner of useful things with it, but it is all empirical and there are still many areas where even the most expert and prominent practitioners treat it as a “black box” and make no attempt to understand the inner workings of that black box.

Empiricism is important and even necessary, but hardly sufficient. The ultimate test of whether something is scientific is its ability to predict – and that does NOT mean SEP!!

The irony of the situation is that a good deal of basic knowledge is available. The field of Chemometrics bypasses all the Statistical basics and jumps right into the heavy-duty sophisticated algorithms: everybody just wants to start running before they can even crawl. We commented on this situation in earlier Chapters 29–31 and previous publications [6], and what response we received was on the order of “Why was so much space wasted before getting to the important part?” It is certainly unfortunate that the portion of the discussion that was perceived as “wasted space” was the important part, but was not recognized as such.

The early foundations of Statistics go back to the 1600s or so, to the time when probability theory was recognized as a distinct branch of mathematics. The current problem is that nobody currently seems to apply the knowledge gained over the intervening span of time, or to be interested in applying that knowledge, or to do fundamental investigations at all. The chemometric community completely ignores the previous mathematical basis underlying its structure. The science of Statistics does, in fact, form a firm foundation that Chemometrics is built on. It is almost shameful that the modern Chemometrics community seems to be content to build ever higher and fancier superstructures on a foundation that is solid enough, but to which it is hardly connected.

Worse, there seems to be an active antipathy to such investigations: just look at the firestorm we aroused by publishing a very small and innocuous study of the fundamental behavior of a particular data system!
In fact, from the response, you would almost think we committed heresy or attacked religious beliefs, in daring to suggest that PCR/PLS was not always the best way to go, much less do some serious research on the subject. Everybody gives lip service to the concept of “fundamental research is good for the long run”, but nobody seems interested in putting that concept into practice, even with the possibility of fairly short-term returns. Let us look at a couple of examples. In reference [7] we found the following passage: “But, it would be dangerous to assume that we can routinely get away with extrapolation of that kind. Sometimes it can be done, sometimes it can’t. There is no simple rule that can tell us which situation we might be facing.” (see p. 129 in [7]). And that passage seems to sum up the current state of affairs. Theoretically, a good straight line should be extrapolatable almost indefinitely, yet we all know how risky it is to extrapolate even a little bit beyond the range of our data. Why does practice not conform to theory? The obvious answer is that something is nonlinear. But why can we not detect this? As Rich says, we do not have any simple rules. Well, OK, so we do not have simple rules. Maybe no simple rules exist. But then, why do we not at least have complicated rules to help us make such important decisions? At least then we would have a way to predict (in the scientific sense) something that is worthwhile knowing. As it stands we have nothing, and nobody seems interested in finding out why. Maybe a new approach is needed. Maybe this is where Fred Cahn’s work is pertinent: if you can approximate the nonlinearity with a Taylor series, then maybe the quality of the fit can provide a diagnostic to form the foundation of a rule on which to base a decision. Maybe something else will work. We do not know, but it is a possible starting


point. Fred, you are in the ideal position to pursue this, how about it – will you accept this challenge? The above example, of course, is relatively abstract and “academic”, and as such perhaps not of too much interest to the majority. Another example, with more practical application, is transfer of calibration models from one instrument to another. This is an endeavor of enormous current practical importance. Witness that hardly a month passes without at least one article on that topic in one or more of the analytical or spectroscopic journals. Yet all those reports are the same: “Effect of Data Treatment ABC Combined with Algorithm XYZ Compared to Algorithm UVW” or some such; they are all completely empirical studies. In themselves there is nothing wrong with that. The problem is that there is nothing else. There are no critical reviews summarizing all this work and extracting those aspects that are common and beneficial (or common and harmful, for that matter). Even worse, there are no fundamental studies dealing with the relationship of the algorithm’s behavior to the underlying physics, chemistry, mathematics, or instrumental effects. It is not difficult to see that the calibration transfer problem breaks down into two pieces: a) the effect of instrumental variation on the data; b) the effect of variations of the data on the model. Studying the effects of instrumental performance should be the province of the manufacturers. Unfortunately, the perception is that it is to their benefit to release such results only if they turn out to be “good”, and there is little incentive for them to perform studies whose only purpose is to increase scientific knowledge. Thus it is up to academia to pick up this particular ball, if there is any interest in it at all. Fundamental studies in those areas will eventually give rise to real knowledge about how and when calibrations can be transferred, and provide us with trustworthy recipes for doing the transfer.
Such knowledge will also provide us with the confidence of knowing that the underlying science is sound, and thus take us beyond the “my algorithm is better than your algorithm” stage that we are now at. Furthermore, true fundamental understanding could also be applied in reverse. Then instrument manufacturers could concentrate on those aspects of construction and operation that affect the transferability situation, and be able to verify their capabilities in an unambiguous, scientifically valid and agreed-on manner. This is just one other example of a current problem that COULD be attacked with fundamental studies, with both short- and long-term benefits that are obvious to all. Connecting to the statistical foundations, as described above, can have other benefits. For example, computing an SEP on a validation set of data is considered the be-all and end-all of calibration diagnostics. This is an important calculation, to be sure, but it has its limitations, as well. For example, the SEP alone has no diagnostic capability: it tells you nothing about what you need to do in order to improve a calibration model. For another, even when you compare SEPs from different models and choose the model with the smallest SEP, that does not necessarily mean you are choosing the best model. We often see “robustness” bandied about in discussions of calibration models, but what diagnostics do we have to quantify “robustness”? Without such a diagnostic, how can we expect to evaluate “robustness” either in isolation or to compare with SEP?


By focusing all our attention on the SEP we have also lost the ability to evaluate calibrations on their own. When calibrating spectrometers to do quantitative analysis, where samples are cheap and easy to come by, this loss is not too serious, but what do you do when a project requires calibration runs that cost a million (or ten million) dollars per run, and minimizing the number of runs is the absolute top priority? In such a case, you will not only not have validation data, you will likely not even have enough calibration data to do a leave-one-out calculation, and then being able to evaluate models from calibration diagnostics alone will be critical. Statisticians have, in fact, developed diagnostic tests that provide information about such characteristics, but the Chemometric community, in our arrogance, think we know better and ignore all this prior work. The statistical community has also developed many local and semi-local diagnostic tools to help understand and improve calibration models; we really need to get back to the roots on this, as well. There are innumerable unsolved problems in Chemometrics that need to be addressed: real, scientific problems, not just new ways to throw numbers around.
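The remark that the SEP alone has no diagnostic capability can be illustrated with a minimal sketch. The residuals below are made up for illustration and are not data from this book. We assume here that SEP is the standard deviation of the validation residuals about their mean, with n - 1 in the denominator; this is only one of several conventions in use, and the function names are ours.

```python
import math

def sep(residuals):
    """Standard deviation of validation residuals about their mean (n - 1 denominator).
    Assumed convention; other definitions of SEP exist."""
    n = len(residuals)
    mean = sum(residuals) / n
    return math.sqrt(sum((r - mean) ** 2 for r in residuals) / (n - 1))

def pearson(x, y):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical reference values and two hypothetical sets of residuals.
reference = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
resid_scattered = [0.2, -0.3, 0.1, -0.1, 0.3, -0.2]
# Same residual values, rearranged so that they rise steadily with the reference value.
resid_trending = sorted(resid_scattered)

sep_scattered = sep(resid_scattered)
sep_trending = sep(resid_trending)
r_scattered = pearson(reference, resid_scattered)
r_trending = pearson(reference, resid_trending)

print(f"SEP (scattered residuals): {sep_scattered:.4f}")
print(f"SEP (trending residuals):  {sep_trending:.4f}")
print(f"Residual vs. reference correlation, scattered: {r_scattered:.3f}")
print(f"Residual vs. reference correlation, trending:  {r_trending:.3f}")
```

The two residual sets give the same SEP because they contain the same values, yet one of them is strongly related to the reference value, which would suggest a systematic problem with the model. The SEP cannot distinguish the two cases; a diagnostic that examines the structure of the residuals can.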

REFERENCES
1. Mark, H. and Workman, J., Spectroscopy 13(6), 19–21 (1998).
2. Mark, H. and Workman, J., Spectroscopy 14(2), 16–27 (1999).
3. Mark, H. and Workman, J., Spectroscopy 14(1), 16–17 (1999).
4. Mark, H. and Workman, J., Spectroscopy 13(11), 18–21 (1998).
5. Mark, H. and Workman, J., Spectroscopy 2(1), 38–39 (1987).
6. Mark, H. and Workman, J., Spectroscopy 13(4), 26–29 (1998).
7. Kramer, R., Chemometric Techniques for Quantitative Analysis (Marcel Dekker, New York, 1998).

33 Linearity in Calibration: Act II Scene V

This chapter is still a continuation of our discussion started by the responses received to Chapter 27 from our initial publication of “Linearity in Calibration” [1]. Up until now our discussion has extended over Chapters 29–32 as original paper publications ([2–5], respectively). At this point, however, we are finally getting toward the end of our obsession with considerations of linearity – at least until we receive another set of comments from our readers. Incidentally, we welcome such feedback, even those that disagree with us or with which we disagree, so please keep it coming. Indeed, it seems that we do not get much feedback unless our readers disagree with us, and feel it strongly enough to feel the need to say so. That is great – there is nothing like a little controversy to keep a book like this interesting: who said chemometrics and statistics and mathematics were dry subjects, anyway?! In our original column on this topic [1] we had only done a principal component analysis to compare with the MLR results. One of the comments made, and it was made by all the responders, was to ask why we did not also do a PLS analysis of the synthetic linearity data. There were a number of reasons, and we offered to send the data to any or all of the responders who would care to do the PLS analysis and report the results. Of the original responders, Paul Chabot took us up on our offer. In addition, at the 1998 International Diffuse Reflectance Conference (The “Chambersburg” meeting), Susan Foulk also offered to do the PLS analysis of this data. Gratifyingly, when Paul and Susan reported their PLS loadings they were identical, even though they used different software packages to do the PLS calculations (PLSIQ and Unscrambler). We are certainly glad we do not have to worry about sorting out differences in software packages (due to different convergence criteria, etc., that sometimes creep into results such as these) on top of the Chemometric issues we want to address.
Figure 33-1 presents the plot of the PLS loadings. Paul and Susan each computed both loadings. Note that the first loading is indistinguishable to the eye from the first PCA loading (see our original column on this topic [1]). Paul and Susan each also computed the two calibration models and performance statistics for both models. Except that various programs did not compute the same sets of performance statistics (although in one case a different computation seemed to be given the same label as SEE), the ones that were reported by both programs had identical values. As expected by all responders, and by your hosts as well, when two-factor models (either PCR or PLS) were computed, the fit of the model to the synthetic data was perfect. Table 33-1 presents a summary of the numerical results obtained, for one-factor calibration models. Interestingly, when comparing the calibration results we find that the reported correlation coefficients agree among the different programs using the same algorithm, but the SEE values differ appreciably; it would seem that not all programs use the same

Figure 33-1 PLS loadings from the synthetic data used to test the fit of models to nonlinearity. Plot title: PLS Loadings; horizontal axis: Index (0 to 300); vertical axis: Loading (–0.3 to 0.2). (see Colour Plate 2)

Table 33-1 Summary of results obtained from synthetic linearity data using one PCA or PLS factor. We present only those performance results listed by the data analyst as Correlation Coefficient and Standard Error of Estimate

Data analyst   Type of analysis   Corr. Coeff.   SEE
Column         PCR                0.999622439    0.057472
Chabot         PCR                0.999622411    0.01434417
Chabot         PLS                0.999623691    0.01436852
Foulk          PLS                0.999624       0.051319

definition of SEE. This leaves in question, for example, whether the value reported for SEE from PLS by Susan Foulk is really as large an improvement over the SEE for PCR reported by your columnists, or if it is due to a difference in the computation used. Since Paul Chabot reported SEE for both algorithms and his values are more nearly the same, even though his computation seems to differ from both the others, the tentative conclusion is that there is a difference in the computation. Indeed, we find that if we multiply our own value for SEE by the square root of 4/5, we obtain a value of 0.0514045, a value that compares to the SEE obtained by Susan Foulk in more nearly the same way that Paul Chabot’s values compare to each other, indicating a possibility that there is a discrepancy in the determination of degrees of freedom that are used in the two algorithms. Based on the values of the correlation coefficients, then, we can find the following comparisons between the two algorithms: as several of the responders indicated, the PLS model did provide improved results over the PCR model. On the other hand, the degree of improvement was not the major effect that at least some of the responders expected. As Richard Kramer expected,


“PLS should have, in principle, rejected a portion of the non-linear variance resulting in a better, although not completely exact, fit to the data with just 1 factor.” Some of this variance was indeed rejected by the PLS algorithm, but the amount, compared to the Principal Component algorithm, seems to have been rather minuscule, rather than providing a nearly exact fit. The specifics of nonlinearity are not extensively discussed as a topic in the multivariate calibration literature, to say the least. Textbooks routinely cover the issues of multiple linear regression and nonlinearity, but do not cover the issue with “full-spectrum” methods such as PCR and PLS. Some discussion does exist relative to multiple linear regression, for example in Chemometrics: A Textbook by D.L. Massart et al. [6], see Section 2.1, “Linear Regression” (pp. 167–175) and Section 2.2, “Non-linear Regression,” (pp. 175–181). The authors state, “In general, a much larger number of parameters [wavelengths, frequencies, or factors] needs to be calculated in overlapping peak systems [some spectra or chromatograms] than in the linear regression problems.” (p. 176) The authors describe the use of a Taylor expansion to negate the second and the higher order terms under specific mathematical conditions in order to make “any function” (i.e., our regression model) first-order (or linear). They introduce the use of the Jacobian matrix for solving nonlinear regression problems and describe the matrix mathematics in some detail (pp. 178–181). There are also forms of nonlinear PCR and PLS where the linear PCR or PLS factors are subjected to a nonlinear transformation during singular value decomposition; the nonlinear transformation function can be varied with the nonlinearity expected within the data. These forms of PCR/PLS utilize a polynomial inner relation as spline fit functions or neural networks. References for these methods are found in [7].
A mathematical description of the nonlinear decomposition steps in PLS is found in [8]. These methods can be used to empirically fit data for building calibration models in nonlinear systems. The interesting point is that there are cases, such as the one demonstrated in the Linearity in Calibration chapter where nonlinearity is the dominant phenomenon, where MLR will fit the data more closely with fewer terms than either PCR or PLS. One could imagine a real case where an analyte would have a minor absorption band such that the magnitude of the spectral band is within a linear region of the measuring instrument. One could also imagine the major absorption band of this analyte is somewhat nonlinear at the higher concentration ranges. In this special case the MLR would provide a closer fit with fewer terms than either the PLS or the PCR, unless the minor band was isolated prior to model development using the PCR or PLS. This points to a continuing need for spectral band selection algorithms that can automatically search for the optimum spectral information and linear fit prior to the calibration modeling step. But all things remaining constant, cases remain where MLR with automatic channel selection feature will provide a more optimum fit, in some cases, than either PCR or PLS. Surprising indeed, to some people! In their day, Principal Components and Partial Least Squares were each considered almost as “the magic answer to all calibration problems”. It took a long time for the realization to dawn that they contain no “magic” and are subject to most of the


same problems as the algorithm previously available (at that time, what we now call MLR). Now we see a surge in other new algorithms: wavelets, neural networks, genetic algorithms, as well as the combining of techniques (e.g., selecting wavelengths before performing a PCA or PLS calculation). While some of the veterans of the “PC wars” (not “political correctness”, by the way) realize that these methods can be overfit just as MLR calibrations can, and have become wary of the problem and more cautious with new algorithms, there is some evidence that a large number, perhaps the majority, of users are not nearly so careful, and are still looking for their “magic answer”. There is a generic caution that needs to be promoted, and of which all users should be made aware, when dealing with these more sophisticated methods. That is the simple fact that every new parameter that can be introduced into a calibration procedure is another way to overfit and hide the fact that it is happening. Worse, the more sophisticated the algorithm the harder it is to see and recognize that this is going on. With PCR and PLS we introduced the extra parameter of the number of factors: one extra parameter. With wavelets we introduce the order and the locality of each wavelet: two extra parameters. With neural nets, we have the number of nodes in each layer: n extra parameters, and then there is even a metaparameter: the number of layers. No wonder reports of overfitting abound (and don’t forget: those are only the ones that are recognized)! And nary a diagnostic in sight. In a perfect world, a new algorithm would not be introduced until a corresponding set of diagnostic methods was developed to inform the user how the algorithm was behaving. As long as we are dreaming, let us have those diagnostics be informative, in the sense that if the algorithm was misbehaving, it would point the user in the proper direction to fix it.
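The SEE comparison made earlier in this chapter (multiplying our own SEE by the square root of 4/5) can be checked numerically. The sketch below uses the SEE values listed in Table 33-1. Reading the factor as a change from 4 to 5 degrees of freedom is our assumption for illustration; the responders' software documentation was not consulted, and the variable and function names are ours.

```python
import math

# SEE values as listed in Table 33-1 (one-factor models).
see_pcr_column = 0.057472     # PCR, reported by the columnists
see_pcr_chabot = 0.01434417   # PCR, reported by Chabot
see_pls_chabot = 0.01436852   # PLS, reported by Chabot
see_pls_foulk = 0.051319      # PLS, reported by Foulk

def rescale_see(see, df_used, df_alternative):
    """SEE = sqrt(SS / df), so the same sum of squares with a different
    degrees-of-freedom count gives SEE * sqrt(df_used / df_alternative)."""
    return see * math.sqrt(df_used / df_alternative)

# Assumption: the sqrt(4/5) factor corresponds to 4 versus 5 degrees of freedom.
see_pcr_rescaled = rescale_see(see_pcr_column, 4, 5)

ratio_chabot = see_pls_chabot / see_pcr_chabot
ratio_foulk_vs_raw = see_pls_foulk / see_pcr_column
ratio_foulk_vs_rescaled = see_pls_foulk / see_pcr_rescaled

print(f"Rescaled PCR SEE:              {see_pcr_rescaled:.7f}")
print(f"Chabot PLS/PCR ratio:          {ratio_chabot:.4f}")
print(f"Foulk PLS / raw PCR ratio:     {ratio_foulk_vs_raw:.4f}")
print(f"Foulk PLS / rescaled PCR ratio: {ratio_foulk_vs_rescaled:.4f}")
```

Before rescaling, the Foulk PLS value appears to be a sizeable improvement over our PCR value; after rescaling, the two differ by a fraction of a percent, which is comparable in size to the difference between Chabot's two values. This is consistent with the tentative conclusion that the apparent improvement is largely a matter of how the SEE was computed.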

REFERENCES
1. Mark, H. and Workman, J., Spectroscopy 13(6), 19–21 (1998).
2. Mark, H. and Workman, J., Spectroscopy 13(11), 18–21 (1998).
3. Mark, H. and Workman, J., Spectroscopy 14(1), 16–17 (1999).
4. Mark, H. and Workman, J., Spectroscopy 14(2), 16–27 (1999).
5. Mark, H. and Workman, J., Spectroscopy 14(5), 12–14 (1999).
6. Massart, D.L., Vandeginste, B.G.M., Deming, S.N., Michotte, Y. and Kaufman, L., Chemometrics: A Textbook (Elsevier Science Publishers, Amsterdam, 1988).
7. Wold, S., Kettanah-Wold, N. and Skagerberg, B., Chemometrics and Intelligent Laboratory Systems 7, 53–65 (1989).
8. Wold, S., Chemometrics and Intelligent Laboratory Systems 14 (1992).

34 Collaborative Laboratory Studies: Part 1 – A Blueprint

We will begin by taking a look at the detailed aspects of a basic problem that confronts most analytical laboratories. This is the problem of comparing two quantitative methods performed by different operators or at different locations. This is an area that is not restricted to spectroscopic analysis; many of the concepts we describe here can be applied to evaluating the results from any form of chemical analysis. In our case we will examine a comparison of two standard methods to determine precision, accuracy, and systematic errors (bias) for each of the methods and laboratories involved in an analytical test. As it happens, in the case we use for our example, one of the analytical methods is spectroscopic and the other is an HPLC method. A particularly opportune event also occurred recently, almost simultaneously with our writing these next few chapters: an article [1] appeared in LC-GC, a sister magazine to Spectroscopy, that also takes concepts that we discussed and described in some of our early chapters, and applies them to a real-life situation (or at least a simulation of a real-life situation; the main difference is that the experiment described deals with macroscopic objects while the “real world” deals in atoms and molecules). In past chapters [2, 3] we also described how probabilistic phenomena give rise to distributions and even included computer programs to allow simulations of this, but given the constraints of time and text space, we were not able to link that to the actual behavior of the physical world nearly as well as Hinshaw does. In the case described, given the venue, the interest is in the chromatography, and for that reason we will not dwell on their application. However, we do strongly urge our readers to obtain a copy of this article and read it for its description of the basis and generation of the distributions that arise from the effects of the random behavior of the physical world.
The probabilistic and statistical experiments described are superb examples of how concepts such as these can be illustrated and brought to life. The statistical tools we describe in the next few chapters, and use for this demonstration, are ones that we have previously described. These tools include statistical hypothesis testing and ANOVA. Our previous descriptions of these topics were generic and rather general; at that time we were interested in presenting the theoretical background and reasoning behind the development of these statistical techniques. Now we will use them in a practical situation, to show how these methods can be used to evaluate various characteristics relating to the precision and accuracy of analytical methods, applying them to real data to simultaneously demonstrate how to use them and the nature of the results that can be obtained. We will use ANOVA to evaluate potential bias in reported results inherent in the analytical methods themselves, or due to the operators (i.e., location of laboratory) performing the methods. For the next series of articles all computations were completed using MathCad Worksheets [4] written by the authors. The objective of this next set of articles is to determine the precision, accuracy, and bias due to choice of analytical


method and/or operator for the determination of an analyte within a set of hypothetical production samples and spiked recovery samples (samples of gravimetrically known composition). The discussion will occupy the Chapters 34–39.
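As a concrete reading of what it means to examine precision and bias for a spiked recovery sample, here is a minimal sketch. It uses the five replicate readings for Sample 5 (Lab 1, Method A) and the target of 3.61 given with Table 34-1 of this chapter, as we read them from the table. Taking the standard deviation as the measure of precision, the mean minus the target as the bias, and a one-sample t statistic as the comparison against the known value are our assumptions for illustration; this is not the authors' MathCad worksheet.

```python
import math
import statistics

# Spiked recovery Sample 5: target level and replicate readings
# (Lab 1, Method A), as read from Table 34-1.
target = 3.61
readings = [3.538, 3.539, 3.544, 3.540, 3.543]

mean = statistics.mean(readings)
sd = statistics.stdev(readings)            # sample standard deviation (n - 1)
bias = mean - target                       # systematic departure from the known value
std_err = sd / math.sqrt(len(readings))    # standard error of the mean
t_stat = bias / std_err                    # one-sample t statistic, n - 1 degrees of freedom

print(f"mean = {mean:.4f}, sd = {sd:.4f}")
print(f"bias = {bias:.4f}, t = {t_stat:.1f}")
```

The t statistic would then be compared with a tabulated critical value for n - 1 = 4 degrees of freedom at the chosen confidence level. A small standard deviation combined with a large bias, as here, is the pattern of a method that is precise but reads systematically low for this sample.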

EXPERIMENTAL DESIGN
The experimental design used for this hypothetical study is based on a relatively simple factorial model where individual samples are measured as shown in Figure 34-1 and Table 34-1. We have previously discussed factorial designs [5] although, as was the case with ANOVA, our previous discussion was simplified and primarily theoretical, to demonstrate the principles involved, while in the current discussion, we apply these concepts to a more realistic practical situation. For this hypothetical test, samples consist of three production run samples (i.e., Nos. 1–3) with a target analyte value of 3.60 units (percent, grams, pounds, etc.). In addition, three spiked recovery samples with target analyte levels of 3.40, 3.61, and 3.80% respectively are represented by Nos. 4–6. This experimental model allows the methods and locations (labs or operators) to be compared for precision, accuracy, and systematic errors. We will use the designation Lab 1 and Lab 2 to indicate different locations and/or operators performing the identical procedures for METHODS A and B (or I and II). Before considering the design and the analysis of it in detail, let us take a look at the factors that are being included in the design, and their impact on the experimental design and the analysis of this design: we have six samples, two methods of analysis for the constituent of interest, two laboratories, two chemists in each laboratory and five repeat readings of the constituents of each sample by each chemist. Statistical hypothesis

[Figure 34-1: tree diagram. Each sample (n = 6) branches by Location (Lab 1, Lab 2), then by Method (Method I, Method II), then into Replicates (r1 to r5).]

Figure 34-1 A simple factorial design for collaborative data collection. Each sample analyzed (in this hypothetical case n = 6) requires multiple labs, or operators, using both methods of analysis and replicating each measurement a number of times (r = 5) for this hypothetical case.


Table 34-1 “As reported” analytical data∗ for collaborative study

Sample No. – Replicate no.   Lab 1 – Method B   Lab 2 – Method B   Lab 1 – Method A   Lab 2 – Method A
1-1                          3.507              3.507              3.462              3.460
1-2                          3.463              3.497              3.442              3.443
1-3                          3.467              3.503              3.460              3.447
1-4                          3.501              3.473              3.517              –
1-5                          3.489              3.447              3.460              –
Mean                         3.485              3.485              3.468              3.450

2-1                          3.479              3.497              3.446              3.460
2-2                          3.453              3.660              3.448              3.470
2-3                          3.459              3.473              3.455              3.450
2-4                          3.461              3.447              3.456              3.460
2-5                          3.481              3.453              3.455              3.460
Mean                         3.467              3.506              3.452              3.460

3-1                          3.366              3.370              3.318              3.337
3-2                          3.362              3.327              3.330              3.317
3-3                          3.351              3.387              3.328              3.337
3-4                          3.353              3.430              3.322              3.330
3-5                          3.347              3.383              3.323              3.330
Mean                         3.356              3.379              3.324              3.330

4-1                          3.421              3.407              3.366              3.380
4-2                          3.377              3.400              3.360              3.380
4-3                          3.399              3.417              3.361              3.380
4-4                          3.379              3.353              3.362              3.380
4-5                          3.379              3.380              3.370              3.380
Mean                         3.391              3.391              3.364              3.380

5-1                          3.565              3.540              3.538              3.560
5-2                          3.568              3.550              3.539              3.580
5-3                          3.561              3.573              3.544              3.590
5-4                          3.576              3.533              3.540              3.580
5-5                          3.587              3.543              3.543              3.560
Mean                         3.571              3.548              3.541              3.570

6-1                          3.764              3.860              3.741              3.740
6-2                          3.742              3.833              3.740              3.760
6-3                          3.775              3.933              3.739              3.730
6-4                          3.767              3.870              3.742              3.770
6-5                          3.766              3.810              3.744              3.750
Mean                         3.763              3.881              3.741              3.740

∗ Note: For this hypothetical exercise, Samples 1–3 have a target value of 3.60% absolute; whereas Samples 4–6 are Spiked Recovery Samples with target values of 3.40 (No. 4), 3.61 (No. 5), and 3.80 (No. 6).


testing provides us with an objective method of determining whether or not a given difference in conditions (i.e., factor) has an effect on the readings. We have the following a priori expectations for the behavior of these several factors: a) Since we know that the samples are of different composition we expect the measurements of the constituent value to reflect this genuine difference in composition, and therefore to be systematic and constant across all other factors. Any departure from constant differences (beyond the amount expected from random variation due to unavoidable random error of the analysis, of course) can be attributed to an effect of the corresponding factor, or to blunders such as improper mixing or sampling of the material. b) There may be an effect due to the use of two different laboratories. This effect may or may not be the same for the two different methods of analysis. This can be examined by comparing the results of measurements on the same sample by the same method in each of the two different laboratories. Under the proper circumstances, results from multiple samples may be combined to achieve a more definitive test. Before doing so, the existence of the appropriate circumstances must first be determined. c) There may be an effect due to the use of two different methods of analysis. This effect may or may not be the same in the two different laboratories. There may or may not be a difference between the two chemists in each laboratory. This can be examined by comparing the results of measurements on the same sample by the two different methods of analysis. Under the proper circumstances, results from multiple samples may be combined to achieve a more definitive test; if circumstances are appropriate, results from the two chemists in each laboratory and the results from the two laboratories may also be combined. Before doing so, the existence of the appropriate circumstances must first be determined.
d) There may or may not be a difference between the two chemists’ readings of the constituent values in a given laboratory. If we arbitrarily label the chemists in each laboratory as “Chemist #1” and “Chemist #2”, we would not expect a systematic difference between the corresponding chemists in the two different laboratories. This can, however, happen by coincidence. This can be examined by comparing the results of measurements on the same sample by the two different chemists in each laboratory. Under the proper circumstances, results from multiple samples may be combined to achieve a more definitive test. Before doing so, the existence of the appropriate circumstances must first be determined. Many of these aspects will be presented over the next several chapters. e) We do not expect any systematic effects among the five repeat readings of each sample by each chemist in each laboratory. We do expect random variations, reflecting unavoidable random errors of measurement. These unavoidable random errors of measurement are quantified by the terms “precision” and “accuracy”. f) We expect the precision and accuracy for each method to be the same at both laboratories. This can be examined by comparing the precision and accuracy of each method in each laboratory, combining results from multiple samples when appropriate. Before doing so, the existence of the appropriate circumstances must first be determined. g) We do not expect the precision and accuracy to be the same for the two methods except by coincidence.


h) We expect the precision and accuracy to be the same for all four chemists for each method, unless we find a difference in precision and/or accuracy between laboratories. This can be examined by comparing the precision and accuracy of each method as performed by each chemist, combining results from multiple samples when appropriate. Before doing so, the existence of the appropriate circumstances must first be determined. The use of the statistical tools of ANOVA and statistical hypothesis testing, described previously in these chapters and whose application is described in further detail below, allows separation of the effects due to the various factors and objective verification as to which ones are statistically significant. In the absence of any systematic effects due to one or more of the factors, our a priori expectation is that any differences seen are due to the effects of unavoidable random errors only, and will therefore be non-significant. Therefore, any statistically significant effects found due to differences between sets of readings indicates that the corresponding factor has a real, systematic effect on the readings. By posing the scientific questions about the effects of the factors in the formalism of statistical hypothesis tests [6], any statistically significant result is an indication that the corresponding factor has a real, systematic effect on the readings, and this gives us the handle we need to extract that information from the mass of data we obtain from this simple-seeming, but (as we see) actually very complicated experimental design. Data analysis for this series was performed using MathCad and the statistical methods used are described in greater detail in Youden’s monograph [7] and in Mark and Workman [8]. We use the MathCad worksheets both to illustrate how the theoretical concepts can be put to actual use and also to demonstrate how to perform the calculations we describe. 
The worksheets will be printed along with the chapters in which they are first used. At a later date we are planning to enable you to go to the Spectroscopy home page (http://www.spectroscopymag.com) and find them. If, and when, the actual URLs for the worksheets become available, we will let you know. The primary goal of this series of chapters is to describe the statistical tests required to determine the magnitude of the random (i.e., precision and accuracy) and systematic (i.e., bias) error contributions due to choosing Analytical METHODS A or B, and/or the location/operator where each standard method is performed. The statistical analysis for this series of articles consists of five main parts:
Part 1: Overall comparison of both locations and analytical methods for precision and accuracy;
Part 2: Analysis of Variance testing for both locations and analytical methods to determine if an overall bias exists for location or analytical method;
Part 3: Testing for systematic error in each method by performing a comparison test for a set of measurements versus the known True Value;
Part 4: Performing a ranking test to determine if either analytical method or location affects the results as a systematic error (bias); and
Part 5: Computing the “efficient comparison of two methods” as described by Youden and Steiner in reference [7].
The analyst may use one or more of these statistical test methods to compare analytical results depending upon individual requirements. It is recommended that the easiest

172

Chemometrics in Spectroscopy

and most fruitful test for the effort expended would be the test method described in Chapter 38. This simple set of tests statistically compares precision, accuracy, and sys tematic error for two methods with the minimum quantity of analytical effort. Chapter 38 is most highly recommended above the Chapters 34–37, but it is a useful tool to proceed through an understanding of the first chapters before proceeding to Chapter 38. The basic experimental design required for statistical methods in Chapters 34–37 is demonstrated in Figure 34-1 and the data is presented in Table 34-1. The basic experimental design required for Chapter 38 statistical methods is given in Figure 34-2 and the corresponding data in Table 34-2. Thus, if you would like to follow along by performing these tests on your own real data, the basic designs are demonstrated here to allow you to collect data before proceeding through the statistical methods described within the next 6 chapters.

[Figure 34-2: tree diagram with three levels. Method: Method A and Method B. Sample: Sample X and Sample Y under each method. Replicates: r1 to r5 under each sample.]

Figure 34-2 Simple experimental design for Youden/Steiner comparison of two Methods (data shown in Table 34-2).

Table 34-2 Analytical data entry for comparison of two methods tests

          Method A                Method B
          Sample X    Sample Y    Sample X    Sample Y
          3.366       3.741       3.421       3.764
          3.380       3.740       3.407       3.860
          3.360       3.740       3.377       3.742
          3.380       3.760       3.400       3.833
Mean      3.372       3.745       3.401       3.800

Collaborative Laboratory Studies: Part 1


ANALYTICAL METHODS

Sample collection and handling

Let us say the first three samples tested were collected by Lab 2 from their production facility. These samples were retained from actual production lots. An aliquot from each retained jar was removed and shipped to Lab 1 in appropriate sealed containers. METHOD B testing was started at both laboratories the day following receipt of the samples to rule out any possible aging effects. METHOD A testing was performed in Lab 1 on the following day, while the METHOD A testing in Lab 2 occurred a week later.

The second three samples were spiked, produced at Lab 2 using the pure analyte reagent and Control material. An aliquot of each sample was shipped to Lab 1 in appropriate sealed containers. Once again, the METHOD B testing was performed on the same day at both locations. METHOD A testing was done at both sites within a 2-day time period.

METHOD A and B analysis

All six samples at both sites were prepared the same way. Five separate aliquots from each sample were separately sampled and prepared for testing. Each aliquot was then measured three times. Conditions and standard operating procedures for METHODS A and B were carefully specified for both Labs 1 and 2.

RESULTS AND DATA ANALYSIS

Comparing all laboratories and all methods for precision and accuracy

COMPARISON OF PRECISION AND ACCURACY FOR METHODS AND LABORATORIES USING THE GRAND MEAN FOR SAMPLES No. 1–3 (Collabor_GM Worksheet), OR BY USING A SPIKED RECOVERY STUDY FOR SAMPLES No. 4–6 (Collabor_TV Worksheet)

To compute the results shown in Tables 34-3 and 34-4, the precision of each set of replicates for each sample, method, and location is individually calculated using the root mean square deviation equation, shown in standard symbolic and MathCad notation as Equations 34-1 and 34-2, respectively. Thus the standard deviation of each set of sample replicates yields an estimate of the precision for each sample, for each method, and for each location. In these equations each y_ij is an individual replicate (j) measurement for the ith sample; ȳ_i is the average of the replicate measurements for the ith sample, for each method, at each location; and N is the number of replicates for each sample, method, and location. The results of these computations are found in Tables 34-3 and 34-4, representing samples 1–3 (hypothetical production samples) and 4–6 (hypothetical spiked samples), respectively.

Table 34-3 Individual sample analysis precision for hypothetical production samples

Sample no.   METHOD B – Lab 1   METHOD B – Lab 2   METHOD A – Lab 1   METHOD A – Lab 2
Sample 1     0.020              0.025              0.0089             0.0089
Sample 2     0.013              0.088              0.0066             0.010
Sample 3     0.0079             0.037              0.0068             0.012
Pooled       0.015              0.057              0.008              0.010

Table 34-4 Individual sample analysis precision for hypothetical spiked recovery samples

Sample no.   METHOD B – Lab 1   METHOD B – Lab 2   METHOD A – Lab 1   METHOD A – Lab 2
Sample 4     0.019              0.025              0.0041             0.000
Sample 5     0.010              0.015              0.0026             0.013
Sample 6     0.012              0.047              0.0019             0.016
Pooled       0.014              0.032              0.0030             0.012

$$S_i = \sqrt{\frac{\sum_{j=1}^{N}\left(y_{ij}-\bar{y}_i\right)^2}{N-1}} \qquad \text{(34-1)}$$

$$S = \sqrt{\frac{\sum \overrightarrow{\left(Y-\operatorname{mean}(Y)\right)^2}}{N-1}} \qquad \text{(34-2)}$$
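As a rough illustration of Equation 34-1, the following Python sketch computes the standard deviation of one set of replicates. It is not the Collabor_GM or Collabor_TV worksheet; the function name `replicate_precision` is an assumption made for this example, and the input values are the Method A, Sample X replicates listed in Table 34-2 (repeated in Table 38-1), used here only as convenient example data.

```python
import math


def replicate_precision(replicates):
    """Standard deviation of one replicate set, in the form of Equation 34-1."""
    n = len(replicates)
    mean = sum(replicates) / n
    # Sum of squared deviations of each replicate from the replicate mean.
    sum_sq = sum((y - mean) ** 2 for y in replicates)
    return math.sqrt(sum_sq / (n - 1))


# Example input only: Method A, Sample X replicates from Table 34-2.
method_a_sample_x = [3.366, 3.380, 3.360, 3.380]
precision = replicate_precision(method_a_sample_x)
print(f"precision = {precision:.4f}")
```

The same function would be applied once per sample, method, and location to fill a table laid out like Tables 34-3 and 34-4.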

The pooled precision and accuracy for each sample for both analytical methods and locations are calculated using Equations 34-3 and 34-4, representing standard symbolic and MathCad notation, respectively. In the pooled precision, each y_ij is an individual replicate measurement for an individual sample; ȳ_i is the average of the replicate measurements for each sample, each method, each location; and N_i is the number of replicates for an individual (ith) sample, method, and location. The results of these computations are found in the Pooled rows of Tables 34-3 and 34-4, representing samples 1–3 and 4–6, respectively. The results in Tables 34-3 and 34-4 show no trend in error with respect to concentration.

$$P_s = \sqrt{\frac{\sum_{j=1}^{N_1}\left(y_{1j}-\bar{y}_1\right)^2+\sum_{j=1}^{N_2}\left(y_{2j}-\bar{y}_2\right)^2+\sum_{j=1}^{N_3}\left(y_{3j}-\bar{y}_3\right)^2+\sum_{j=1}^{N_4}\left(y_{4j}-\bar{y}_4\right)^2}{N_1+N_2+N_3+N_4-4}} \qquad \text{(34-3)}$$

$$Ps = \sqrt{\frac{\sum\overrightarrow{\left(Y1-\operatorname{mean}(Y1)\right)^2}+\sum\overrightarrow{\left(Y2-\operatorname{mean}(Y2)\right)^2}+\sum\overrightarrow{\left(Y3-\operatorname{mean}(Y3)\right)^2}+\sum\overrightarrow{\left(Y4-\operatorname{mean}(Y4)\right)^2}}{N1+N2+N3+N4-4}} \qquad \text{(34-4)}$$

To compute the results shown in Table 34-5 for production samples, the accuracy of each set of replicates for each sample, method, and location was individually calculated using the root mean square deviation equation as shown in equations 34-5 and 34-6 in standard symbolic and MathCad notation, respectively. The standard deviation of each set of sample replicates yields an estimate of the accuracy for each sample, for each method, and for each location. The accuracy is calculated where each y_ij is an individual replicate measurement; GM is the Grand Mean of the replicate measurements for each sample, both methods, both locations; and N is the number of replicates for each sample, method, and location. The results found in Table 34-5 represent samples 1–3. Note: Each sample had a Grand Mean computed by taking the mean for all measurements made for each of the samples 1–3.

Table 34-5 Individual sample analysis estimated accuracy using grand mean calculation

Sample no.   METHOD B – Lab 1   METHOD B – Lab 2   METHOD A – Lab 1   METHOD A – Lab 2
Sample 1     0.025              0.029              0.029              0.029
Sample 2     0.014              0.096              0.031              0.017
Sample 3     0.012              0.051              0.037              0.024
Pooled       0.018              0.065              0.033              0.024

$$S_i = \sqrt{\frac{\sum_{j=1}^{N}\left(y_{ij}-GM_i\right)^2}{N-1}} \qquad \text{(34-5)}$$

$$S = \sqrt{\frac{\sum \overrightarrow{\left(Y-GM\right)^2}}{N-1}} \qquad \text{(34-6)}$$
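The pooling step of Equation 34-3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the MathCad worksheet itself: the function name `pooled_precision` is hypothetical, and the two replicate sets are the Method A values for Samples X and Y from Table 34-2, chosen only because they are listed in full.

```python
import math


def pooled_precision(replicate_sets):
    """Pooled standard deviation over several replicate sets (form of Equation 34-3)."""
    sum_sq = 0.0
    total_n = 0
    for replicates in replicate_sets:
        mean = sum(replicates) / len(replicates)
        # Each set contributes squared deviations about its own mean.
        sum_sq += sum((y - mean) ** 2 for y in replicates)
        total_n += len(replicates)
    # One degree of freedom is lost for every set mean that was estimated.
    return math.sqrt(sum_sq / (total_n - len(replicate_sets)))


# Example input only: Method A replicates from Table 34-2.
replicate_sets = [
    [3.366, 3.380, 3.360, 3.380],  # Sample X
    [3.741, 3.740, 3.740, 3.760],  # Sample Y
]
pooled = pooled_precision(replicate_sets)
print(f"pooled precision = {pooled:.4f}")
```

With four sets, as written in Equation 34-3, the denominator becomes N1 + N2 + N3 + N4 − 4; the function handles any number of sets in the same way.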

To compute the results shown in Table 34-6 for the Spiked Recovery samples, the accuracy of each set of replicates for each sample, method, and location is individually calculated using the root mean square deviation equation as shown in equations 34-5 and 34-6 in standard symbolic and MathCad 7.0 notation, respectively, with the Spiked or True Values (TV) substituted for GM. The root mean square deviation of each set of sample replicates about its True Value yields an estimate of the accuracy for each sample, for each method, and for each location. Here each y_ij is an individual replicate measurement, and N is the number of replicates for each sample, method, and location. The results found in Table 34-6 represent samples 4 through 6. Note: Each sample had a True Value given by a known analyte spike into the sample.
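The accuracy estimate of Equations 34-5 and 34-6, with a reference value (Grand Mean or True Value) in place of the replicate mean, might be sketched as below. The replicate values and the spike level are hypothetical numbers invented for this illustration, not data from the study, and the function name `rms_about_reference` is likewise an assumption.

```python
import math


def rms_about_reference(replicates, reference):
    """Root mean square deviation about a reference value, with N - 1 in the
    denominator as in Equation 34-5 (reference = Grand Mean or True Value)."""
    n = len(replicates)
    return math.sqrt(sum((y - reference) ** 2 for y in replicates) / (n - 1))


# Hypothetical replicates and hypothetical spiked True Value (illustration only).
replicates = [3.36, 3.37, 3.35, 3.38, 3.36]
true_value = 3.40

accuracy = rms_about_reference(replicates, true_value)
# Using the replicate mean as the reference reduces to the precision of Equation 34-1.
precision = rms_about_reference(replicates, sum(replicates) / len(replicates))
print(f"accuracy = {accuracy:.4f}, precision = {precision:.4f}")
```

In this made-up case the accuracy figure is several times the precision figure, which is the kind of gap that suggests a systematic offset from the reference value.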


Table 34-6 Individual sample analysis accuracy using Spiked Recovery study

Sample no.   METHOD B – Lab 1   METHOD B – Lab 2   METHOD A – Lab 1   METHOD A – Lab 2
Sample 4     0.022              0.027              0.041              0.022
Sample 5     0.044              0.071              0.077              0.042
Sample 6     0.043              0.083              0.066              0.058
Pooled       0.038              0.065              0.063              0.043

Table 34-7 Individual sample precision and accuracy for combined Methods A and B and Labs 1 and 2 – Production samples

No.          GM       Precision   Accuracy
Sample 1     3.472    0.0231      0.0278
Sample 2     3.471    0.0479      0.0538
Sample 3     3.347    0.021       0.033
Pooled       3.430    0.033       0.040

Table 34-8 Individual sample precision and accuracy for combined Methods A and B and Labs 1 and 2 – Spiked Recovery samples

No.          TV       Precision   Accuracy
Sample 4     3.40     0.016       0.029
Sample 5     3.61     0.011       0.061
Sample 6     3.80     0.025       0.064
Pooled       3.603    0.018       0.054

The analytical results for each sample can again be pooled into a table of precision and accuracy estimates for all values reported for any individual sample. The pooled results for Tables 34-7 and 34-8 are calculated using equations 34-1 and 34-2, where precision is the root mean square deviation of all replicate analyses for any particular sample, and where accuracy is determined as the root mean square deviation between individual results and the Grand Mean of all the individual sample results (Table 34-7) or as the root mean square deviation between individual results and the True (Spiked) value for all the individual sample results (Table 34-8). The use of spiked samples allows a better comparison of precision to accuracy, as the spiked samples include the effects of systematic errors, whereas use of the Grand Mean averages the systematic errors across methods and shifts the apparent true value to include the systematic error. Table 34-8 therefore yields a better estimate of the true precision and accuracy for the methods tested.

A simple statistical test for the presence of systematic errors can be computed using data collected as in the experimental design shown in Figure 34-2. (This method is demonstrated in the Measuring Precision without Duplicates sections of the MathCad Worksheets Collabor_GM and Collabor_TV found in Chapter 39.) The results of this test are shown in Tables 34-9 and 34-10. A systematic error is indicated by the test using Samples 1 and 2, but not by the test using Samples 4 and 5. For Samples 1 and 2, the difference between precision and accuracy is large enough to indicate a bias inherent within the analytical method(s). Since the same methods and locations were tested in both cases, further evaluation is required to determine if a bias actually exists.

Table 34-9 Statistical test for the presence of systematic errors (using samples 1 and 2 only)

F-test for bias      F-critical for bias
16.53                9.27

Table 34-10 Statistical test for the presence of systematic errors (using samples 4 and 5 only)

F-test for bias      F-critical for bias
2.261                9.277

REFERENCES

1. Hinshaw, J.V., LC-GC 17(7), 616–625 (1999).
2. Mark, H. and Workman, J., Spectroscopy 2(2), 60–64 (1987).
3. Workman, J. and Mark, H., Spectroscopy 2(6), 58–60 (1987).
4. MathCad; MathSoft, Inc.: 101 Main Street, Cambridge, MA 02142; Vol. v. 7.0; (1997).
5. Mark, H. and Workman, J., Spectroscopy 10(1), 17–20 (1995).
6. Mark, H. and Workman, J., Spectroscopy 4(7), 53–54 (1989).
7. Youden, W.J. and Steiner, E.H., Statistical Manual of the AOAC, 1st ed. (Association of Official Analytical Chemists, Washington, DC, 1975).
8. Mark, H. and Workman, J., Statistics in Spectroscopy, 1st ed. (Academic Press, New York, 1991).


35

Collaborative Laboratory Studies: Part 2 – using ANOVA

This chapter describes the use of ANOVA in collaborative study work.

ANOVA TEST COMPARISONS FOR LABORATORIES AND METHODS (ANOVA_s4 WORKSHEET)

Analysis of Variance (ANOVA) is a useful tool for comparing sets of analytical results to determine if there is a statistically meaningful difference between results for a sample analyzed by different methods or at different locations by different analysts. The reader is referred to reference [1] and other basic books on statistical methods for discussions of the theory and applications of ANOVA; examples of such texts are [2, 3].

Table 35-1 illustrates the ANOVA results for each individual sample in our hypothetical study. This test indicates whether any of the reported results from the analytical methods or locations is significantly different from the others. From the table it can be observed that statistically significant variation in the reported analytical results is to be expected based on these data. However, there is no apparent pattern in the method or location most often varying from the others. Thus, this statistical test is inconclusive and further investigation is warranted.

Table 35-1 ANOVA: comparing methods and laboratories

No.        F-test for bias   F-critical for bias   Difference                                                             Bias
Sample 1   1.81              3.34                  —                                                                      No
Sample 2   1.21              3.34                  —                                                                      No
Sample 3   6.89              3.34                  METHOD B-LAB 1 + METHOD B-LAB 2 vs. METHOD A-LAB 1 + METHOD A-LAB 2    Yes
Sample 4   3.28              3.24                  METHOD A-LAB 1                                                         Yes
Sample 5   10.52             3.24                  METHOD B-LAB 1 + METHOD A-LAB 2 vs. METHOD B-LAB 2 + METHOD A-LAB 1    Yes
Sample 6   24.10             3.24                  METHOD B-LAB 2                                                         Yes
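The F statistic behind tables of this kind can be sketched with a one-way ANOVA in plain Python. This is an illustrative sketch, not the ANOVA_s4 worksheet: the function name `one_way_anova_f` is an assumption, and because the raw replicates behind Table 35-1 are not listed, the example compares only two groups, the Method A and Method B replicates for Sample X from Table 34-2. The critical value 5.99 is the standard tabulated 5% F value for (1, 6) degrees of freedom.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA over the groups."""
    all_values = [y for group in groups for y in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Between-group sum of squares: spread of the group means about the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Within-group sum of squares: spread of the replicates about their own group mean.
    ss_within = sum(
        (y - mean) ** 2
        for group, mean in zip(groups, group_means)
        for y in group
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Example input only: Sample X replicates from Table 34-2.
method_a = [3.366, 3.380, 3.360, 3.380]
method_b = [3.421, 3.407, 3.377, 3.400]

f_stat, df_between, df_within = one_way_anova_f([method_a, method_b])
F_CRITICAL = 5.99  # tabulated 5% value for (1, 6) degrees of freedom
bias_indicated = f_stat > F_CRITICAL
print(f"F = {f_stat:.2f} on ({df_between}, {df_within}) df; bias indicated: {bias_indicated}")
```

For a four-way comparison such as Table 35-1, the same function would be called with four replicate lists (one per method and laboratory combination) and the critical value looked up for the corresponding degrees of freedom.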


ANOVA test comparisons (using ANOVA_s2 worksheet)

Table 35-2 shows the ANOVA results comparing laboratories (i.e., different locations) performing the same METHOD B analytical procedure for analysis. This statistical test indicates that for the higher concentration spiked samples (i.e., Samples 5 and 6 at the 3.61 and 3.80% levels, respectively) a significant difference in reported average values occurred. However, Lab 1 was higher for Sample No. 5 and lower for Sample No. 6, so the results reported by the two labs show no consistent trend; this indicates that there is no systematic difference between labs using METHOD B.

Table 35-3 illustrates the ANOVA results comparing laboratories (i.e., different locations) performing the same METHOD A for analysis. This statistical test indicates that for the low- and mid-level concentration spiked samples (i.e., Samples 4 and 5 at the 3.40 and 3.61% levels, respectively) a significant difference in reported average values occurred. However, this trend did not continue for the highest concentration sample (i.e., Sample No. 6), with a concentration of 3.80%. Lab 1 was slightly lower in reported value for Samples 4 and 5. There is no significant systematic error observed between laboratories using METHOD A.

Table 35-4 reports ANOVA comparing the METHOD B procedure to the METHOD A procedure for combined laboratories. Thus the combined METHOD B analyses for each sample were compared to the combined METHOD A analyses for the same sample. This statistical test indicates whether there is a significant bias in the reported results for each method, irrespective of operator or location. An apparent trend is indicated using this statistical analysis, that trend being a positive bias for METHOD B as compared to METHOD A. Thus METHOD B would be expected to report a higher level of analyte than METHOD A.

Table 35-2 ANOVA: comparing laboratories for METHOD B (Lab 1 vs. Lab 2)

No.        Method     F-test for bias   F-critical for bias   Difference   Bias
Sample 1   METHOD B   0                 5.32                  —            No
Sample 2   METHOD B   0.98              5.32                  —            No
Sample 3   METHOD B   1.99              5.32                  —            No
Sample 4   METHOD B   0.0008            5.32                  —            No
Sample 5   METHOD B   8.14              5.32                  0.024        Yes
Sample 6   METHOD B   20.91             5.32                  −0.098       Yes

Table 35-3 ANOVA: comparing laboratories for METHOD A spectrophotometry (Lab 1 vs. Lab 2)

No.        Method     F-test for bias   F-critical for bias   Difference   Bias
Sample 1   METHOD A   1.10              5.99                  —            No
Sample 2   METHOD A   2.52              5.99                  —            No
Sample 3   METHOD A   1.18              5.99                  —            No
Sample 4   METHOD A   7.63              5.32                  −0.016       Yes
Sample 5   METHOD A   29.52             5.32                  −0.029       Yes
Sample 6   METHOD A   1.53              5.32                  —            No

Table 35-4 ANOVA: comparing methods for combined laboratories and operators, all Method B vs. all Method A

No.        Method comparison        F-test for bias   F-critical for bias   Difference   Bias
Sample 1   METHOD B vs. METHOD A    5.05              4.49                  0.024        Yes
Sample 2   METHOD B vs. METHOD A    1.93              4.49                  —            No
Sample 3   METHOD B vs. METHOD A    15.9              4.49                  0.041        Yes
Sample 4   METHOD B vs. METHOD A    7.06              4.41                  0.019        Yes
Sample 5   METHOD B vs. METHOD A    0.07              4.41                  —            No
Sample 6   METHOD B vs. METHOD A    11.44             4.41                  0.066        Yes

REFERENCES

1. Mark, H. and Workman, J., Statistics in Spectroscopy, 1st ed. (Academic Press, New York, 1991).
2. Draper, N. and Smith, H., Applied Regression Analysis (John Wiley & Sons, New York, 1981).
3. Zar, J.H., Biostatistical Analysis (Prentice Hall, Englewood Cliffs, NJ, 1974).


36

Collaborative Laboratory Studies: Part 3 – Testing for Systematic Error

TESTING FOR SYSTEMATIC ERROR IN A METHOD: COMPARISON TEST FOR A SET OF MEASUREMENTS VERSUS TRUE VALUE – SPIKED RECOVERY METHOD (COMPARET WORKSHEET)

The Student’s (W.S. Gosset) t-test is useful for comparisons of the means and standard deviations of different analytical test methods. Descriptions of the theory and use of this statistic are readily available in standard statistical texts including those in the references [1–6]. Use of this test will indicate whether the difference between a set of measurements and the true (known) value for those measurements is statistically meaningful.

In Table 36-1, METHOD B test results for each of the locations are compared to the known spiked analyte value for each sample. This statistical test indicates that METHOD B results are lower than the known analyte values for Sample No. 5 (Lab 1 and Lab 2) and Sample No. 6 (Lab 1). The METHOD B reported value is higher for Sample No. 6 (Lab 2). Average results for this test indicate that METHOD B may result in analytical values trending lower than actual values.

In Table 36-2, METHOD A results for each of the locations are compared to the known spiked analyte value for each sample. This statistical test indicates that METHOD A results are lower than the known analyte values for Sample Nos. 4–6 for both Lab 1 and Lab 2. Average results for this test indicate that METHOD A is consistently lower than actual values.

Table 36-1 Comparison of METHOD B test results to true value

Sample     Method–Location    t-test for bias   t-critical for bias   Difference   Bias
Sample 4   METHOD B–LAB 1     1.06              2.776                 —            No
Sample 4   METHOD B–LAB 2     0.76              2.776                 —            No
Sample 5   METHOD B–LAB 1     8.37              2.776                 −0.038       Yes
Sample 5   METHOD B–LAB 2     9.06              2.776                 −0.062       Yes
Sample 6   METHOD B–LAB 1     6.73              2.776                 −0.037       Yes
Sample 6   METHOD B–LAB 2     2.94              2.776                 0.061        Yes


Table 36-2 Comparison of METHOD A results to true value

Sample     Method–Location    t-test for bias   t-critical for bias   Difference   Bias
Sample 4   METHOD A–LAB 1     19.52             2.776                 −0.036       Yes
Sample 4   METHOD A–LAB 2     9.0               2.776                 −0.018       Yes
Sample 5   METHOD A–LAB 1     5.98              2.776                 −0.069       Yes
Sample 5   METHOD A–LAB 2     6.0               2.776                 −0.036       Yes
Sample 6   METHOD A–LAB 1     6.84              2.776                 −0.058       Yes
Sample 6   METHOD A–LAB 2     7.07              2.776                 −0.050       Yes
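The comparison of a set of replicates against a known true value can be sketched as a one-sample t-test. The code below is an illustration, not the COMPARET worksheet: the replicate values and the spike level are hypothetical, the function name `t_versus_true_value` is an assumption, and the critical value 2.776 is the one listed in Tables 36-1 and 36-2 (five replicates, four degrees of freedom).

```python
import math


def t_versus_true_value(replicates, true_value):
    """Return (t statistic, mean difference) for replicates against a known value."""
    n = len(replicates)
    mean = sum(replicates) / n
    s = math.sqrt(sum((y - mean) ** 2 for y in replicates) / (n - 1))
    standard_error = s / math.sqrt(n)
    difference = mean - true_value
    return abs(difference) / standard_error, difference


# Hypothetical replicates and hypothetical spiked true value (illustration only).
replicates = [3.36, 3.37, 3.35, 3.38, 3.36]
true_value = 3.40

t_stat, difference = t_versus_true_value(replicates, true_value)
T_CRITICAL = 2.776  # value used in Tables 36-1 and 36-2
bias_indicated = t_stat > T_CRITICAL
print(f"t = {t_stat:.2f}, difference = {difference:+.3f}, bias indicated: {bias_indicated}")
```

The sign of the difference shows the direction of the bias, which is how the Difference column in the tables above distinguishes low results from high ones.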

REFERENCES

1. MathCad; MathSoft, Inc.: 101 Main Street, Cambridge, MA 02142; Vol. v. 7.0 (1997).
2. Youden, W.J. and Steiner, E.H., Statistical Manual of the AOAC, 1st ed. (Association of Official Analytical Chemists, Washington, DC, 1975).
3. Mark, H. and Workman, J., Statistics in Spectroscopy, 1st ed. (Academic Press, New York, 1991).
4. Draper, N. and Smith, H., Applied Regression Analysis (John Wiley & Sons, New York, 1981).
5. Zar, J.H., Biostatistical Analysis (Prentice Hall, Englewood Cliffs, NJ, 1974).
6. Owen, D.B., Handbook of Statistical Tables (Addison-Wesley Publishing Co., Inc., Reading, MA, 1962).

37

Collaborative Laboratory Studies: Part 4 – Ranking Test

RANKING TEST FOR LABORATORIES AND METHODS (MANUAL COMPUTATIONS)

The ranking test for laboratories provides for the calculation of individual ranks for each laboratory or method using the averaged results collected for all replicates and all methods/locations. The summary of averaged analytical results discussed in this series is shown in Table 37-1a. These compiled results are assigned ranks by column from the largest to the smallest reported analytical values. The largest analytical result in each column receives a score of 1, whereas the smallest result receives the largest number. When two results in a column are identical, 0.5 is added to the rank number that both would otherwise share, and the subsequent rank number is not used. Note column 1 in Table 37-1a; both row 1 and row 2 have the identical value of 3.485 and are assigned 1.5 as rank score values. Rank 2 is not used due to the tie, and the lower analytical results are given ranks 3 and 4, respectively. The rows are summed, resulting in a rank score given as column #8 of Table 37-1b.

Table 37-1a Results table for ranking test

                        Sample 1   Sample 2   Sample 3   Sample 4   Sample 5   Sample 6
L1: METHOD B–LAB 1      3.485      3.467      3.356      3.391      3.571      3.763
L2: METHOD B–LAB 2      3.485      3.506      3.379      3.391      3.548      3.861
L3: METHOD A–LAB 1      3.468      3.542      3.324      3.364      3.541      3.741
L4: METHOD A–LAB 2      3.450      3.460      3.330      3.380      3.570      3.740

Table 37-1b Ranked results table

                        Sample 1   Sample 2   Sample 3   Sample 4   Sample 5   Sample 6   Score∗
L1: METHOD B–LAB 1      1.5        2          2          1.5        1          2          10
L2: METHOD B–LAB 2      1.5        1          1          1.5        3          1          9
L3: METHOD A–LAB 1      3          3          4          4          4          3          21
L4: METHOD A–LAB 2      4          4          3          3          2          4          20

∗ If an individual laboratory score is equal to or outside of the limit boundaries, then we conclude that there is a pronounced systematic error present between the laboratory, or laboratories, with the extreme score. In this particular case the limits are 8–22.


Table 37-1c Approximate 5% two-tail limits for laboratory ranking Scores (from Ref. [1])

No. of               Number of samples
locations/tests      3       4       5       6       7       8       9       10
3                    —       4–12    5–15    7–17    8–20    10–22   12–24   13–27
4                    —       4–16    6–19    8–22    10–25   12–28   14–31   16–34
5                    —       5–19    7–23    9–27    11–31   13–35   16–38   18–42
6                    3–18    5–23    7–28    10–32   12–37   15–41   18–45   21–49
7                    3–21    5–27    8–32    11–37   14–42   17–47   20–52   23–57
8                    3–24    6–30    9–36    12–42   15–48   18–54   22–59   25–65
9                    3–27    6–34    9–41    13–47   16–54   20–60   24–66   27–73
10                   4–29    7–37    10–45   14–52   17–60   21–67   26–73   30–80

The score values are compared to a statistical table of values found in reference [1]. This table is partially reproduced as Table 37-1c. If an individual laboratory score is equal to or outside of the limit boundaries, then we conclude that a pronounced systematic error is present for the laboratory, or laboratories, with the extreme score. In this particular case (four locations/tests and six samples) the limits are 8 to 22. The scores of 9, 10, 20, and 21 all fall inside these limits; therefore there is no significant systematic error in the methods as determined using this test.
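The manual ranking procedure can be sketched in Python as follows. This is only an illustration of the steps described above: the function name `rank_descending` is an assumption, the first part ranks the Sample 1 column of Table 37-1a (which contains the tie), and the second part sums the rank rows exactly as they are listed in Table 37-1b and compares the scores with the 8 to 22 limits quoted from Table 37-1c.

```python
def rank_descending(values):
    """Rank 1 = largest value; tied values share the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value equals the current one (a tie).
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = ((pos + 1) + (end + 1)) / 2  # average of the 1-based ranks in the run
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks


# Sample 1 column of Table 37-1a, rows L1 to L4.
sample_1 = [3.485, 3.485, 3.468, 3.450]
sample_1_ranks = rank_descending(sample_1)

# Rank rows as listed in Table 37-1b (L1 to L4, Samples 1 to 6).
rank_table = [
    [1.5, 2, 2, 1.5, 1, 2],
    [1.5, 1, 1, 1.5, 3, 1],
    [3, 3, 4, 4, 4, 3],
    [4, 4, 3, 3, 2, 4],
]
scores = [sum(row) for row in rank_table]

LOWER_LIMIT, UPPER_LIMIT = 8, 22  # Table 37-1c: 4 locations/tests, 6 samples
flagged = [score <= LOWER_LIMIT or score >= UPPER_LIMIT for score in scores]
print(sample_1_ranks, scores, flagged)
```

A score is flagged when it is equal to or outside a limit, matching the rule stated in the text; none of the four scores is flagged here.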

REFERENCE

1. Youden, W.J. and Steiner, E.H., Statistical Manual of the AOAC, 1st ed. (Association of Official Analytical Chemists, Washington, DC, 1975).

38

Collaborative Laboratory Studies: Part 5 – Efficient Comparison of Two Methods

COMPUTATIONS FOR EFFICIENT COMPARISON OF TWO METHODS (COMP_METH WORKSHEET)

The following section shows a statistical test (text for the Comp_Meth MathCad Worksheet) for the efficient comparison of two analytical methods. This test requires that replicate measurements be made on two different samples using two different analytical methods. The test will determine whether there is a significant difference in the precision and accuracy for the two methods. It will also determine whether there is significant systematic error between the methods, and calculate the magnitude of that error (as bias). This efficient statistical test requires the minimum data collection and analysis for the comparison of two methods. The experimental design for data collection has been shown graphically in Chapter 34 (Figure 34-2), with the numerical data for this test given in Table 38-1. Two methods are used to analyze two different samples, with approximately five replicate measurements per sample as shown graphically in the previously mentioned figure.

The analytical results can immediately be plotted using the Youden/Steiner two-sample graphic shown in Figure 38-1. This graphic gives a rapid method for visually determining if the reported analytical values contain systematic error. The presence of systematic error is indicated by the occurrence of two-sample plot points that are found in the lower left and upper right quadrants of the charts. The presence of points in these quadrants indicates that high analyte value samples are biased to the high end, and low analyte containing samples are biased to the low end. Analytical methods not exhibiting systematic (bias) errors should have two-sample plot points randomly distributed throughout all the quadrants of the chart. Figure 38-1 gives an indication that METHOD A has a negative bias, and that METHOD B is more random. However, the range of the axes is much smaller for Method A, indicating that the overall bias is quite small, and significantly less than that of Method B.

Table 38-1 Analytical data entry for comparison of two methods tests

          METHOD A                METHOD B
          Sample X    Sample Y    Sample X    Sample Y
          3.366       3.741       3.421       3.764
          3.380       3.740       3.407       3.860
          3.360       3.740       3.377       3.742
          3.380       3.760       3.400       3.833
Mean      3.372       3.745       3.401       3.800

[Figure 38-1: two two-sample scatter charts. Left panel, METHOD A: AY plotted against AX, with reference lines at mean(AX) and mean(AY). Right panel, METHOD B: BY plotted against BX, with reference lines at mean(BX) and mean(BY).]

Figure 38-1 Two-sample charts illustrating systematic errors for Methods A vs. B.

The calculations for the efficient two-method comparison are shown in Table 38-2 and the equations that follow. The mathematical expressions are given in MathCad symbolic notation, showing that the difference is taken for each replicate set of X and Y and the mean is computed. Then the sum for each replicate set of X and Y is calculated and the mean is computed. The difference between the sums for the two methods is computed as an absolute value (as d), and the differences are summed (as Σd). The mean difference is calculated as mean(d).

Table 38-2 Calculations for comparison tests

METHOD A:
ADxy := |AX − AY|     mean(ADxy) = 0.374
ATxy := (AX + AY)     mean(ATxy) = 7.117

METHOD B:
BDxy := |BX − BY|     mean(BDxy) = 0.399
BTxy := (BX + BY)     mean(BTxy) = 7.201

d := |ATxy − BTxy|    Σd = 0.337
Mean Difference: mean(d) = 0.084
d2 := BTxy − ATxy

Each X and Y result contains the systematic error of the analytical method for its respective laboratory, and the systematic error is assumed to be identical for X and Y for each method. When the difference between X and Y is calculated (ADxy and BDxy in Table 38-2), the systematic error drops out, so that the difference between X and Y contains no systematic errors, only random errors. We then estimate the precision by using the difference quantities. The difference between the measured concentrations of X and Y thus represents the true analyte difference between X and Y without the systematic error, but with the random errors.

The relative precision between the two methods is calculated using Table 38-2 and equations 38-1 and 38-2. The F-statistic used to compare the sizes of the Method A vs. Method B precision values is given by equation 38-5 and is compared to the F-statistic table value (equation 38-7). The null hypothesis (Ho) states that there is no difference in the precision of the two methods, whereas the alternate hypothesis (Ha) states that there is a difference in the precision. For the methods compared in this study the precision value for METHOD B is significantly larger than that for METHOD A. Method A precision is 0.007, whereas Method B precision is 0.037, representing an increase by a factor of 5.3.

When summing the X and Y values, the systematic contribution is found twice. The two used in the denominator is indicative of the error contribution from each independent set of results (i.e., X and Y). Given independent random errors only, the standard deviation of the sum of two measurements X and Y would be identical to the standard deviation of the differences between the two measurements X and Y. In the absence of any systematic error, Sr² and Sd² estimate the same variance. In the presence of systematic error, Sd² is large compared to Sr². The larger the Sd², the greater is the systematic error contribution.

The relative systematic error between the two methods is calculated using Table 38-2 and equations 38-3 and 38-4. The F-statistic used to compare the sizes of the Method A vs. Method B systematic error values is given by equation 38-6 and is compared to the F-statistic table value (equation 38-7). The null hypothesis (Ho) states that there is no difference in the systematic error found in the two methods, whereas the alternate hypothesis (Ha) states that there is a difference in the size of the systematic error. For the methods compared in this study there is a significantly larger systematic error for METHOD B as compared to METHOD A.

The test to determine whether the bias is significant incorporates the Student’s t-test. The method for calculating the t-test statistic is shown in equation 38-10 using MathCad symbolic notation. Equation 38-8 is used to calculate the standard deviation of the differences between the sums of X and Y for the two analytical methods A and B, whereas equation 38-9 is used to calculate the standard deviation of the mean. The t-table statistic for comparison with the test statistic is given in equations 38-11 and 38-12. The F-statistic and t-statistic tables can be found in standard statistical texts such as references [1–3]. The null hypothesis (Ho) states that there is no systematic difference between the two methods, whereas the alternate hypothesis (Ha) states that there is a significant systematic difference between the methods. It can be seen from these results that the bias between these two methods is significant and that METHOD B has results biased by 0.084 above the results obtained by METHOD A. The estimated bias is given by the Mean Difference calculation.
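The visual reading of the two-sample chart can be approximated numerically by counting how many points fall in the lower left or upper right quadrant, taking the sample means as the quadrant boundaries. The sketch below is an illustration built on that assumption, not part of the Comp_Meth worksheet; the function name `concordant_points` is hypothetical and the data are those of Table 38-1.

```python
def concordant_points(x_values, y_values):
    """Count points in the lower left or upper right quadrant about the means."""
    mean_x = sum(x_values) / len(x_values)
    mean_y = sum(y_values) / len(y_values)
    # A point is concordant when X and Y deviate from their means in the same direction.
    return sum(
        1
        for x, y in zip(x_values, y_values)
        if (x - mean_x) * (y - mean_y) > 0
    )


# Table 38-1 data.
AX = [3.366, 3.380, 3.360, 3.380]
AY = [3.741, 3.740, 3.740, 3.760]
BX = [3.421, 3.407, 3.377, 3.400]
BY = [3.764, 3.860, 3.742, 3.833]

a_concordant = concordant_points(AX, AY)
b_concordant = concordant_points(BX, BY)
print(f"Method A: {a_concordant} of {len(AX)} points; Method B: {b_concordant} of {len(BX)} points")
```

With only four points per method such a count is a rough guide at best, which is why the F and t calculations that follow carry the actual test.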

Measuring the Precision and Standard Deviation of the Methods (Youden/Steiner)

Note that for the calculations of precision and standard deviation (equations 38-1 through 38-4), the denominator expression is given as 2(n − 1). This expression is used due to the 2 times error contribution from independent errors found in each independent set (i.e., X and Y) of results.


Precision (Sr)

$$ASr := \sqrt{\frac{1}{2\cdot(nY-1)}\cdot\sum\overrightarrow{\left(ADxy-\operatorname{mean}(ADxy)\right)^2}} \qquad \text{(38-1)}$$

ASr = 6.692658 · 10^−3

$$BSr := \sqrt{\frac{1}{2\cdot(nY-1)}\cdot\sum\overrightarrow{\left(BDxy-\operatorname{mean}(BDxy)\right)^2}} \qquad \text{(38-2)}$$

BSr = 0.037334

Standard deviation (Sd)

$$ASd := \sqrt{\frac{1}{2\cdot(nY-1)}\cdot\sum\overrightarrow{\left(ATxy-\operatorname{mean}(ATxy)\right)^2}} \qquad \text{(38-3)}$$

ASd = 0.012428

$$BSd := \sqrt{\frac{1}{2\cdot(nY-1)}\cdot\sum\overrightarrow{\left(BTxy-\operatorname{mean}(BTxy)\right)^2}} \qquad \text{(38-4)}$$

BSd = 0.045387

F-statistic calculation (Fs) for precision ratio

Sr² Ratio:

$$PFs := \frac{BSr^2}{ASr^2} \qquad \text{(38-5)}$$

PFs = 31.118

Ho: If Fs is less than or equal to Ft, then there is NO DIFFERENCE in Precision estimation.
Ha: If Fs is greater than Ft, then there is a DIFFERENCE in Precision estimation.

F-statistic calculation (Fs) for presence of systematic errors

Sd² Ratio:

$$SFs := \frac{BSd^2}{ASd^2} \qquad \text{(38-6)}$$

SFs = 13.337


Ho: If Fs is less than or equal to Ft, then there is NO DIFFERENCE in systematic error.
Ha: If Fs is greater than Ft, then there is a DIFFERENCE in systematic error.

F-statistic table value (Ft)

df1 := nY − 1
df1 = 3

$$qF(0.95,\ df1,\ df1) = 9.277 \qquad \text{(38-7)}$$

Student’s t-test for the difference in the biases between two methods

Mean Difference: mean�d� = 0�084

� �� � �−−−−−−−−−−−−−−−−−−−−−−→ � 1 � 2 s �= · �d2 − mean�d� � �df 1 �

(38-8)

s = 0�053

s

sm �= √ nY

(38-9)

sm = 0�026 Calculate t-test statistic: Te �=

mean�d� sm

Te = 3�201

(38-10)

Enter alpha value as a2: �2 �= �95

Calculate t-table value:

�1 �=

�2 + 1

2

(38-11)

�1 = 0�975 t �= qt��1 � df 1 � t-table value� t = 3�182

(38-12)


Ho: If Te is less than or equal to the t-table value, then there is NO SYSTEMATIC DIFFERENCE between method results.

Ha: If Te is greater than t, then there is a SYSTEMATIC DIFFERENCE (BIAS) between method results.

SUMMARY

This set of articles presents the computational details and actual values for each of the statistical methods shown for collaborative tests. These methods include precision and estimated accuracy comparisons, ANOVA tests, Student’s t-testing, the Rank Test for Method Comparison, and the Efficient Comparison of Methods tests. From these statistical tests the following conclusions can be derived:

1. Both analytical methods are quite precise and accurate; therefore the production samples are below the target-value concentration.
2. The precision estimate (standard deviation) for METHOD B is significantly larger than that for METHOD A, indicating that METHOD A is more precise than METHOD B.
3. There is no correlation of analytical error with concentration over the range tested (i.e., 3.40–3.80% analyte).
4. Analytical results comparing METHOD B and METHOD A will show significant variation due to the high precision of both analytical methods.
5. There is no operator/laboratory bias between labs for METHOD B.
6. There is no operator/laboratory bias between labs for METHOD A.
7. There is a significant bias between METHOD B and METHOD A; METHOD B yields higher results.
8. Both METHOD B and METHOD A results trend lower than actual values, but by small quantities (approximately −0.04% at the target value of 3.60%).
9. The laboratory ranking test did not show any laboratory or method outside of confidence limits; therefore neither method nor laboratory is consistently high or low in reported results.
10. METHOD B precision is a factor of 5.3 times greater than that of METHOD A.
11. The systematic error contribution is larger for METHOD B than for METHOD A.
12. METHOD B is biased by +0.084 as compared to METHOD A.

ACKNOWLEDGEMENT

The real analytical data used for Chapters 34–38 were graciously provided by Dan Devine of Kimberly-Clark Analytical Science & Technology.

REFERENCES

1. MathCad; MathSoft, Inc.: 101 Main Street, Cambridge, MA 02142 (1997).
2. Mark, H. and Workman, J., Spectroscopy 10(1), 17–20 (1995).
3. Mark, H. and Workman, J., Spectroscopy 4(7), 53–54 (1989).

39

Collaborative Laboratory Studies: Part 6 – MathCad Worksheet Text

The MathCad worksheets used for this Chemometrics in Spectroscopy collaborative study series are given below in hard copy format. Unless otherwise noted, the worksheets have been written by the authors. The text files for the MathCad v7.0 worksheets used for the statistical tests in this report are attached as Collabor_GM, Collabor_TV, ANOVA_s4, ANOVA_s2, CompareT, and Comp_Meth. References [1–11] are excellent sources of information on the details of these statistical methods.

Collabor_GM

Collaborative Test Worksheet
-------------------------

RAW DATA ENTRY (values as printed):

X01: 3.51 3.46 3.47 3.50 3.49
X02: 3.51 3.50 3.50 3.47 3.45
X03: 3.46 3.44 3.46 3.52 3.46
X04: 3.46 3.44 3.45

X05: 3.48 3.45 3.46 3.46 3.48
X06: 3.50 3.66 3.47 3.45 3.45
X07: 3.45 3.45 3.46 3.46 3.46
X08: 3.46 3.47 3.45

X09: 3.37 3.36 3.35 3.35 3.35
X10: 3.37 3.33 3.39 3.43 3.38
X11: 3.32 3.33 3.33 3.32 3.32
X12: 3.34 3.32 3.34

Mean values for Data:

n01 := rows(X01), n02 := rows(X02), ..., n12 := rows(X12)

mean(X01) = 3.485    mean(X02) = 3.485    mean(X03) = 3.468    mean(X04) = 3.45
mean(X05) = 3.467    mean(X06) = 3.506    mean(X07) = 3.452    mean(X08) = 3.46
mean(X09) = 3.356    mean(X10) = 3.379    mean(X11) = 3.324    mean(X12) = 3.3303


--------------------------------------------------------

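GRAND MEANS FOR EACH ROW (USE IF NO “TRUE VALUE” IS AVAILABLE):

$$\mathrm{GM1} := \frac{\mathrm{mean}(X01) + \mathrm{mean}(X02) + \mathrm{mean}(X03) + \mathrm{mean}(X04)}{4}$$

$$\mathrm{GM2} := \frac{\mathrm{mean}(X05) + \mathrm{mean}(X06) + \mathrm{mean}(X07) + \mathrm{mean}(X08)}{4}$$

$$\mathrm{GM3} := \frac{\mathrm{mean}(X09) + \mathrm{mean}(X10) + \mathrm{mean}(X11) + \mathrm{mean}(X12)}{4}$$

GRAND MEANS FOR EACH ROW:

GM1 = 3.472    GM2 = 3.47115    GM3 = 3.347433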

COMPUTATIONS FOR PRECISION AND ACCURACY:

Precision:

$$\mathrm{SDp}(X01) := \sqrt{\sum \frac{1}{\mathrm{n01} - 1}\,\bigl(X01 - \mathrm{mean}(X01)\bigr)^{2}}$$

The same definition is applied to X02 through X12, using n02 through n12.

SDp(X01) = 0.02           SDp(X02) = 0.025
SDp(X03) = 8.888 · 10^-3  SDp(X04) = 8.888 · 10^-3
SDp(X05) = 0.013          SDp(X06) = 0.088
SDp(X07) = 6.557 · 10^-3  SDp(X08) = 0.01
SDp(X09) = 7.918 · 10^-3  SDp(X10) = 0.037
SDp(X11) = 6.812 · 10^-3  SDp(X12) = 0.012

Accuracy:

$$\mathrm{SDa}(X01) := \sqrt{\sum \frac{1}{\mathrm{n01} - 1}\,\bigl(X01 - \mathrm{GM1}\bigr)^{2}}$$

The same definition is applied to X02 through X04 (with GM1), to X05 through X08 (with GM2), and to X09 through X12 (with GM3), using the corresponding n02 through n12.

SDa(X01) = 0.025    SDa(X02) = 0.029
SDa(X03) = 0.029    SDa(X04) = 0.029
SDa(X05) = 0.014    SDa(X06) = 0.096
SDa(X07) = 0.031    SDa(X08) = 0.017
SDa(X09) = 0.012    SDa(X10) = 0.051
SDa(X11) = 0.037    SDa(X12) = 0.024

Pooled Standard Deviations (As Precision):

Row 1:

$$\mathrm{SpR1} := \sqrt{\frac{\sum (X01 - \mathrm{mean}(X01))^{2} + \sum (X02 - \mathrm{mean}(X02))^{2} + \sum (X03 - \mathrm{mean}(X03))^{2} + \sum (X04 - \mathrm{mean}(X04))^{2}}{\mathrm{n01} + \mathrm{n02} + \mathrm{n03} + \mathrm{n04} - 4}}$$

SpR1 = 0.0231474

Row 2:

$$\mathrm{SpR2} := \sqrt{\frac{\sum (X05 - \mathrm{mean}(X05))^{2} + \sum (X06 - \mathrm{mean}(X06))^{2} + \sum (X07 - \mathrm{mean}(X07))^{2} + \sum (X08 - \mathrm{mean}(X08))^{2}}{\mathrm{n05} + \mathrm{n06} + \mathrm{n07} + \mathrm{n08} - 4}}$$

SpR2 = 0.0478817

Row 3:

$$\mathrm{SpR3} := \sqrt{\frac{\sum (X09 - \mathrm{mean}(X09))^{2} + \sum (X10 - \mathrm{mean}(X10))^{2} + \sum (X11 - \mathrm{mean}(X11))^{2} + \sum (X12 - \mathrm{mean}(X12))^{2}}{\mathrm{n09} + \mathrm{n10} + \mathrm{n11} + \mathrm{n12} - 4}}$$

SpR3 = 0.021

Pooled Standard Deviations (As Accuracy):

Row 1:

$$\mathrm{SpR1} := \sqrt{\frac{\sum (X01 - \mathrm{GM1})^{2} + \sum (X02 - \mathrm{GM1})^{2} + \sum (X03 - \mathrm{GM1})^{2} + \sum (X04 - \mathrm{GM1})^{2}}{\mathrm{n01} + \mathrm{n02} + \mathrm{n03} + \mathrm{n04} - 4}}$$

SpR1 = 0.0277715

Row 2:

$$\mathrm{SpR2} := \sqrt{\frac{\sum (X05 - \mathrm{GM2})^{2} + \sum (X06 - \mathrm{GM2})^{2} + \sum (X07 - \mathrm{GM2})^{2} + \sum (X08 - \mathrm{GM2})^{2}}{\mathrm{n05} + \mathrm{n06} + \mathrm{n07} + \mathrm{n08} - 4}}$$

SpR2 = 0.0537719

Row 3:

$$\mathrm{SpR3} := \sqrt{\frac{\sum (X09 - \mathrm{GM3})^{2} + \sum (X10 - \mathrm{GM3})^{2} + \sum (X11 - \mathrm{GM3})^{2} + \sum (X12 - \mathrm{GM3})^{2}}{\mathrm{n09} + \mathrm{n10} + \mathrm{n11} + \mathrm{n12} - 4}}$$

SpR3 = 0.033


Measuring Precision without Duplicates (Youden/Steiner):
------------------------------------------------

RAW DATA ENTRY

(Enter single Determinations for Sample X from different laboratories or operators):

Sample X    LAB #1    LAB #2    LAB #3    LAB #4
X :=        3.51      3.51      3.46      3.46

nX := rows(X)    mean(X) = 3.484

(Enter single Determinations for Sample Y from different laboratories or operators):

Sample Y    LAB #1    LAB #2    LAB #3    LAB #4
Y :=        3.48      3.50      3.45      3.46

nY := rows(Y)    mean(Y) = 3.47

[Figure: Two-sample chart illustrating systematic errors; Y and mean(Y) plotted against X and mean(X)]


CALCULATIONS:

Dxy := (X − Y)    Txy := (X + Y)

mean(Dxy) = 0.014    mean(Txy) = 6.955

Precision (Sr):

$$\mathrm{Sr} := \sqrt{\sum \frac{1}{2\,(\mathrm{nY} - 1)}\,\bigl(\mathrm{Dxy} - \mathrm{mean}(\mathrm{Dxy})\bigr)^{2}}$$

Sr = 8.276473 · 10^-3

Measuring the Standard Deviation of the Data (Youden/Steiner):
-----------------------------------------------------

Standard Deviation (Sd):

$$\mathrm{Sd} := \sqrt{\sum \frac{1}{2\,(\mathrm{nY} - 1)}\,\bigl(\mathrm{Txy} - \mathrm{mean}(\mathrm{Txy})\bigr)^{2}}$$

Sd = 0.033653

Statistical Test for presence of systematic errors (Youden/Steiner):
------------------------------------------------------

F-statistic Calculation (Fs):

$$\mathrm{Fs} := \frac{\mathrm{Sd}^{2}}{\mathrm{Sr}^{2}}$$

Fs = 16.533

F-statistic Table Value (Ft):

df1 := nY − 1    df1 = 3    qF(0.95, df1, df1) = 9.277


Test Criteria:

If Fs is less than or equal to Ft, then there is NO SYSTEMATIC ERROR.

If Fs is greater than Ft, then there is SYSTEMATIC ERROR (BIAS).

Standard Deviation estimate for the distribution of systematic errors (Sb2):

$$\mathrm{Sb2} := \frac{\mathrm{Sd}^{2} - \mathrm{Sr}^{2}}{2}$$

Sb2 = 5.32 · 10^-4
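The Youden/Steiner calculations above can be sketched in a few lines of Python. This is only an illustration, not the authors' worksheet: the function name `youden_steiner` is hypothetical, the table value 9.277 is copied from the worksheet rather than computed, and the input values are the two-decimal figures printed in the data entry, so the results differ slightly from the worksheet values (Sr = 8.276473 · 10^-3, Fs = 16.533), which appear to have been computed from unrounded data.

```python
import math

def youden_steiner(x, y):
    """Return (Sr, Sd, Fs, Sb2) for paired single determinations x and y."""
    n = len(x)
    dxy = [a - b for a, b in zip(x, y)]   # differences, Dxy := X - Y
    txy = [a + b for a, b in zip(x, y)]   # totals, Txy := X + Y
    d_mean = sum(dxy) / n
    t_mean = sum(txy) / n
    # Both variances use the 2*(n - 1) denominator of the worksheet
    sr2 = sum((v - d_mean) ** 2 for v in dxy) / (2 * (n - 1))
    sd2 = sum((v - t_mean) ** 2 for v in txy) / (2 * (n - 1))
    fs = sd2 / sr2            # F-statistic for presence of systematic error
    sb2 = (sd2 - sr2) / 2     # Sb2 as defined in the worksheet
    return math.sqrt(sr2), math.sqrt(sd2), fs, sb2

# Sample X and Sample Y as printed (rounded to two decimals)
x = [3.51, 3.51, 3.46, 3.46]
y = [3.48, 3.50, 3.45, 3.46]
F_TABLE = 9.277               # qF(0.95, 3, 3), copied from the worksheet

sr, sd, fs, sb2 = youden_steiner(x, y)
print(f"Sr = {sr:.6f}  Sd = {sd:.6f}  Fs = {fs:.3f}  Sb2 = {sb2:.3e}")
print("SYSTEMATIC ERROR (BIAS)" if fs > F_TABLE else "NO SYSTEMATIC ERROR")
```

With the rounded inputs the F-statistic still exceeds the table value, so the conclusion (systematic error present) is the same as in the worksheet.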


Collabor_TV

Collaborative Test Worksheet
-------------------------

RAW DATA ENTRY:

X01: 3.42 3.38 3.40 3.38 3.38
X02: 3.41 3.40 3.42 3.35 3.38
X03: 3.37 3.36 3.36 3.36 3.37
X04: 3.38 3.38 3.38 3.38 3.38

X05: 3.56 3.57 3.56 3.58 3.59
X06: 3.54 3.55 3.57 3.53 3.54
X07: 3.54 3.54 3.54 3.54 3.54
X08: 3.56 3.58 3.59 3.58 3.56

X09: 3.76 3.74 3.77 3.77 3.77
X10: 3.86 3.83 3.93 3.87 3.81
X11: 3.74 3.74 3.74 3.74 3.74
X12: 3.74 3.76 3.73 3.77 3.75

Mean Values for Data Rows:

n01 := rows(X01), n02 := rows(X02), ..., n12 := rows(X12)

mean(X01) = 3.391    mean(X02) = 3.391    mean(X03) = 3.364    mean(X04) = 3.38
mean(X05) = 3.571    mean(X06) = 3.548    mean(X07) = 3.541    mean(X08) = 3.574
mean(X09) = 3.763    mean(X10) = 3.861    mean(X11) = 3.741    mean(X12) = 3.75

----------------------------------------------------------

ENTER TRUE VALUES FOR EACH ROW (SPIKED RECOVERY SAMPLES):

TR1 := 3.40    TR2 := 3.61    TR3 := 3.80

COMPUTATIONS FOR PRECISION AND ACCURACY:

Precision:

$$\mathrm{SDp}(X01) := \sqrt{\sum \frac{1}{\mathrm{n01} - 1}\,\bigl(X01 - \mathrm{mean}(X01)\bigr)^{2}}$$

The same definition is applied to X02 through X12, using n02 through n12.

SDp(X01) = 0.019          SDp(X02) = 0.025
SDp(X03) = 0              SDp(X04) = 0
SDp(X05) = 0.01           SDp(X06) = 0.015
SDp(X07) = 2.588 · 10^-3  SDp(X08) = 0.013
SDp(X09) = 0.012          SDp(X10) = 0.047
SDp(X11) = 1.924 · 10^-3  SDp(X12) = 0.016

Accuracy:

$$\mathrm{SDa}(X01) := \sqrt{\sum \frac{1}{\mathrm{n01} - 1}\,\bigl(X01 - \mathrm{TR1}\bigr)^{2}}$$

The same definition is applied to X02 through X04 (with TR1), to X05 through X08 (with TR2), and to X09 through X12 (with TR3), using the corresponding n02 through n12.

SDa(X01) = 0.022    SDa(X02) = 0.027
SDa(X03) = 0.041    SDa(X04) = 0.022
SDa(X05) = 0.044    SDa(X06) = 0.071
SDa(X07) = 0.077    SDa(X08) = 0.042
SDa(X09) = 0.043    SDa(X10) = 0.083
SDa(X11) = 0.066    SDa(X12) = 0.058


Pooled Standard Deviations (As Precision):

Row 1:

$$\mathrm{SpR1} := \sqrt{\frac{\sum (X01 - \mathrm{mean}(X01))^{2} + \sum (X02 - \mathrm{mean}(X02))^{2} + \sum (X03 - \mathrm{mean}(X03))^{2} + \sum (X04 - \mathrm{mean}(X04))^{2}}{\mathrm{n01} + \mathrm{n02} + \mathrm{n03} + \mathrm{n04} - 4}}$$

SpR1 = 0.0159961

Row 2:

$$\mathrm{SpR2} := \sqrt{\frac{\sum (X05 - \mathrm{mean}(X05))^{2} + \sum (X06 - \mathrm{mean}(X06))^{2} + \sum (X07 - \mathrm{mean}(X07))^{2} + \sum (X08 - \mathrm{mean}(X08))^{2}}{\mathrm{n05} + \mathrm{n06} + \mathrm{n07} + \mathrm{n08} - 4}}$$

SpR2 = 0.0114967

Row 3:

$$\mathrm{SpR3} := \sqrt{\frac{\sum (X09 - \mathrm{mean}(X09))^{2} + \sum (X10 - \mathrm{mean}(X10))^{2} + \sum (X11 - \mathrm{mean}(X11))^{2} + \sum (X12 - \mathrm{mean}(X12))^{2}}{\mathrm{n09} + \mathrm{n10} + \mathrm{n11} + \mathrm{n12} - 4}}$$

SpR3 = 0.025

Pooled Standard Deviations (As Accuracy):

Row 1:

$$\mathrm{SpR1} := \sqrt{\frac{\sum (X01 - \mathrm{TR1})^{2} + \sum (X02 - \mathrm{TR1})^{2} + \sum (X03 - \mathrm{TR1})^{2} + \sum (X04 - \mathrm{TR1})^{2}}{\mathrm{n01} + \mathrm{n02} + \mathrm{n03} + \mathrm{n04} - 4}}$$

SpR1 = 0.0289623

Row 2:

$$\mathrm{SpR2} := \sqrt{\frac{\sum (X05 - \mathrm{TR2})^{2} + \sum (X06 - \mathrm{TR2})^{2} + \sum (X07 - \mathrm{TR2})^{2} + \sum (X08 - \mathrm{TR2})^{2}}{\mathrm{n05} + \mathrm{n06} + \mathrm{n07} + \mathrm{n08} - 4}}$$

SpR2 = 0.0608954

Row 3:

$$\mathrm{SpR3} := \sqrt{\frac{\sum (X09 - \mathrm{TR3})^{2} + \sum (X10 - \mathrm{TR3})^{2} + \sum (X11 - \mathrm{TR3})^{2} + \sum (X12 - \mathrm{TR3})^{2}}{\mathrm{n09} + \mathrm{n10} + \mathrm{n11} + \mathrm{n12} - 4}}$$

SpR3 = 0.064
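As a rough check of the pooled precision formula, the sketch below pools the within-group sums of squares and divides by the pooled degrees of freedom. The helper name `pooled_sd` is hypothetical. The input is an assumption: it uses the three-decimal values listed in the ANOVA_s4 worksheet, whose group means agree with the Row 1 means of this worksheet, on the supposition that they are the same Row 1 measurements before rounding.

```python
import math

def pooled_sd(groups):
    """Pooled standard deviation: sqrt(sum of within-group SS / (N - k))."""
    ss = 0.0
    dof = 0
    for g in groups:
        m = sum(g) / len(g)
        ss += sum((v - m) ** 2 for v in g)   # within-group sum of squares
        dof += len(g) - 1                    # contributes n_i - 1 degrees of freedom
    return math.sqrt(ss / dof)

# Assumed Row 1 data (three-decimal values from the ANOVA_s4 worksheet)
row1 = [
    [3.421, 3.377, 3.399, 3.379, 3.379],
    [3.407, 3.400, 3.417, 3.353, 3.380],
    [3.366, 3.360, 3.361, 3.362, 3.370],
    [3.380, 3.380, 3.380, 3.380, 3.380],
]
sp_r1 = pooled_sd(row1)
print(f"SpR1 = {sp_r1:.7f}")
```

Under that assumption the result agrees with the worksheet value for the Row 1 pooled precision, SpR1 = 0.0159961.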

Measuring Precision without Duplicates (Youden/Steiner):
-----------------------------------------------

RAW DATA ENTRY

(Enter single Determinations for Sample X from different laboratories or operators):

Sample X    LAB #1    LAB #2    LAB #3    LAB #4
X :=        3.42      3.41      3.37      3.38

nX := rows(X)    mean(X) = 3.394

(Enter single Determinations for Sample Y from different laboratories or operators):

Sample Y    LAB #1    LAB #2    LAB #3    LAB #4
Y :=        3.56      3.54      3.54      3.56

nY := rows(Y)    mean(Y) = 3.551

CALCULATIONS:

Dxy := (X − Y)    Txy := (X + Y)

mean(Dxy) = −0.157    mean(Txy) = 6.944

[Figure: Two-sample chart illustrating systematic errors; Y and mean(Y) plotted against X and mean(X)]

Precision (Sr):

$$\mathrm{Sr} := \sqrt{\sum \frac{1}{2\,(\mathrm{nY} - 1)}\,\bigl(\mathrm{Dxy} - \mathrm{mean}(\mathrm{Dxy})\bigr)^{2}}$$

Sr = 0.015805

Measuring the Standard Deviation of the Data (Youden/Steiner):
-----------------------------------------------------

Standard Deviation (Sd):

$$\mathrm{Sd} := \sqrt{\sum \frac{1}{2\,(\mathrm{nY} - 1)}\,\bigl(\mathrm{Txy} - \mathrm{mean}(\mathrm{Txy})\bigr)^{2}}$$

Sd = 0.023765

Statistical Test for presence of systematic errors (Youden/Steiner):
------------------------------------------------------

F-statistic Calculation (Fs):

$$\mathrm{Fs} := \frac{\mathrm{Sd}^{2}}{\mathrm{Sr}^{2}}$$

Fs = 2.261


F-statistic Table Value (Ft):

df1 := nY − 1    df1 = 3    qF(0.95, df1, df1) = 9.277

If Fs is less than or equal to Ft, then there is NO SYSTEMATIC ERROR.

If Fs is greater than Ft, then there is SYSTEMATIC ERROR (BIAS).

Standard Deviation estimate for the distribution of systematic errors (Sb2):

$$\mathrm{Sb2} := \frac{\mathrm{Sd}^{2} - \mathrm{Sr}^{2}}{2}$$

Sb2 = 1.575 · 10^-4


ANOVA_s4

ANOVA (Analysis of Variance) Test
-------------------------------------------------------

This Worksheet demonstrates using Mathcad’s F distribution function and programming operators to conduct an analysis of variance (ANOVA) test.

Enter sample data used in test:

An element of D represents the data collected with a particular factor.

Data Entry:

D0       D1       D2       D3
3.421    3.407    3.366    3.380
3.377    3.400    3.360    3.380
3.399    3.417    3.361    3.380
3.379    3.353    3.362    3.380
3.379    3.380    3.370    3.380

Enter level of significance α: α := 0.05


Program for conducting ANOVA test:

ANOVA(D, α) :=
    n_total ← 0
    SX ← 0
    SX2 ← 0
    T ← 0
    for i ∈ 0 .. last(D)
        SD_i ← Σ D_i
        nD_i ← length(D_i)
        SX ← SX + SD_i
        SX2 ← SX2 + D_i · D_i
        T ← T + (SD_i)² / nD_i
        n_total ← n_total + nD_i
    SS_factor ← T − SX² / n_total
    SS_error ← SX2 − T
    SS_total ← SX2 − SX² / n_total
    df_factor ← length(D) − 1
    df_error ← n_total − length(D)
    df_total ← n_total − 1
    Analysis_0 ← | SS_factor   df_factor   SS_factor / df_factor |
                 | SS_error    df_error    SS_error / df_error   |
                 | SS_total    df_total    0                     |
    Analysis_1 ← (Analysis_0)[0,2] / (Analysis_0)[1,2]
    Analysis_2 ← qF(1 − α, df_factor, df_error)
    Analysis_3 ← Analysis_1 < Analysis_2
    Analysis


Calculate Mean Values:

mean(D0) = 3.391    mean(D1) = 3.3914    mean(D2) = 3.3638    mean(D3) = 3.38

Conducting an analysis of variance:

For a given set of grouped data D and level of significance α:

ANOVA(D, α) = [ {3,3}   3.281   3.239   0 ]   (displayed as a column vector; {3,3} denotes the nested 3 × 3 ANOVA table)

The ANOVA table, ANOVA(D, α)_0:

                  SS              df    MS
Between Groups    2.519 · 10^-3    3    8.396 · 10^-4
Within Groups     4.094 · 10^-3   16    2.559 · 10^-4
Total             6.613 · 10^-3   19    0

The Calculated F statistic: ANOVA(D, α)_1 = 3.281485

The critical F Statistic: ANOVA(D, α)_2 = 3.238872

The hypothesis test conclusion at the specified level of significance: ANOVA(D, α)_3 = 0

0 = reject hypothesis – there is a significant difference
1 = accept hypothesis – there is not a significant difference
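For readers without MathCad, the one-way ANOVA above can be sketched in Python. This is an illustrative reimplementation, not the worksheet program: the function name `one_way_anova` is hypothetical, the sums of squares are computed from deviations about the means rather than from the running totals used in the worksheet (the two are algebraically equivalent), and the critical F value is copied from the worksheet output rather than computed.

```python
def one_way_anova(groups):
    """Return (SS_factor, SS_error, df_factor, df_error, F) for grouped data."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-groups and within-groups sums of squares
    ss_factor = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_error = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_factor = len(groups) - 1
    df_error = n_total - len(groups)
    f_stat = (ss_factor / df_factor) / (ss_error / df_error)
    return ss_factor, ss_error, df_factor, df_error, f_stat

# Data D0..D3 from the ANOVA_s4 worksheet
D = [
    [3.421, 3.377, 3.399, 3.379, 3.379],
    [3.407, 3.400, 3.417, 3.353, 3.380],
    [3.366, 3.360, 3.361, 3.362, 3.370],
    [3.380, 3.380, 3.380, 3.380, 3.380],
]
F_CRIT = 3.238872   # critical F at alpha = 0.05 with (3, 16) df, copied from the worksheet

ss_factor, ss_error, df_factor, df_error, f_stat = one_way_anova(D)
print(f"SS between = {ss_factor:.3e} (df {df_factor})")
print(f"SS within  = {ss_error:.3e} (df {df_error})")
print(f"F = {f_stat:.6f}; significant difference: {f_stat > F_CRIT}")
```

The sketch gives the same sums of squares and F statistic as the ANOVA table above, and the calculated F exceeds the critical value, matching the worksheet's conclusion of a significant difference.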


ANOVA_s2

ANOVA (Analysis of Variance) Test
-----------------------------

This Worksheet demonstrates using Mathcad’s F distribution function and programming operators to conduct an analysis of variance (ANOVA) test.

Enter sample data used in test:

An element of D represents the data collected with a particular factor.

Data Entry:

D0       D1
3.421    3.366
3.377    3.360
3.399    3.361
3.379    3.362
3.379    3.370

Enter level of significance α: α := 0.05


Program for conducting ANOVA test:

ANOVA(D, α) :=
    n_total ← 0
    SX ← 0
    SX2 ← 0
    T ← 0
    for i ∈ 0 .. last(D)
        SD_i ← Σ D_i
        nD_i ← length(D_i)
        SX ← SX + SD_i
        SX2 ← SX2 + D_i · D_i
        T ← T + (SD_i)² / nD_i
        n_total ← n_total + nD_i
    SS_factor ← T − SX² / n_total
    SS_error ← SX2 − T
    SS_total ← SX2 − SX² / n_total
    df_factor ← length(D) − 1
    df_error ← n_total − length(D)
    df_total ← n_total − 1
    Analysis_0 ← | SS_factor   df_factor   SS_factor / df_factor |
                 | SS_error    df_error    SS_error / df_error   |
                 | SS_total    df_total    0                     |
    Analysis_1 ← (Analysis_0)[0,2] / (Analysis_0)[1,2]
    Analysis_2 ← qF(1 − α, df_factor, df_error)
    Analysis_3 ← Analysis_1 < Analysis_2
    Analysis


Calculate Mean Values:

mean(D0) = 3.391    mean(D1) = 3.3638

Conducting an analysis of variance:

For a given set of grouped data D and level of significance α:

ANOVA(D, α) = [ {3,3}   9.755   5.318   0 ]   (displayed as a column vector; {3,3} denotes the nested 3 × 3 ANOVA table)

The ANOVA table, ANOVA(D, α)_0:

                  SS              df    MS
Between Groups    1.85 · 10^-3     1    1.85 · 10^-3
Within Groups     1.517 · 10^-3    8    1.896 · 10^-4
Total             3.366 · 10^-3    9    0

The Calculated F statistic: ANOVA(D, α)_1 = 9.755274

The critical F Statistic: ANOVA(D, α)_2 = 5.317655

The hypothesis test conclusion at the specified level of significance: ANOVA(D, α)_3 = 0

0 = reject hypothesis – there is a significant difference
1 = accept hypothesis – there is not a significant difference


CompareT

Comparison Test for a Set of Measurements Vs. True Value
-------------------------------------------------

DATA ENTRY:

X1 := 5.10 5.20 5.30 5.10 5.00

n := rows(X1)

Mean of X1: mean(X1) = 5.14

Enter True Value (μ): μ := 5.2

Precision (or standard deviation):

$$\mathrm{sd}(X1) := \left( \sum \frac{1}{n - 1}\,\bigl(X1 - \mathrm{mean}(X1)\bigr)^{2} \right)^{1/2}$$

sd(X1) = 0.114

Compute degrees of freedom as (n − 1): df := n − 1

Enter alpha value as α2: α2 := .95

Calculate t-table value:

$$\alpha_1 := \frac{\alpha_2 + 1}{2} \qquad t := \mathrm{qt}(\alpha_1,\ \mathrm{df})$$


t-value: t = 2.776

t experimental (Te):

$$\mathrm{Te} := \frac{\lvert \mathrm{mean}(X1) - \mu \rvert}{\mathrm{sd}(X1)} \cdot \sqrt{n}$$

Te = 1.177

If Te ≤ t-value, then there is NO SIGNIFICANT DIFFERENCE between the set of measured values and the TRUE VALUE.

If Te > t-value, then there IS A SIGNIFICANT DIFFERENCE between the set of measured values and the TRUE VALUE (i.e., they are different).
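The CompareT calculation can be reproduced with a short Python sketch. The function name `t_experimental` is hypothetical, and the t-table value 2.776 is copied from the worksheet (two-sided 95%, 4 degrees of freedom) rather than computed, since the Python standard library has no Student's t quantile function.

```python
import math

def t_experimental(values, true_value):
    """Return (mean, sd, Te) for a set of measurements versus a true value."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation with n - 1 degrees of freedom
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    te = abs(mean - true_value) / sd * math.sqrt(n)
    return mean, sd, te

x1 = [5.10, 5.20, 5.30, 5.10, 5.00]   # data from the CompareT worksheet
MU = 5.2                              # true value entered in the worksheet
T_TABLE = 2.776                       # qt(0.975, 4), copied from the worksheet

mean, sd, te = t_experimental(x1, MU)
print(f"mean = {mean:.2f}, sd = {sd:.3f}, Te = {te:.3f}")
print("SIGNIFICANT DIFFERENCE" if te > T_TABLE else "NO SIGNIFICANT DIFFERENCE")
```

The sketch reproduces sd(X1) = 0.114 and Te = 1.177; since Te is below the table value, the measurements are not significantly different from the true value.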


Comp_Meth

Computations for the Comparison of Two Methods (Youden/Steiner):
----------------------------------------------------------

RAW DATA ENTRY FOR METHOD A

(Enter single Determinations for Sample X from different laboratories using Method A):

METHOD A:

Sample X    LAB #1    LAB #2    LAB #3    LAB #4
AX :=       3.37      3.38      3.36      3.38

nX := rows(AX)    mean(AX) = 3.372

(Enter single Determinations for Sample Y from different laboratories using Method A):

Sample Y    LAB #1    LAB #2    LAB #3    LAB #4
AY :=       3.74      3.74      3.74      3.76

nY := rows(AY)    mean(AY) = 3.746

RAW DATA ENTRY FOR METHOD B:

(Enter single Determinations for Sample X from different laboratories using Method B):

METHOD B:

Sample X    LAB #1    LAB #2    LAB #3    LAB #4
BX :=       3.42      3.41      3.38      3.40

nX := rows(BX)    mean(BX) = 3.401


(Enter single Determinations for Sample Y from different laboratories using Method B):

Sample Y    LAB #1    LAB #2    LAB #3    LAB #4
BY :=       3.76      3.86      3.74      3.83

nY := rows(BY)    mean(BY) = 3.8

[Figure: Two-sample charts illustrating systematic errors for Methods A vs. B. METHOD A panel: AY and mean(AY) plotted against AX and mean(AX). METHOD B panel: BY and mean(BY) plotted against BX and mean(BX).]

CALCULATIONS:

METHOD A:

ADxy := (AX − AY)    mean(ADxy) = 0.374
ATxy := (AX + AY)    mean(ATxy) = 7.117

METHOD B:

BDxy := (BX − BY)    mean(BDxy) = 0.399
BTxy := (BX + BY)    mean(BTxy) = 7.201

d := ATxy − BTxy

Σd = 0.335

Mean Difference: mean(d) = 0.084

d2 := BTxy − ATxy


Measuring the Precision and Standard Deviation of the Methods (Youden/Steiner):
----------------------------------------------------------

Precision (Sr):

$$\mathrm{ASr} := \sqrt{\sum \frac{1}{2\,(\mathrm{nY} - 1)}\,\bigl(\mathrm{ADxy} - \mathrm{mean}(\mathrm{ADxy})\bigr)^{2}}$$

$$\mathrm{BSr} := \sqrt{\sum \frac{1}{2\,(\mathrm{nY} - 1)}\,\bigl(\mathrm{BDxy} - \mathrm{mean}(\mathrm{BDxy})\bigr)^{2}}$$

ASr = 6.692658 · 10^-3    BSr = 0.037334

Standard Deviation (Sd):

$$\mathrm{ASd} := \sqrt{\sum \frac{1}{2\,(\mathrm{nY} - 1)}\,\bigl(\mathrm{ATxy} - \mathrm{mean}(\mathrm{ATxy})\bigr)^{2}}$$

$$\mathrm{BSd} := \sqrt{\sum \frac{1}{2\,(\mathrm{nY} - 1)}\,\bigl(\mathrm{BTxy} - \mathrm{mean}(\mathrm{BTxy})\bigr)^{2}}$$

ASd = 0.013056    BSd = 0.045387

Statistical Test for presence of systematic errors (Youden/Steiner):
------------------------------------------------------

F-statistic Calculation (Fs) for Precision Ratio:

Sr² Ratio:

$$\mathrm{PFs} := \frac{\mathrm{BSr}^{2}}{\mathrm{ASr}^{2}}$$

PFs = 31.118

Ho: If Fs is less than or equal to Ft, then there is NO DIFFERENCE in Precision estimation.

Ha: If Fs is greater than Ft, then there is a DIFFERENCE in Precision estimation.


F-statistic Calculation (Fs) for Presence of Systematic Errors:

Sd² Ratio:

$$\mathrm{SFs} := \frac{\mathrm{BSd}^{2}}{\mathrm{ASd}^{2}}$$

SFs = 12.085

Ho: If Fs is less than or equal to Ft, then there is NO DIFFERENCE in systematic error for methods.

Ha: If Fs is greater than Ft, then there is a DIFFERENCE in systematic error for methods.

F-statistic Table Value (Ft):

df1 := nY − 1    df1 = 3    qF(0.95, df1, df1) = 9.277
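The two F-ratio tests of this worksheet reduce to comparing a ratio of squared standard deviations with the table value. A minimal sketch follows, using the summary statistics printed in this worksheet as inputs; the function name `variance_ratio` is hypothetical, and the table value is copied from the worksheet rather than computed.

```python
def variance_ratio(s_numerator, s_denominator):
    """F-statistic as the ratio of two variances, given standard deviations."""
    return (s_numerator / s_denominator) ** 2

# Summary statistics printed in the Comp_Meth worksheet
A_SR, B_SR = 6.692658e-3, 0.037334   # precision (Sr) for Methods A and B
A_SD, B_SD = 0.013056, 0.045387      # standard deviation (Sd) for Methods A and B
F_TABLE = 9.277                      # qF(0.95, 3, 3), copied from the worksheet

pfs = variance_ratio(B_SR, A_SR)     # precision ratio
sfs = variance_ratio(B_SD, A_SD)     # systematic-error ratio

for name, fs in (("PFs", pfs), ("SFs", sfs)):
    verdict = "DIFFERENCE" if fs > F_TABLE else "NO DIFFERENCE"
    print(f"{name} = {fs:.3f} -> {verdict}")
```

Both ratios exceed the table value, which is the basis for the conclusions that the two methods differ in precision and in systematic error.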

Student’s t-Test for the Difference in the biases between Two Methods:

mean(d) = −0.084

Mean Difference: mean(d) = 0.084

$$s := \sqrt{\sum \frac{1}{\mathrm{df1}}\,\bigl(\mathrm{d2} - \mathrm{mean}(d)\bigr)^{2}}$$

s = 0.053

$$\mathrm{sm} := \frac{s}{\sqrt{\mathrm{nY}}}$$

sm = 0.026

t-test Statistic:

$$\mathrm{Te} := \frac{\mathrm{mean}(d)}{\mathrm{sm}}$$

Te = 3.189


Enter alpha value as α2: α2 := .95

Calculate t-table value:

$$\alpha_1 := \frac{\alpha_2 + 1}{2}$$

α1 = 0.975

t := qt(α1, df1)

t-Table Value: t = 3.182

Ho: If Te is less than or equal to t, then there is NO SYSTEMATIC DIFFERENCE between method results.

Ha: If Te is greater than t, then there is a SYSTEMATIC DIFFERENCE (BIAS) between method results.

REFERENCES

1. Hinshaw, J.V., LC-GC 17(7), 616–625 (1999).
2. Mark, H. and Workman, J., Spectroscopy 2(2), 60–64 (1987).
3. Workman, J. and Mark, H., Spectroscopy 2(6), 58–60 (1987).
4. MathCad; MathSoft, Inc.: 101 Main Street, Cambridge, MA 02142; Vol. v. 7.0 (1997).
5. Mark, H. and Workman, J., Spectroscopy 10(1), 17–20 (1995).
6. Mark, H. and Workman, J., Spectroscopy 4(7), 53–54 (1989).
7. Youden, W.J. and Steiner, E.H., Statistical Manual of the AOAC, 1st ed. (Association of Official Analytical Chemists, Washington, DC, 1975).
8. Mark, H. and Workman, J., Statistics in Spectroscopy, 1st ed. (Academic Press, New York, 1991).
9. Draper, N. and Smith, H., Applied Regression Analysis (John Wiley & Sons, New York, 1981).
10. Zar, J.H., Biostatistical Analysis (Prentice Hall, Englewood Cliffs, NJ, 1974).
11. Owen, D.B., Handbook of Statistical Tables (Addison-Wesley Publishing Co., Inc., Reading, MA, 1962).

40

Is Noise Brought by the Stork? Analysis of Noise: Part 1

Well no, actually. If the truth be told, we all know that noise is brought (on) by quantum mechanics. Now, if we could some day find a really good quantum mechanic, one who could actually fix, once and for all, all those broken quanta around us, then maybe all the noise would go away, but that is probably too much to ask for and not likely to happen. It is about as likely as our getting away with making more of these sorts of bad jokes; those are more in the domain of other spectroscopy writers.

On to more serious matters: where does the noise come from, and how does noise affect our data, that is, the spectra we measure? Chemists are interested in the effects that various phenomena have on the accuracy of chemical analyses. General books about instrumental analysis discuss some of the sources of error, and even provide elementary derivations relating some of the instrumental phenomena to their effect on the error of the chemical analysis. Elementary texts [1, 2] derive a formula for the “optimum” absorbance a sample should have. More recent work has also been directed to ascertaining the “optimum” transmittance (or reflectance) value a sample should have for best quantitative accuracy, particularly for the situation when multivariate methods of analysis are in use [3, 4]. One standard treatment of the problem derives the error in concentration of an analyte caused by error of the spectral value, and presents the often-seen curve showing that the relative error in concentration, ΔC/C, goes through a minimum; it computes that the minimum occurs at a transmittance of 0.368, corresponding to an absorbance of 0.4343. More advanced texts [5] relate the measurements and the measurement process to the noise of the spectrum given the nature of different noise sources, “noise” being the term generally (although rather loosely, to be sure) used to describe error of an instrumental reading, while “error” is used more generally.
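The quoted optimum can be checked numerically. The sketch below rests on the assumptions of that standard treatment, which are stated here rather than taken from the text: Beer's law holds, so that C is proportional to −log T, and the transmittance error ΔT is constant, independent of T. Under those assumptions ΔC/C = ΔT/(T ln T), and the relative error is smallest where |1/(T ln T)| is smallest. The function name `relative_error_factor` is illustrative.

```python
import math

def relative_error_factor(t):
    """|1 / (T ln T)|: relative concentration error per unit transmittance error."""
    return 1.0 / abs(t * math.log(t))

# Scan transmittance on a fine grid, excluding the singular end points 0 and 1
grid = [i / 10000 for i in range(1, 10000)]
t_opt = min(grid, key=relative_error_factor)
a_opt = -math.log10(t_opt)           # corresponding absorbance

print(f"optimum transmittance ~ {t_opt:.4f}")
print(f"optimum absorbance    ~ {a_opt:.4f}")
```

The minimum falls at T = 1/e, about 0.368, which corresponds to an absorbance of log10(e), about 0.4343. A different noise model (for example, noise proportional to the signal or to its square root) would give a different optimum.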
At the end of the day, though, they really mean the same thing: the random variations superimposed on the desired information.

Close examination reveals that these expositions are wanting. Sometimes a simplifying assumption is made that results in an incorrect description [2]. In other cases the argument is taken into the statistical domain prematurely, leaving no room to accommodate different situations [5]. It is clear, however, that one formula cannot fit all cases. There are a large number of ways in which instruments react to various sources of variation of the signal; we summarize some of them here:

1) Many common infrared and near-infrared detectors are subject to phenomena that are mainly thermal in origin, and therefore the detector noise is independent of the signal level.

2) Some detectors for the visible and UV spectral regions can detect individual photons. These detectors are shot-noise limited. X-ray and gamma-ray spectroscopy also detects


individual photons and therefore is also limited by this source of variation. Since shot noise follows Poisson statistics, the detector noise in these cases increases with the square root of the signal.

3) Sometimes the detector noise is not the limiting noise source. One prime noise source can be generically called “scintillation noise”: variation in the amount of energy impinging on the detector. This often has mechanical causes: vibration of the source, or vignetting at an optical stop in the optical system, changing the geometry of the radiation on the detector. Astronomical measurements, of course, are subject to this noise source from atmospheric fluctuations, and represent the classic example of this type of variation. From whatever source, however, scintillation noise is directly proportional to the energy of the optical signal.

4) Other cases of non-detector noise occur when the noise is introduced after the detector. These are usually a result of limitations of the instrument and in principle could be reduced by re-engineering the instrument. Examples include power line pickup, and mechanical vibrations affecting a sensitive part (generically called “microphonics”). The magnitude of these would also tend to be independent of the signal level.

5) One noise source tends to affect spectrometers of older design, namely those that use the optical-null principle. In the case of optical-null spectrometers, various electrical (random noise and power-line pickup) and mechanical (vibrations) noise sources can be introduced after the transmittance is determined via the optical null (P.R. Griffiths, 1998, personal communication), and in those cases the error of the transmittance will be constant. In fact, because of the historical origins, this is the case that is usually treated in the extant literature. However, this is not a simple one-to-one relationship either, since it depends on how the instrument designer chose to deal with the problem.
Many of those types of instruments had variable slits, and the slits could be opened or closed during a scan, according to some preset (hardwired, to be sure: these were not computer-controlled instruments) program. One possibility, of course, was to leave the slit at a constant opening that was preset before the scan was run. A second possibility was to program the slit for a constant bandwidth across the spectrum. A third possibility was to program the slit for constant reference energy. Here again, it is clear that the noise characteristics of the instrument will depend on how the construction of the instrument determined which of these situations applied, which gives us at least three subcases here.

6) Variations in the temperature of a blackbody used as the source in a spectrometer. The energy density of blackbody radiation is given by the well-known formula:

$$\frac{dE}{d\nu} = \frac{8\pi h \nu^{3}}{c^{3}}\,\frac{V}{e^{h\nu/kt} - 1} \tag{40-1a}$$

for radiation in the frequency range from ν to ν + dν, where t is the temperature, V is the volume of the enclosure containing the radiation and h, k and c have their usual meanings. Collecting the constants (to simplify the expression), we obtain

$$\frac{dE}{d\nu} = \frac{K \nu^{3}}{e^{h\nu/kt} - 1} \tag{40-1b}$$


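Taking the derivative of this with respect to temperature, we obtain

$$\frac{d}{dt}\left(\frac{dE}{d\nu}\right) = K \nu^{3}\,\frac{-1}{\left(e^{h\nu/kt} - 1\right)^{2}}\; e^{h\nu/kt} \left(\frac{-h\nu}{kt^{2}}\right) \tag{40-2}$$

Back substituting equation 40-1b into equation 40-2, we obtain

$$\frac{d}{dt}\left(\frac{dE}{d\nu}\right) = \frac{dE}{d\nu}\;\frac{h\nu\, e^{h\nu/kt}}{kt^{2}\left(e^{h\nu/kt} - 1\right)} \tag{40-3}$$

and we see that the relative energy change (as a fraction of the energy) in the frequency interval between ν and ν + dν is given by the expression:

$$\frac{h\nu\, e^{h\nu/kt}}{kt^{2}\left(e^{h\nu/kt} - 1\right)} \tag{40-4}$$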

7) Variation of pathlength will create a source of variation in the data such that the change in absorbance is proportional to the absorbance. This can happen even in transmission spectroscopy if the walls of the sample cell for some reason should not be rigidly fixed in place, or possibly the cell might expand through temperature changes. Of course, in that case the sample itself is also likely to be affected directly; expansion of a liquid sample would have an effect equivalent to a reduction in pathlength. It can also happen, and is perhaps more common, in the case of diffuse reflectance. In that measurement technique, absent a rigorous theory to describe this physical phenomenon, the concept of a variable pathlength is used as a first approximation to the nature of the change in the measurements.

8) There are other sources of noise, whose behavior cannot be described analytically. They are often principally due to the sample. A premier example is the variability of the measured reflectance of powdered solids. Since we do not have a rigorous ab initio theory of diffuse reflectance, we cannot create analytic expressions that describe the variation of the reflectance. Situations where the sample is unavoidably inhomogeneous will also fall into this category. In all such cases the nature of the noise will be unique to each situation and would have to be dealt with on a case-by-case basis.

9) Another source of variability, which can have still different characteristics, is comprised of the interaction of any of the above factors with a nonlinearity anywhere in the system. These nonlinearities could consist of nonlinearity in the detector, in the spectrometer’s electronics, optical effects such as changes in the field of view, and so on. Many of these nonlinearities are likely to be idiosyncratic to the cause, and would have to be characterized individually and also analyzed on a case-by-case basis.

10) Another, specialized, case would be nondispersive analyzers. For these instruments the whole concept of determining the signal between ν and ν + dν is inapplicable, since the measured signal represents the integrated optical intensity of the incident radiation over a broad range of wavelengths, likely including wavelength regions where the optical radiation is weak as well as where it is strong. Furthermore, this will be sample-dependent, and almost certainly would have to be dealt with on a sample-by-sample basis.
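The three analytically describable behaviors mentioned in this list (noise independent of the signal, noise growing as the square root of the signal, and noise proportional to the signal) can be summarized in a minimal Python sketch. The function names and the scale factors are illustrative assumptions, not quantities defined in the text.

```python
import math

def constant_noise(signal, sigma0=1.0):
    """Detector-limited or post-detector noise: independent of the signal."""
    return sigma0

def shot_noise(signal, k=1.0):
    """Poisson (shot) noise: grows as the square root of the signal."""
    return k * math.sqrt(signal)

def scintillation_noise(signal, fraction=0.01):
    """Scintillation noise: directly proportional to the signal."""
    return fraction * signal

# Print the three noise levels at two (assumed) signal levels
for signal in (100.0, 10000.0):
    print(signal, constant_noise(signal), shot_noise(signal), scintillation_noise(signal))
```

One consequence worth noting: the signal-to-noise ratio rises in proportion to the signal in the first case, as its square root in the second, and not at all in the third.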


Thus, given the variety of ways that the noise output of a detector is related to the optical signal into the detector, the argument that a single formula cannot account for them all becomes even more forceful. This being so, it is clear that each case needs to be treated separately in order to obtain a correct description of the effect on the noise of the spectrum. For single-beam spectra the noise can be described directly. For ratioed spectra, it is of interest to ascertain the effect of the various noise sources on the ratioed spectrum (i.e., the transmittance or reflectance spectrum as the case may be), on the absorbance spectrum, and also to determine, as was done previously [1, 2, 5], the optimum value for the sample to have that will give the minimum error of the calculated value. We will be doing this exercise during the course of the next few chapters.

We will consider each of these types of noise one at a time. We will start from first principles, derive the appropriate expressions and deal with them in a completely rigorous manner. During the course of this we will compare our results with the ones in the literature and see where the standard derivations (NOT deviations!) depart from our presentation. We will begin in the next chapter with an analysis of the effect of one of the most common cases: constant detector noise, typical of mid-infrared and near-infrared instruments.

REFERENCES

1. Strobel, H.A., Chemical Instrumentation – A Systematic Approach to Instrumental Analysis (Addison-Wesley Publishing Co., Reading, MA, 1960).
2. Ewing, G., Instrumental Methods of Chemical Analysis, 4th ed. (McGraw-Hill, New York, 1975).
3. Honigs, D.E., Hieftje, G.M. and Hirschfeld, T., Applied Spectroscopy 39(2), 253–256 (1985).
4. Hirschfeld, T., Honigs, D. and Hieftje, G., Applied Spectroscopy 39(3), 430–433 (1985).
5. Ingle, J.D. and Crouch, S.R., Spectrochemical Analysis (Prentice-Hall, Upper Saddle River, NJ, 1988).

41 Analysis of Noise: Part 2

Note to the Reader: Chapters 41 through 53 are derived from a series of papers written about the subject of noise. They are sequential in nature and the rationale and descriptions follow a series of equations, figures and tables that are best followed using a serial numbering system running sequentially throughout the chapter series. Thus the equations, figures, and tables for these chapters will contain the chapter number and then the sequential equation, figure, or table number. For example chapter 42 begins with Equation 41-19 and this equation would be designated as (42-19), following a

$$T = \frac{E_s - E_{0s}}{E_r - E_{0r}} \tag{41-1}$$

where Es and Er represent the signal due to the sample and reference readings, respectively, and E0s and E0r are the “dark” or “blank” readings associated with Es and Er. (Er − E0r), of course, must be non-zero. The measured value of T, including the error ΔT, is

$$T + \Delta T = \frac{(E_s + \Delta E'_s) - (E_{0s} + \Delta E'_{0s})}{(E_r + \Delta E'_r) - (E_{0r} + \Delta E'_{0r})} \tag{41-2}$$

where the Δ terms represent the fluctuation in the reading due to instantaneous random effect of noise. An important point to note here is that Es, Er, and T, for any given set of readings at a given wavelength, are constants. All variations in the readings, due to noise, are associated with ΔEs, ΔEr, and ΔT. Rearranging equation 41-2 we have

$$T + \Delta T = \frac{(E_s - E_{0s}) + (\Delta E'_s - \Delta E'_{0s})}{(E_r - E_{0r}) + (\Delta E'_r - \Delta E'_{0r})} \tag{41-3}$$

The difference between two random variables is itself a random variable, therefore we replace the terms (ΔE′s − ΔE′0s) and (ΔE′r − ΔE′0r) in equation 41-3 with the equivalent, simpler terms ΔEs and ΔEr, respectively:

$$T + \Delta T = \frac{E_s - E_{0s} + \Delta E_s}{E_r - E_{0r} + \Delta E_r} \tag{41-4}$$

The presence of a non-zero dark reading, E0, will, of course, cause an error in the value of T computed. However, this is a systematic error and therefore is of no interest to us here; we are interested only in the behavior of random variables. Therefore we set E0s and E0r equal to zero and note, if T as described in equation 41-1 represents the “true” value of the transmittance, then the value we obtain for a given reading, including the instantaneous random effect of noise, is

$$T + \Delta T = \frac{E_s + \Delta E_s}{E_r + \Delta E_r} \tag{41-5}$$

and we also find that upon setting E0s and E0r equal to zero in equation 41-1, equation 41-1 becomes

$$T = \frac{E_s}{E_r} \tag{41-6}$$

In equation 41-5, ΔEs and ΔEr represent the instantaneous, random values of the change in the sample and reference readings due to the noise. Since, as we noted above, T, Es, and Er are constant for any given reading, any change in the measured value due to noise is contained in the terms ΔEr and ΔEs. In statistical jargon this would be called “a point estimate of T from a single reading”, and ΔT is the corresponding instantaneous change in the computed value of the transmittance. Again, Er must be non-zero. We note here that ΔEs and ΔEr need not be equal; that will not affect the derivation. For the case we are considering in this chapter, however, we are assuming constant detector noise, therefore when we pass to the statistical domain, we will consider ΔEs to be equal to ΔEr. That, of course, refers only to the expected values; since the noise is random, the instantaneous values will virtually never be the same. Upon subtracting equation 41-6 from equation 41-5 we obtain the following:

$$T + \Delta T - T = \frac{E_s + \Delta E_s}{E_r + \Delta E_r} - \frac{E_s}{E_r} \tag{41-7}$$

$$\Delta T = \frac{E_r (E_s + \Delta E_s) - E_s (E_r + \Delta E_r)}{E_r (E_r + \Delta E_r)} \tag{41-8}$$

$$\Delta T = \frac{E_r E_s + E_r \Delta E_s - E_s E_r - E_s \Delta E_r}{E_r (E_r + \Delta E_r)} \tag{41-9}$$

$$\Delta T = \frac{E_r \Delta E_s - E_s \Delta E_r}{E_r (E_r + \Delta E_r)} \tag{41-10}$$
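The algebra leading from equation 41-7 to equation 41-10 involves no approximation, which can be confirmed with exact rational arithmetic. The following Python sketch uses arbitrary illustrative readings (assumed values, not from the text) and the standard-library Fraction type, so that the two results can be compared without floating-point rounding.

```python
from fractions import Fraction

def delta_t_direct(es, er, d_es, d_er):
    """Equation 41-7: measured transmittance minus true transmittance."""
    return (es + d_es) / (er + d_er) - es / er

def delta_t_eq_41_10(es, er, d_es, d_er):
    """Equation 41-10: the same quantity after combining the fractions."""
    return (er * d_es - es * d_er) / (er * (er + d_er))

# Arbitrary illustrative readings (assumed), kept as exact fractions
es, er = Fraction(3, 10), Fraction(1)
d_es, d_er = Fraction(1, 100), Fraction(-1, 50)

print(delta_t_direct(es, er, d_es, d_er), delta_t_eq_41_10(es, er, d_es, d_er))
# → 4/245 4/245
```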


Equation 41-10 might look familiar. If you check an elementary calculus book, you will find that it is about the second-to-last step in the derivation of the derivative of a ratio (about all you need to do is go to the limit as ΔEs and ΔEr → zero). However, for our purposes we can stop here and consider equation 41-10. We find that the total change in T, that is ΔT, is the result of two contributions:

$$\Delta T = \frac{E_r \Delta E_s}{E_r (E_r + \Delta E_r)} - \frac{E_s \Delta E_r}{E_r (E_r + \Delta E_r)} \tag{41-11}$$

We note that, since by assumption Er is non-zero, and ΔEr is non-zero and independent of Er, the first term of equation 41-11 is non-zero. The value of the second term of equation 41-11, however, will depend on the value of Es, that is on the transmittance of the sample. In order to determine the standard deviation of T we need to consider what would happen if we take multiple sample and reference readings; then we can characterize the variability of T. Since Er and Es are fixed quantities, when we take multiple readings we note that we arrive at different values of T + ΔT due to the differences in the values of ΔEr and ΔEs on each reading, causing a change in ΔT. Therefore we need to compute the standard deviation of ΔT, which we do from the expression for ΔT in equation 41-11:

$$\mathrm{SD}(\Delta T) = \mathrm{SD}\left(\frac{E_r \Delta E_s}{E_r (E_r + \Delta E_r)} - \frac{E_s \Delta E_r}{E_r (E_r + \Delta E_r)}\right) \tag{41-12}$$

Or equivalently, we calculate the variance of ΔT, which is the square of the standard deviation:

$$\mathrm{Var}(\Delta T) = \mathrm{Var}\left(\frac{E_r \Delta E_s}{E_r (E_r + \Delta E_r)} - \frac{E_s \Delta E_r}{E_r (E_r + \Delta E_r)}\right) \tag{41-13}$$

The proof that the variance of the sum of two terms is equal to the sum of the variances of the individual terms is a standard derivation in Statistics, but since most chemists are not familiar with it we present it in the Appendix. Having proven that theorem, and noting that ΔEs and ΔEr are independent random variables, they are uncorrelated and we can apply that theorem to show that the variance of ΔT is:

$$\mathrm{Var}(\Delta T) = \mathrm{Var}\left(\frac{E_r \Delta E_s}{E_r (E_r + \Delta E_r)}\right) + \mathrm{Var}\left(\frac{-E_s \Delta E_r}{E_r (E_r + \Delta E_r)}\right) \tag{41-14}$$

Since ΔEr is small compared to Er, the ΔEr in the denominator terms will have little effect on the variance of T and in the limit approaches zero. In a case where this is not true, the derivation must be suitably modified to include this term. This is relatively straightforward: substitute the parenthesized terms into the equation for variance (e.g., as we do in the Appendix), hook up about a 100-hp motor or so and “turn the crank”, as we will do in due course. It is mostly algebra, although a lot of it! In our current development, however, we assume ΔEr is small and therefore negligible compared to Er, so we replace (Er + ΔEr) with Er:

$$\mathrm{Var}(\Delta T) = \mathrm{Var}\left(\frac{E_r \Delta E_s}{E_r^2}\right) + \mathrm{Var}\left(\frac{-E_s \Delta E_r}{E_r^2}\right) \tag{41-15}$$

$$\mathrm{Var}(\Delta T) = \mathrm{Var}\left(\frac{\Delta E_s}{E_r}\right) + \mathrm{Var}\left(\frac{-T \Delta E_r}{E_r}\right) \tag{41-16}$$

We have shown previously that if a represents a constant, then Var(aX) = a² Var(X) ([2], or see [3] Chapter 11, p. 94). Hence equation 41-16 becomes

$$\mathrm{Var}(\Delta T) = \left(\frac{1}{E_r}\right)^2 \mathrm{Var}(\Delta E_s) + \left(\frac{-T}{E_r}\right)^2 \mathrm{Var}(\Delta E_r) \tag{41-17}$$

Since we have assumed constant detector noise for this chapter, Var(ΔEs) = Var(ΔEr) = Var(ΔE):

$$\mathrm{Var}(\Delta T) = \frac{1 + T^2}{E_r^2}\,\mathrm{Var}(\Delta E) \tag{41-18}$$

Finally, reconverting variance back to SD by taking square roots on both sides of equation 41-18:

$$\mathrm{SD}(\Delta T) = \sqrt{1 + T^2}\;\frac{\mathrm{SD}(\Delta E)}{E_r} \tag{41-19}$$
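Equation 41-19 can be checked, at least roughly, by simulation. The Python sketch below draws many pairs of noisy sample and reference readings and compares the observed standard deviation of their ratio with the prediction. It assumes Gaussian noise, a noise-to-signal ratio of 0.01, and a fixed random seed; none of these choices comes from the text, and the function names are ours.

```python
import math
import random
import statistics

def simulated_sd_of_t(t_true, er=1.0, sd_e=0.01, n=20000, seed=0):
    """Standard deviation of (Es + dEs)/(Er + dEr) over n simulated readings."""
    rng = random.Random(seed)
    es = t_true * er
    readings = [
        (es + rng.gauss(0.0, sd_e)) / (er + rng.gauss(0.0, sd_e))
        for _ in range(n)
    ]
    return statistics.stdev(readings)

def predicted_sd_of_t(t_true, er=1.0, sd_e=0.01):
    """Equation 41-19."""
    return math.sqrt(1.0 + t_true**2) * sd_e / er

# Compare simulation and prediction at three transmittance levels
for t in (0.0, 0.5, 1.0):
    print(t, simulated_sd_of_t(t), predicted_sd_of_t(t))
```

With noise this small compared to Er, the simulated and predicted values should agree to within the sampling error of the simulation, and both should rise by a factor of about 1.414 between T = 0 and T = 1.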

We remind our readers here that ΔE, as we have been using it in this derivation, is the difference between ΔE′ and ΔE′0 in equation 41-4, and the expected value in the statistical nomenclature is therefore $2^{1/2}$ times as large as ΔE′ (due to the fact that it is the result of the difference between random variables with equal variance), a difference that should be taken note of when comparing results with the original definition of S/N in equation 41-2.

We next note, and this is in accordance with expectations, that the noise of the transmission spectrum, SD(ΔT), is dependent on the noise-to-signal ratio of the readings, the inverse of the S/N ratio commonly used and presented as a spectrometer specification; at least, as long as the noise is small compared to the reference energy reading so that the approximation made in equation 41-15 remains valid. Recall that Er is the energy of the reference reading and SD(ΔE) is the noise of the readings from the detector; this ratio of SD(ΔE)/Er is the (inverse of the) true signal-to-noise ratio; the noise observed on a transmission spectrum, while related to S/N, is in itself not the true S/N ratio.

Next we note further, and this is probably contrary to most spectroscopists’ expectations, that the noise of the transmittance spectrum is not constant, but depends on the transmittance of the sample, being higher for highly transmitting samples than for dark samples. Since T can vary from 0 (zero) to 1 (unity), the noise level can vary by a factor of the square root of two, from a relative value of unity (when T = 0) to 1.414... (when T = 1). This behavior is shown in Figure 41-1. The increase in noise with increasing signal might be considered counterintuitive, and therefore surprising, by some. Intuition tells us that the S/N ratio might be expected to improve with increased signal regardless of its source, or that the noise level of the transmittance spectrum should at least remain constant, for constant detector noise. This misapprehension has worked its way into the literature to modern times: “In most infrared measurements situations, the detector constitutes the limiting noise source. Because the resulting fluctuations have the same effect as a fixed uncertainty in the signal readout, they appear as a constant error in the transmittance”. [4]

[Figure 41-1: plot of relative noise (vertical axis, 0 to 1.6) against sample transmittance (horizontal axis, 0 to 1).]

Figure 41-1 Noise level of a transmittance spectrum as a function of the sample transmittance.

Intuition tells us that if the transmittance is zero, then it should have no effect on the readings. In fact this is true, but misleading. The transmittance being zero, or the sample energy being zero, does not mean that the variability of the reading is zero. The explanation of the actual behavior comes from a careful perusal of the intermediate equations developed in the course of arriving at equation 41-19, specifically equation 41-14. From the first term in that equation we see that the irreducible minimum noise is contributed by the reference signal level (Er) multiplied by the variation of the sample signal (ΔEs), independently of the value of the sample signal. Increasing sample signal then serves to add additional noise to the total, through its contribution, in the second term of equation 41-14, which comes from the sample signal through its being multiplied by the reference noise.

Conventional developments of the subject contain flaws that are usually hidden and subtle. In Ewing’s book, for example [5], the development includes the step (see page 43, the section between equations 3-6 and 3-7) of noting that, since the reference energy is essentially set equal to unity, log(Er) (or P0, the equivalent in Ewing’s terminology) is set equal to zero. However, this is done before the separation of P0 from ΔP0, creating the implicit, but erroneous, result that ΔP0 is zero as well. In our nomenclature, this causes the second term of equation 41-14 to vanish, and as a consequence the erroneous result obtained is that ΔT is independent of T. This, of course, appears to confirm intuition and since it is based on mathematics, appears to be beyond question. Other treatments [6] simply do not question the origin of the noise in T and assume a priori that it is constant, and work from there.

The more sophisticated treatment of Ingle and Crouch [7] comes very close but also misses the mark; for an unexplained reason they insert the condition: “... it is assumed there is no uncertainty in measuring Ert and E0t ...”. Now in fact this could happen (or at least there could be no variation in ΔEr); for example, if one reference spectrum was used in conjunction with multiple sample spectra using an FTIR spectrometer. However, that would not be a true indication of the total error of the measurement, since the effect of the noise in the reference reading would have been removed from the calculated SD, whereas the true total error of the reading would in fact include that source of error, even though part of it were constant. It is to their credit that these authors explicitly state their assumption that they ignore the variability of Er rather than hiding it. Furthermore they allude to the fact that something is going on when they state “... the approximation is good to within a factor of $2^{1/2}$.” Nevertheless they failed to follow through and derive the exact solution to the problem.

The bottom line to all this is that in one way or another, previous treatments of this subject have invariably failed to consider the effect of the noise of the reference reading, and therefore arrived at an erroneous conclusion.

Whew! I think that is enough for one chapter. I need a rest. And so does the typesetter! We will continue the derivation in our next chapter.

APPENDIX

Proof that the variance of a sum equals the sum of the variances

Let A and B be random variables. Then the variance of (A + B) is by definition:

$$\mathrm{Var}(A + B) = \frac{\sum \left[(A + B) - \overline{(A + B)}\right]^2}{n - 1} \tag{41-A1}$$

Since $\overline{(A + B)} = \bar{A} + \bar{B}$, we can separate the numerator terms and then expand the numerator:

$$\mathrm{Var}(A + B) = \frac{\sum \left( \begin{aligned} &A^2 + AB - A\bar{A} - A\bar{B} + AB + B^2 - \bar{A}B - B\bar{B} \\ &- A\bar{A} - \bar{A}B + \bar{A}^2 + \bar{A}\bar{B} - A\bar{B} - B\bar{B} + \bar{A}\bar{B} + \bar{B}^2 \end{aligned} \right)}{n - 1} \tag{41-A2}$$

We can now collect terms as follows:

$$\mathrm{Var}(A + B) = \frac{\sum (A^2 - 2A\bar{A} + \bar{A}^2)}{n - 1} + \frac{\sum (B^2 - 2B\bar{B} + \bar{B}^2)}{n - 1} + 2\,\frac{\sum (A - \bar{A})(B - \bar{B})}{n - 1} \tag{41-A3}$$

Equation 41-A3 can be checked by expanding the last term, collecting terms and verifying that all the terms of equation 41-A2 are regenerated. The third term in equation 41-A3 contains a quantity called the covariance between A and B. The covariance is a quantity related to the correlation coefficient. Since the differences from the mean are randomly positive and negative, the product of the two differences from their respective means is also randomly positive and negative, and tends to cancel when summed. Therefore, for independent random variables the covariance is zero, since the correlation coefficient is zero for uncorrelated variables. In fact, the mathematical definition of “uncorrelated” is that this sum-of-cross-products term is zero. Therefore, since A and B are random, uncorrelated variables:

$$\mathrm{Var}(A + B) = \frac{\sum (A - \bar{A})^2}{n - 1} + \frac{\sum (B - \bar{B})^2}{n - 1} \tag{41-A4}$$

The two terms of equation 41-A4 are, by definition, the variances of A and B.

$$\mathrm{Var}(A + B) = \mathrm{Var}(A) + \mathrm{Var}(B) \tag{41-A5}$$

QED
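The theorem can also be illustrated numerically. The Python sketch below generates two independent sets of Gaussian values (the means, standard deviations, sample size and seed are arbitrary assumptions) and evaluates both equation 41-A3, which holds exactly for any data, and equation 41-A5. For a finite sample of independent variables the cross-product term is small rather than exactly zero, so the second comparison is only approximate.

```python
import random
import statistics

rng = random.Random(1)
n = 20000
a = [rng.gauss(5.0, 2.0) for _ in range(n)]
b = [rng.gauss(-1.0, 3.0) for _ in range(n)]   # generated independently of a
total = [x + y for x, y in zip(a, b)]

mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
# Sample covariance: the sum of cross products divided by n - 1
covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)

var_a, var_b = statistics.variance(a), statistics.variance(b)
var_sum = statistics.variance(total)

print(var_sum, var_a + var_b + 2 * covariance)  # equation 41-A3: agree to rounding
print(var_sum, var_a + var_b)                   # equation 41-A5: agree approximately
```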

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 15(10), 24–25 (2000).
2. Mark, H. and Workman, J., Spectroscopy 3(8), 13–15 (1988).
3. Mark, H. and Workman, J., Statistics in Spectroscopy, 1st ed. (Academic Press, New York, 1991).
4. Honigs, D.E., Hieftje, G.M. and Hirschfeld, T., Applied Spectroscopy 39(2), 253–256 (1985).
5. Ewing, G., Instrumental Methods of Chemical Analysis, 4th ed. (McGraw-Hill, New York, 1975).
6. Strobel, H.A., Chemical Instrumentation – A Systematic Approach to Instrumental Analysis (Addison-Wesley Publishing Co., Reading, MA, 1960).
7. Ingle, J.D. and Crouch, S.R., Spectrochemical Analysis (Prentice-Hall, Upper Saddle River, NJ, 1988).


42

Analysis of Noise: Part 3

We have been discussing the question of how noise in a spectrometer affects the observed noise in the spectra we measure. This question was introduced [1] and various known phenomena were presented that contribute (or, at least, can contribute) to the noise level of the observed spectra. Since this is a continuation of the previous chapters, we will continue the numbering of equations, figures, and so on as though it were all one chapter. In Chapter 41, based on reference [2], we derived the following expression for the noise of a transmission measurement, for the case of constant detector noise, as is commonly found in IR and NIR spectrometers:

$$\mathrm{SD}(\Delta T) = \frac{\mathrm{SD}(\Delta E)}{E_r}\sqrt{1 + T^2} \tag{42-19, also shown as 41-19}$$

To continue the derivation, the next step is to determine the variation of the absorbance readings, starting with the definition of absorbance. The extension we present here, of course, is based on Beer’s law, which is valid for clear solutions. For other types of measurements, diffuse reflectance for example, the derivation should be based on a suitable function of T that applies to the situation, for example the Kubelka-Munk function for diffuse reflectance should be used for that case:

$$A = -\log(T) \tag{42-20a}$$

$$A = -0.4343 \ln(T) \tag{42-20b}$$

We take the derivative,

$$dA = -0.4343\,\frac{dT}{T} \tag{42-21}$$

and substitute the expressions for T (Equation 41-6) and dT, replacing the differentials by finite differences so that we can use the expression for ΔT found previously (Equation 41-11):

$$\Delta A = \frac{-0.4343\left[\dfrac{E_r \Delta E_s}{E_r (E_r + \Delta E_r)} - \dfrac{E_s \Delta E_r}{E_r (E_r + \Delta E_r)}\right]}{\dfrac{E_s}{E_r}} \tag{42-22}$$

$$\Delta A = \frac{-0.4343\,E_r}{E_s}\left[\frac{E_r \Delta E_s}{E_r (E_r + \Delta E_r)} - \frac{E_s \Delta E_r}{E_r (E_r + \Delta E_r)}\right] \tag{42-23}$$

$$\Delta A = \frac{-0.4343\,E_r}{E_s}\left[\frac{E_r \Delta E_s - E_s \Delta E_r}{E_r (E_r + \Delta E_r)}\right] \tag{42-24}$$


Again allowing ourselves to neglect ΔEr in comparison with Er:

$$\Delta A = \frac{-0.4343}{E_s}\left[\frac{E_r \Delta E_s - E_s \Delta E_r}{E_r}\right] \tag{42-25}$$
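To get a feeling for how good the first-order expression in equation 42-25 is, the Python sketch below compares it with the change in absorbance computed directly from A = −log(T). The readings and noise excursions are assumed example values chosen to be small compared to Er; the function names are ours.

```python
import math

def delta_a_exact(es, er, d_es, d_er):
    """Change in absorbance computed directly from A = -log10(Es/Er)."""
    return -math.log10((es + d_es) / (er + d_er)) + math.log10(es / er)

def delta_a_eq_42_25(es, er, d_es, d_er):
    """Equation 42-25: first-order (linearised) change in absorbance."""
    return -0.4343 * (er * d_es - es * d_er) / (es * er)

es, er = 0.4, 1.0             # assumed readings, T = 0.4
d_es, d_er = 0.002, -0.001    # assumed small noise excursions
print(delta_a_exact(es, er, d_es, d_er), delta_a_eq_42_25(es, er, d_es, d_er))
```

For excursions of this size the two values should differ by well under one percent; the agreement degrades as the excursions grow relative to Es and Er.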

At this stage we have two branches of a derivation “tree” to pursue: one is to determine the standard deviation of ΔA, the other is to continue the derivation, toward the final result corresponding to the “standard” treatments of the topic, but using our rigorously derived equations. We start with the computation of the standard deviation of ΔA, which is straightforward. We cut the derivations short slightly, however, in that the process we will use will apply the same sequence of steps as we applied to the case of ΔT, as we previously showed [2], but present only the results of each step, not all the intermediate equations. These steps are: separating the fraction in equation 42-25 into two terms, taking the variance of both sides of the equation, noting that Var(ΔEs) = Var(ΔEr) = Var(ΔE), applying the two theorems that tell us

1) Var(X + Y) = Var(X) + Var(Y)
2) Var(aX) = a² Var(X)

simplifying the expressions when possible and then taking square roots again. So we start by multiplying through and separating the fractions in equation 42-25:

$$\Delta A = \frac{-0.4343\,\Delta E_s}{E_s} + \frac{0.4343\,\Delta E_r}{E_r} \tag{42-26}$$

taking the variance of both sides of the equation:

$$\mathrm{Var}(\Delta A) = \mathrm{Var}\left(\frac{-0.4343\,\Delta E_s}{E_s} + \frac{0.4343\,\Delta E_r}{E_r}\right) \tag{42-27}$$

apply the theorem Var(X + Y) = Var(X) + Var(Y):

$$\mathrm{Var}(\Delta A) = \mathrm{Var}\left(\frac{-0.4343\,\Delta E_s}{E_s}\right) + \mathrm{Var}\left(\frac{0.4343\,\Delta E_r}{E_r}\right) \tag{42-28}$$

and then the theorem Var(aX) = a² Var(X):

$$\mathrm{Var}(\Delta A) = \left(\frac{-0.4343}{E_s}\right)^2 \mathrm{Var}(\Delta E_s) + \left(\frac{0.4343}{E_r}\right)^2 \mathrm{Var}(\Delta E_r) \tag{42-29}$$

Let Var(ΔEs) = Var(ΔEr) = Var(ΔE):

$$\mathrm{Var}(\Delta A) = \left(\frac{-0.4343}{E_s}\right)^2 \mathrm{Var}(\Delta E) + \left(\frac{0.4343}{E_r}\right)^2 \mathrm{Var}(\Delta E) \tag{42-30}$$

$$\mathrm{Var}(\Delta A) = \left[\left(\frac{-1}{E_s}\right)^2 + \left(\frac{1}{E_r}\right)^2\right] 0.4343^2\,\mathrm{Var}(\Delta E) \tag{42-31}$$

and finally:

$$\mathrm{SD}(\Delta A) = 0.4343\,\mathrm{SD}(\Delta E)\sqrt{\frac{1}{E_s^2} + \frac{1}{E_r^2}} \tag{42-32}$$

We may compare this with SD(ΔA) that would be obtained if ΔEr were set to zero in equation 42-25 (as per the conventional derivation):

$$\mathrm{SD}(\Delta A) = \frac{0.4343\,\mathrm{SD}(\Delta E)}{E_s} \tag{42-33}$$

Since Es can go from zero to Er, it is interesting and instructive to plot these two functions, in order to compare the effect of eliminating the terms involving ΔEr from the expressions. We do this in Figure 42-2. To continue on the second branch of our derivation “tree” as described above, we next derive expressions for relative precision, ΔA/A, starting with the use of equations 42-20b and 42-25:

$$\frac{\Delta A}{A} = \frac{\dfrac{-0.4343}{E_s}\left[\dfrac{E_r \Delta E_s - E_s \Delta E_r}{E_r}\right]}{-0.4343 \ln(T)} \tag{42-34}$$

$$\frac{\Delta A}{A} = \frac{E_r \Delta E_s - E_s \Delta E_r}{E_s E_r \ln(T)} \tag{42-35}$$

$$\frac{\Delta A}{A} = \frac{1}{\ln(T)}\left(\frac{\Delta E_s}{E_s} - \frac{\Delta E_r}{E_r}\right) \tag{42-36}$$

[Figure 42-2: plot titled “Exact versus approximate solution”, showing absorbance noise (vertical axis, 0 to 0.6) against %T (horizontal axis, 0 to 1).]

Figure 42-2 Absorbance noise as a function of transmittance, for the exact solution (upper curve: equation 42-32) and the approximate solution (lower curve: equation 42-33). The noise-to-signal ratio, i.e., ΔE/Er, was set to 0.01. (see Color Plate 3)
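The two curves of Figure 42-2 are easy to tabulate. The Python sketch below evaluates equations 42-32 and 42-33 at a few transmittance values, using Es = T·Er, an assumed reference energy of unity, and the noise-to-signal ratio of 0.01 quoted in the caption. The function names are our own.

```python
import math

def sd_a_exact(t, er=1.0, sd_e=0.01):
    """Equation 42-32, with Es = T * Er."""
    es = t * er
    return 0.4343 * sd_e * math.sqrt(1.0 / es**2 + 1.0 / er**2)

def sd_a_approx(t, er=1.0, sd_e=0.01):
    """Equation 42-33: reference-reading noise neglected."""
    return 0.4343 * sd_e / (t * er)

# Tabulate both expressions at a few transmittance values
for t in (0.1, 0.2, 0.5, 1.0):
    print(f"T={t:.1f}  exact={sd_a_exact(t):.5f}  approx={sd_a_approx(t):.5f}")
```

The ratio of the exact to the approximate value is the square root of (1 + T²), so the two expressions nearly coincide for dark samples and differ by a factor of about 1.414 at T = 1.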


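Again going through the steps needed to convert to the statistical domain (as we did before) we first take the variance of both sides of equation 42-36 to obtain

$$\mathrm{Var}\left(\frac{\Delta A}{A}\right) = \mathrm{Var}\left[\frac{1}{\ln(T)}\left(\frac{\Delta E_s}{E_s} - \frac{\Delta E_r}{E_r}\right)\right] \tag{42-36a}$$

Then apply the theorem Var(A + B) = Var(A) + Var(B):

$$\mathrm{Var}\left(\frac{\Delta A}{A}\right) = \mathrm{Var}\left[\frac{1}{\ln(T)}\,\frac{\Delta E_s}{E_s}\right] + \mathrm{Var}\left[\frac{1}{\ln(T)}\,\frac{-\Delta E_r}{E_r}\right] \tag{42-36b}$$

And then the theorem Var(aX) = a² Var(X):

$$\mathrm{Var}\left(\frac{\Delta A}{A}\right) = \left(\frac{1}{E_s \ln(T)}\right)^2 \mathrm{Var}(\Delta E_s) + \left(\frac{-1}{E_r \ln(T)}\right)^2 \mathrm{Var}(\Delta E_r) \tag{42-36c}$$

$$\mathrm{Var}\left(\frac{\Delta A}{A}\right) = \frac{1}{E_s^2 \ln(T)^2}\,\mathrm{Var}(\Delta E_s) + \frac{1}{E_r^2 \ln(T)^2}\,\mathrm{Var}(\Delta E_r) \tag{42-36d}$$

$$\mathrm{Var}\left(\frac{\Delta A}{A}\right) = \frac{1}{\ln(T)^2}\left[\frac{1}{E_s^2}\,\mathrm{Var}(\Delta E_s) + \frac{1}{E_r^2}\,\mathrm{Var}(\Delta E_r)\right] \tag{42-36e}$$

Then setting Var(ΔEs) = Var(ΔEr) = Var(ΔE):

$$\mathrm{Var}\left(\frac{\Delta A}{A}\right) = \frac{1}{\ln(T)^2}\left[\frac{1}{E_s^2}\,\mathrm{Var}(\Delta E) + \frac{1}{E_r^2}\,\mathrm{Var}(\Delta E)\right] \tag{42-36f}$$

$$\mathrm{Var}\left(\frac{\Delta A}{A}\right) = \frac{\mathrm{Var}(\Delta E)}{\ln(T)^2}\left[\frac{1}{E_s^2} + \frac{1}{E_r^2}\right] \tag{42-36g}$$

$$\mathrm{Var}\left(\frac{\Delta A}{A}\right) = \frac{\mathrm{Var}(\Delta E)}{\ln(T)^2}\left[\frac{E_s^2 + E_r^2}{E_s^2 E_r^2}\right] \tag{42-36h}$$

And finally, taking square roots on both sides to convert to standard deviations, and substituting Es/Er for T:

$$\mathrm{SD}\left(\frac{\Delta A}{A}\right) = \frac{-\mathrm{SD}(\Delta E)\sqrt{E_s^2 + E_r^2}}{E_s E_r \ln(E_s/E_r)} \tag{42-37}$$

We may compare this, for example, with the equation at an equivalent point in Ingle and Crouch’s development [3] (taking that as a “typical” derivation):

$$\frac{\sigma_A}{A} = \frac{-\sigma_{st}}{T E_r \ln(T)} \tag{Ingle and Crouch’s equation 5-45}$$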

The relationship and differences between the two equations are obvious, except we may note that, while σ can never be negative, there is always the issue, when taking a square root, of determining the sign. Since Es/Er is less than unity, the logarithm in the denominator is negative and therefore we must determine that the sign of the square root in the numerator is also negative in order to obtain a positive value for SD(ΔA). Equation 42-37 then reduces to the Ingle & Crouch equation if ΔEr goes to zero (as Ingle & Crouch assume) and we pass to the statistical domain. Again, it is interesting and instructive to compare the two expressions by plotting them as a function of T, which we do in Figure 42-3.

[Figure 42-3: two panels plotting Δ(A)/A against %T, each showing an exact and an approximate curve. Upper panel: “Exact versus Approx Solution for SD [Δ(A)/A]”. Lower panel: “Expansion of SD [Δ(A)/A]”.]

Figure 42-3 Comparison of the exact (upper curve: equation 42-37) and approximate (lower curve: Ingle and Crouch equation 5-45) expressions for the standard deviation of ΔA/A as a function of %T. Noise-to-signal is set at 0.01.

From Figure 42-3 we also see the well-known effect on the relative precision of spectral analysis of, on the one hand, T → 0 and on the other the effect of ln(T) → 0 as T → 1. The minimum relative error occurs, in the standard treatment, at T = 0.368 [4]. Examining the data table from which Figure 42-3 was created (using EXCEL™) confirms what Figure 42-3 leads us to suspect: using the exact solution, there is a shift from the previously accepted value; the optimum value of transmittance occurs at 33.0%T rather than the generally accepted value of 36.8%T. We wish to develop an analytic expression for this situation. To do so, we will follow the same steps used in the standard development, but use the rigorously correct equation (i.e., equation 42-37) instead of the approximate equation previously used. The steps are the standard ones used for finding a minimum (or maximum) of a function: take the derivative of equation 42-37, then set that derivative equal to zero. Since the derivative of interest is the derivative with respect to T, in preparation for this we reorganize equation 42-37 as follows: we substitute equation 41-6 (reference [2]), reorganized to Es = T Er (since Er is a constant), into equation 42-37; this enables us to eliminate Es from the equation:

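$$\mathrm{SD}\left(\frac{\Delta A}{A}\right) = \frac{\mathrm{SD}(\Delta E)\sqrt{(T E_r)^2 + E_r^2}}{T E_r E_r \ln(T)} \tag{42-38}$$

$$\mathrm{SD}\left(\frac{\Delta A}{A}\right) = \frac{\mathrm{SD}(\Delta E)\sqrt{E_r^2 (T^2 + 1)}}{T E_r^2 \ln(T)} \tag{42-39}$$

$$\mathrm{SD}\left(\frac{\Delta A}{A}\right) = \frac{\mathrm{SD}(\Delta E)\sqrt{T^2 + 1}}{T E_r \ln(T)} \tag{42-40}$$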

We could work with equation 42-40, but it is instructive to slightly reorganize it:

$$\mathrm{SD}\left(\frac{\Delta A}{A}\right) = \frac{\mathrm{SD}(\Delta E)}{E_r}\,\frac{\sqrt{T^2 + 1}}{T \ln(T)} \tag{42-41}$$

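SD(ΔE)/Er is, as before, the noise-to-signal ratio of the reference signal. We can also note that if the variation of the reference reading was neglected, then the term under the radical would simply be unity and the expression would again reduce to the conventional expression. We are now ready to take the derivative with respect to T:

$$\frac{d}{dT}\,\mathrm{SD}\left(\frac{\Delta A}{A}\right) = \frac{d}{dT}\left[\frac{\mathrm{SD}(\Delta E)}{E_r}\,\frac{\sqrt{T^2 + 1}}{T \ln(T)}\right] \tag{42-42}$$

We note again that

$$\frac{d}{dT}\,\mathrm{SD}\left(\frac{\Delta A}{A}\right) = \frac{\mathrm{SD}(\Delta E)}{E_r}\,\frac{d}{dT}\left[\frac{\sqrt{T^2 + 1}}{T \ln(T)}\right] \tag{42-43}$$

Applying the theorem for the derivative of a ratio:

$$\frac{d}{dT}\,\mathrm{SD}\left(\frac{\Delta A}{A}\right) = \frac{\mathrm{SD}(\Delta E)}{E_r}\left\{\frac{T \ln(T)\,\dfrac{d}{dT}\sqrt{T^2 + 1} - \sqrt{T^2 + 1}\,\dfrac{d}{dT}\big(T \ln(T)\big)}{\big(T \ln(T)\big)^2}\right\} \tag{42-44}$$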


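Since the derivative in the first term in the numerator of equation 42-44 is of the form $U^n$, where n has the value of 1/2, we apply the theorem that the derivative of $U^n$ is $nU^{n-1}\,dU$ to that part. And since the derivative in the second term of equation 42-44 is of the form U × V, where U = T and V = ln(T), we apply the theorem that the derivative of the product U × V is U dV + V dU to that part, then:

$$\frac{d}{dT}\,\mathrm{SD}\left(\frac{\Delta A}{A}\right) = \frac{\mathrm{SD}(\Delta E)}{E_r}\left\{\frac{T \ln(T)\,\dfrac{1}{2\sqrt{T^2 + 1}}\,\dfrac{d}{dT}(T^2 + 1) - \sqrt{T^2 + 1}\left(T\,\dfrac{d}{dT}\ln(T) + \ln(T)\,\dfrac{d}{dT}T\right)}{\big(T \ln(T)\big)^2}\right\} \tag{42-45}$$

Now we can start simplifying (in several steps):

$$\frac{d}{dT}\,\mathrm{SD}\left(\frac{\Delta A}{A}\right) = \frac{\mathrm{SD}(\Delta E)}{E_r}\left\{\frac{T \ln(T)\,\dfrac{2T}{2\sqrt{T^2 + 1}} - \sqrt{T^2 + 1}\left(T\,\dfrac{1}{T} + \ln(T)\right)}{\big(T \ln(T)\big)^2}\right\} \tag{42-46}$$

$$\frac{d}{dT}\,\mathrm{SD}\left(\frac{\Delta A}{A}\right) = \frac{\mathrm{SD}(\Delta E)}{E_r}\left\{\frac{T^2 \ln(T) - \left(\sqrt{T^2 + 1}\right)^2 \big(1 + \ln(T)\big)}{\sqrt{T^2 + 1}\,\big(T \ln(T)\big)^2}\right\} \tag{42-47}$$

$$\frac{d}{dT}\,\mathrm{SD}\left(\frac{\Delta A}{A}\right) = \frac{\mathrm{SD}(\Delta E)}{E_r}\,\frac{T^2 \ln(T) - (T^2 + 1)\big(1 + \ln(T)\big)}{\sqrt{T^2 + 1}\,\big(T \ln(T)\big)^2} \tag{42-48}$$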
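A derivative obtained through this many steps is worth checking numerically. The Python sketch below compares equation 42-48 with a central-difference estimate of the slope of equation 42-41 at a few transmittance values. The noise-to-signal ratio of 0.01, the step size and the test points are assumptions, and both functions follow the sign convention of equation 42-41 as written (so the values of rel_sd are negative, since ln(T) < 0).

```python
import math

def rel_sd(t, nsr=0.01):
    """Equation 42-41; nsr stands for SD(dE)/Er."""
    return nsr * math.sqrt(t**2 + 1.0) / (t * math.log(t))

def rel_sd_derivative(t, nsr=0.01):
    """Equation 42-48."""
    ln_t = math.log(t)
    numerator = t**2 * ln_t - (t**2 + 1.0) * (1.0 + ln_t)
    return nsr * numerator / (math.sqrt(t**2 + 1.0) * (t * ln_t) ** 2)

h = 1e-6   # step for the central difference
for t in (0.1, 0.33, 0.6, 0.9):
    numeric = (rel_sd(t + h) - rel_sd(t - h)) / (2.0 * h)
    print(t, rel_sd_derivative(t), numeric)
```

The analytic and numerical columns should agree closely, and the derivative should be near zero around T = 0.33.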

For comparison, we note that the corresponding equation from the conventional formulation is

$$\frac{d}{dT}\,\mathrm{SD}\left(\frac{\Delta A}{A}\right) = \frac{\mathrm{SD}(\Delta E)}{E_r}\,\frac{1 + \ln(T)}{\big(T \ln(T)\big)^2} \tag{42-49}$$

Now we set the derivative in equation 42-48 equal to zero and obtain

$$T^2 \ln(T) - (T^2 + 1)\big(1 + \ln(T)\big) = 0 \tag{42-50}$$

This is a transcendental equation, which is not easily solved by ordinary methods. Nowadays, however, computers make the solution of such equations by successive approximations easy. In this case, again using EXCEL™, we find that the value of T that makes the left-hand side of equation 42-50 become zero, which thus gives the transmittance corresponding to minimum relative error, is 0.32994, rather than the previously accepted value of 0.368.

By now you probably think we are done. Not by a long shot! There is considerably more to learn about the effect of noise of a spectrum when the detector noise is constant, some of which is even more surprising than what we have seen until now. More to come in the next chapters. Stay tuned.
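The successive-approximation solution does not require a spreadsheet. As one possible sketch (plain bisection, which is our choice and not necessarily the method used for the value quoted above), the following Python code solves equation 42-50 and, for comparison, the conventional condition 1 + ln(T) = 0 that follows from equation 42-49. The search interval of 0.05 to 0.95 is an assumption chosen so that each function changes sign within it.

```python
import math

def exact_condition(t):
    """Left-hand side of equation 42-50."""
    ln_t = math.log(t)
    return t**2 * ln_t - (t**2 + 1.0) * (1.0 + ln_t)

def conventional_condition(t):
    """Numerator of equation 42-49."""
    return 1.0 + math.log(t)

def bisect(f, lo, hi, tol=1e-10):
    """Plain bisection; assumes f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_lo < 0.0) == (f_mid < 0.0):
            lo, f_lo = mid, f_mid   # root lies in the upper half
        else:
            hi = mid                # root lies in the lower half
    return 0.5 * (lo + hi)

t_exact = bisect(exact_condition, 0.05, 0.95)
t_conventional = bisect(conventional_condition, 0.05, 0.95)
print(round(t_exact, 5), round(t_conventional, 5))
```

The two roots should land close to the 0.32994 and 0.368 values quoted in the text. Expanding equation 42-50 shows that it is equivalent to ln(T) = −(1 + T²), which makes the shift away from the conventional ln(T) = −1 easy to see.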


REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 15(10), 24–25 (2000).
2. Mark, H. and Workman, J., Spectroscopy 15(11), 20–23 (2000).
3. Ingle, J.D. and Crouch, S.R., Spectrochemical Analysis (Prentice-Hall, Upper Saddle River, NJ, 1988).
4. Strobel, H.A., Chemical Instrumentation – A Systematic Approach to Instrumental Analysis (Addison-Wesley Publishing Co., Reading, MA, 1960).

43

Analysis of Noise: Part 4

This chapter is the continuation of a set [1–3] dealing with the rigorous derivation of the expressions relating the effect of instrument noise to their contributions to the spectra we observe. Our first chapter in this set was an overview; since then we have been analyzing the effect of noise on spectra when the noise is constant detector noise, that is noise that is independent of the strength of the optical signal. Inasmuch as we are dealing with a continuous set of equations, we again continue our discussion by continuing our equation numbering, figure numbering, use of symbols, and so on as though there were no break.

We left off in Chapter 42, based on the original publication [3], with determining the sample transmittance corresponding to the best relative precision of a spectral measurement; we then noted that there is more to learn about noise effects on quantitative spectroscopic analysis. “What more is there?” you might ask. Well, in the previous chapters, we learned that the transmittance level affects the noise. In this chapter we will learn that the noise can also affect the transmittance. To see how, let us go to equation 41-14 (reference [2]), which we reproduce here, and note the discussion that followed it (which we won’t reproduce here: the reader may go back to the original and reread it):

$$\mathrm{Var}(\Delta T) = \mathrm{Var}\left(\frac{E_r \Delta E_s}{E_r (E_r + \Delta E_r)}\right) + \mathrm{Var}\left(\frac{-E_s \Delta E_r}{E_r (E_r + \Delta E_r)}\right) \tag{41-14}$$

Basically, the development of the mathematical derivations from Chapter 41 (equations 41-15 onward) was based on the assumption that in equation 41-14, $\Delta E_r$ was small compared to $E_r$ so that it could be ignored; this was done for several reasons, one being that it allowed considerable simplification of the equations, which was pedagogically useful. More significant and fundamental is that it represents a limiting case of the situation. But suppose the noise is not small enough to be ignored, that is, $\Delta E_r$ is not small compared to $E_r$? Then we cannot ignore it, or its effect on the equations. As we might expect, it also complicates the analysis of the situation enormously. We mentioned at that time that we would discuss that situation in due course, and the time has now come to do so.

Normally, mathematical derivations are done the other way round: the full equations are developed first and then the special cases are described and their effect on the equations worked through. But we chose to do it “backwards”, so to speak, because we felt it is more pedagogically effective that way, and it allows our readers to follow along with us in the simpler situations before becoming immersed in the full complexities of the equations.


Besides, that is the way we like to do things.

As we will see, there are significant consequences of non-negligible noise. To start our discussion we will go back even farther than equation 41-14 and begin with equation 41-5 (reference [2]), which we again reproduce here:

$$T + \Delta T = \frac{E_s + \Delta E_s}{E_r + \Delta E_r} \tag{41-5}$$

This can be separated into two terms:

$$T + \Delta T = \frac{E_s}{E_r + \Delta E_r} + \frac{\Delta E_s}{E_r + \Delta E_r} \tag{43-51}$$

so that now we can equate corresponding terms on the two sides of the equation:

$$T_M = \frac{E_s}{E_r + \Delta E_r} \tag{43-52a}$$

$$\Delta T = \frac{\Delta E_s}{E_r + \Delta E_r} \tag{43-52b}$$

where $T_M$ represents the measured transmittance value of a reading subjected to noise. We now see that equation 43-52a represents the computed transmittance of the reading, and equation 43-52b represents the deviation of that transmittance due to noise. We will address the possibility of a contribution of equation 43-52b to the computed value of $T_M$ a bit later in this chapter. We will also occasionally use the term $f(\Delta E_r)$ to refer to the expression on the right-hand side of equation 43-52a.

Upon averaging several values of equation 43-52a, the fact that the noise is in the denominator causes the average value of the effect of the noise not to approach zero, and therefore averaging several values of $T$ will result in a computed value different from the actual value of $E_s/E_r$. This is because division is a nonlinear arithmetic operation. To illustrate this effect, we will use a numerical example, and consider two readings of $T$ with values of $\Delta E_r$ equal to +0.2 and −0.2 times $E_r$ (remember that we are dealing specifically with the case where the noise is not negligibly small compared to the signal); this will make the “noise” symmetrical around $E_r$. Then, the general formula for the average value of $T$ computed will be

$$\bar{T} = \frac{1}{n}\sum_{i=1}^{n}\frac{E_s}{E_r + \Delta E_i} \tag{43-53}$$

and for two readings as we described this becomes:

$$\bar{T} = \frac{1}{2}\left(\frac{E_s}{E_r + \alpha E_r} + \frac{E_s}{E_r - \alpha E_r}\right) \tag{43-54}$$

where $\alpha$ represents the fractional change of the measurement. For our example, the specific value of $\alpha$ = 0.2:

$$\bar{T} = \frac{1}{2}\left(\frac{E_s}{E_r + 0.2 \times E_r} + \frac{E_s}{E_r - 0.2 \times E_r}\right) \tag{43-55}$$

$$\bar{T} = \frac{E_s}{2E_r}\left(\frac{1}{1 + 0.2} + \frac{1}{1 - 0.2}\right) \tag{43-56}$$

$$\bar{T} = \frac{E_s}{2E_r}\left(0.833333333 + 1.25\right) \tag{43-57}$$

$$\bar{T} = 1.0416666\,\frac{E_s}{E_r} \tag{43-58}$$
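The arithmetic of equations 43-54 through 43-58 is easy to check numerically. The short sketch below evaluates the two-reading average for several values of the fractional change and compares it with the closed form 1/(1 − α²), which follows from combining the two fractions in equation 43-54. The energies used are arbitrary illustrative values and the function name is ours.

```python
def mean_of_two_ratios(es, er, alpha):
    """Equation 43-54: average T from reference readings er*(1 + alpha) and er*(1 - alpha)."""
    return 0.5 * (es / (er + alpha * er) + es / (er - alpha * er))

es, er = 0.5, 1.0  # hypothetical noise-free energies, so the true T is 0.5
for alpha in (0.05, 0.2, 0.5, 0.9):
    relative = mean_of_two_ratios(es, er, alpha) / (es / er)
    closed_form = 1.0 / (1.0 - alpha ** 2)
    print(f"alpha={alpha:4.2f}  relative T={relative:.7f}  1/(1-alpha^2)={closed_form:.7f}")
```

For α = 0.2 the relative value is 1.0416667, in agreement with equation 43-58, and the closed form makes the divergence as α approaches 1 explicit.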

Thus we see that, even though the noise values of the reference readings are equally distributed around their mean value of $E_r$, their effect on the computed value of transmittance is not symmetrically distributed, due to the nonlinearity of the division process, resulting in a change of the computed value from the (in this case, known) true value. Now, smaller variations will show smaller effects and larger variations will show larger effects (i.e., change the computed value of $T$ less or more than the amount shown). The relative effect, for this somewhat artificial case, is shown in Figure 43-4. As the noise becomes larger and larger compared to the value of the reference signal, the second term in equation 43-54 becomes more and more dominant. Therefore we cannot allow the “noise” to equal the signal value, since in that case the denominator of the second term would become zero and the “average” value of $T$ would be infinite. This concern will occur again as we continue discussing this situation.

One obvious consequence of this is that if data is to be coadded, the coaddition of sample and reference signals should be done individually, before the computation of $T$, rather than computing $T$ for each reading and then averaging the several values of $T$ together.

An interesting side note here: in the real world there is nothing to prevent the noise from becoming greater than the signal (except for the alertness of the spectroscopist doing the work!), thus it is entirely possible for the measured value of the reference reading to become arbitrarily small and the computed value of $T$ to become arbitrarily large. Presumably, any spectroscopist will recognize that such data have no meaning. However, we here find ourselves in the quite unusual situation of allowing the mathematics to limit the extent of our analysis, rather than “real world” considerations, just the reverse of what usually happens.
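The coaddition recommendation above can be illustrated with a small Monte-Carlo sketch. It assumes Normally distributed detector noise of the same standard deviation on the sample and reference readings; the particular energies, noise level, number of readings, and seed are arbitrary choices of ours, picked so that the noise is 10% of the reference energy and a zero denominator is effectively impossible.

```python
import random

def simulate(es=0.5, er=1.0, sigma=0.1, n_readings=100_000, seed=12345):
    """Compare averaging T values with ratioing coadded energies.

    All parameter values are hypothetical; sigma is the constant detector
    noise applied independently to the sample and reference readings.
    """
    rng = random.Random(seed)
    sum_ratio = 0.0
    sum_es = 0.0
    sum_er = 0.0
    for _ in range(n_readings):
        es_i = es + rng.gauss(0.0, sigma)
        er_i = er + rng.gauss(0.0, sigma)
        sum_ratio += es_i / er_i   # T computed for each reading, averaged later
        sum_es += es_i             # energies coadded first
        sum_er += er_i
    return sum_ratio / n_readings, sum_es / sum_er

mean_of_ratios, ratio_of_means = simulate()
print(f"average of individual T values: {mean_of_ratios:.4f}")
print(f"T from coadded energies:        {ratio_of_means:.4f}")
```

With these settings the average of the individual $T$ values comes out roughly 1% above the true value of 0.5, while the ratio of the coadded energies stays close to 0.5.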
It is even possible for an individual noise pulse to exceed $-E_r$, so that a negative reading of $E_r$ will be obtained. This happens in the real world and therefore must be taken into account in the mathematical description. This is a good place to also note that since the transmittance of a physical sample must be between zero and unity, $E_s$ must be no greater than $E_r$, and therefore when $E_r$ is small an individual reading of $E_s$ can also be negative. Therefore it is entirely possible for an individual computed value of $T$ to be negative.

Now, while we are concerned in these chapters with a thorough analysis of these effects, in practice this is usually not too serious a problem, for several reasons. The first reason is that if the data is noisy and needs to be coadded to reduce the noise level, the coadding is normally done in accordance with our recommendation above: before the computation of $T$. Therefore the error of the values of $E_s$ and $E_r$ is reduced before the computation of $T$ is performed, thus keeping it out of the regime where the nonlinear effect becomes important.

The second reason is that under normal measurement conditions, the only place where such a high N/S ratio is liable to occur will be at the ends of the spectral range, where $E_r$ is becoming very small. Here, however, the effect will be masked by other effects contained in the data, such as the effect of small changes in source intensity, external interference or, in the case of FTIR, interferometer misalignment, or any of several other effects that change the actual values of reference and sample energy at the limits of the spectral range. On the other hand, if the measurement situation is such that the reference energy is small and cannot be increased (e.g., outdoor open-air monitoring, or insufficient time available for coaddition of data), so that the noise level is an appreciable fraction of the reference signal, then this phenomenon can become important.

Figure 43-4 The relative change in computed value of $T$ from equation 43-53 for various values of $\alpha$. (Two panels, “Relative computed transmittance” and “Expansion of plot”; horizontal axis $\alpha$, vertical axis “Relative increase”.)

Now it is time to examine the effect of a more realistic type of noise than we have been considering so far. In a real situation, of course, where many readings may be averaged together, some will contain small errors and some will contain large errors, each one making its nonlinear contribution to the value of $T$. Obviously, only one average value of $T$ will be computed from the data. The net effect on the value of $T$ computed from many readings will thus depend not only on the standard deviation of $\Delta E_r$ compared to $E_r$, but also on how many readings there are at each value of $\Delta E_r$, that is, on the distribution of the values of $\Delta E_r$. Statisticians call this average of many values of a quantity the “expected value” of that quantity. For many reasons that we have discussed previously [4], the Normal distribution is the one that inevitably occurs in nature when there is no overriding factor to change it; therefore it is the one we consider.

How do we determine the effect of using the Normal distribution? Basically, what we want to do is find the average value for many readings when we know how often each reading occurs; after all, that is the meaning of a distribution. This would be the expected value. If we had discrete readings, we would let $W_i$ represent the weight of the $i$th value, that is, how often that value occurs, and $X_i$ represent the value, then use the formula for a weighted average:

$$\bar{X}_W = \frac{\sum_i W_i\, f(X_i)}{\sum_i W_i} \tag{43-59}$$

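The Normal distribution, however, is a continuous distribution, as is the distribution of values of $E_s/(E_r + \Delta E_r)$; therefore we have to change the summations to integrations:

$$\bar{X}_W = \frac{\int W(x)\, f(x)\, dx}{\int W(x)\, dx} \tag{43-60}$$

and, since in this case $W(x)$ represents the Normal distribution weighting:

$$\frac{1}{\sigma (2\pi)^{1/2}}\; e^{-\frac{1}{2}\left(\frac{\Delta E_r - \overline{\Delta E_r}}{\sigma}\right)^2}$$

which specifies the relative weights of the different values, we replace $W(x)$ with the expression for the Normal distribution, and $f(x)$ is the function $E_s/(E_r + \Delta E_r)$, so that equation 43-60 becomes

$$\bar{T}_{WN} = \frac{\dfrac{1}{\sigma (2\pi)^{1/2}} \displaystyle\int_{-\infty}^{\infty} \frac{E_s}{E_r + \Delta E_r}\; e^{-\frac{1}{2}\left(\frac{\Delta E_r - \overline{\Delta E_r}}{\sigma}\right)^2} d\Delta E_r}{\dfrac{1}{\sigma (2\pi)^{1/2}} \displaystyle\int_{-\infty}^{\infty} e^{-\frac{1}{2}\left(\frac{\Delta E_r - \overline{\Delta E_r}}{\sigma}\right)^2} d\Delta E_r} \tag{43-61}$$

where $\sigma$ is the standard deviation of the variations of the energy readings and $\bar{T}_{WN}$ represents the mean computed transmittance for Normally distributed detector noise. Since the normalization factor in front of the integral representing the Normal distribution in the denominator is intended to make the final value of the integrated Normal distribution be unity, the denominator of equation 43-61 is therefore unity, hence:

$$\bar{T}_{WN} = \frac{1}{\sigma (2\pi)^{1/2}} \int_{-\infty}^{\infty} \frac{E_s}{E_r + \Delta E_r}\; e^{-\frac{1}{2}\left(\frac{\Delta E_r - \overline{\Delta E_r}}{\sigma}\right)^2} d\Delta E_r \tag{43-62}$$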

A plot presenting the two parts of the integrand, and their product, is shown in Figure 43-5.

Figure 43-5 The Normal curve, the function $f(\Delta E_r)$ [= $E_r/(E_r + \Delta E_r)$] from equation 43-62, and their product. (Two panels, “Integration terms” and “Expansion of integral functions”; horizontal axis $\Delta E_r$.) (see Color Plate 4)

We made an attempt to perform the integration analytically, which failed. While that approach may still be possible, it does not seem likely. The difficulty arises from two sources. One is the general difficulty of integrating the Normal distribution (whose integral is sometimes called the Error Function, for obvious reasons). The other is that the Normal distribution is infinite in extent, and therefore, regardless of the value of $E_r$ or of the standard deviation being represented by the particular Normal distribution in use, there will inevitably be a point at which the term $E_s/(E_r + \Delta E_r)$ in equation 43-62 attains a value of infinity (when $\Delta E_r = -E_r$). While this in itself does not automatically preclude performing the integration, or prevent the integral from having a finite value, it points to a problem area, one which indicates that if the integral can be evaluated at all, it will require special methods, as the evaluation of the error function itself does.

Now in fact, all this is also in accord with reality: an attempt to use data in which the reference energy becomes so small that the noise brings even a single reading down to zero will cause the computed value corresponding to that reading to become infinite; then, averaging that with any finite number of other finite values will still result in an infinite value for the computed value of $T$. This is, of course, catastrophic to our attempt to deal with this situation analytically.

Another point to note: if we look at equation 43-62 critically, we note that the variables are not completely separable. While we can remove $E_s$ from inside the integration, $E_r$ is not so easily removed. How, then, can we determine the effect on the computed value of $T$? One way is to multiply the right-hand side of equation 43-62 by unity, in the form of $E_r/E_r$; this leads to

$$\bar{T}_{WN} = \frac{E_s}{E_r}\;\frac{1}{\sigma (2\pi)^{1/2}} \int_{-\infty}^{\infty} \frac{E_r}{E_r + \Delta E_r}\; e^{-\frac{1}{2}\left(\frac{\Delta E_r - \overline{\Delta E_r}}{\sigma}\right)^2} d\Delta E_r \tag{43-63}$$

which now puts the expression into the form of the ratio of the measured values of $E_s$ and $E_r$, with a multiplier. It also, perhaps, makes what is going on somewhat clearer: in the limit of small values of $\Delta E_r$ the base expression reduces to $E_r/E_r$, which is unity; the integral then reduces to the ordinary Normal distribution, which, as we noted, also evaluates to unity, so that in the limit of small levels of noise, $T$ becomes $E_s/E_r$, as it should.

However, we still have that pesky $E_r$ inside the integral. As we might expect, the effect of the noise, $\Delta E_r$, is really going to be affected by its relationship to $E_r$, the signal strength. The overall noise value is contained in the exponent of the Normal distribution weighting factor, but its presence in the first part of the integral indicates that it has more than just that effect. Thus, if we try to determine the effect of changing the signal-to-noise ratio, at constant noise level, by changing $E_r$, we must realize that $E_r$ then becomes a parameter affecting the value of the integral. Therefore, in order to represent the effect of varying the signal-to-noise ratio in this regime, we will require a family of curves rather than just a single one.

Since we have seen that the integral cannot be evaluated analytically, there are several alternatives to analytic integration of equation 43-63: we can perform the integration numerically, we can investigate the behavior of equation 43-63 using a Monte-Carlo simulation, or we can expand equation 43-63 into a power series. In all cases we need to take at least a brief look at what happens when $\Delta E_r$ is close to the asymptote at $-E_r$; basically, the integrand goes off to +infinity when approaching from above (as we saw), and to −infinity when approaching from below. If we do not try to compute values when we are too close to $-E_r$, therefore, with any of these approaches there will be a tendency toward cancellation of the positive and negative terms, leaving a finite result.
In the case of a power series expansion, the closer we come to unity, the more terms we would need to include in the series.

We now report on the evaluation of the integral in equation 43-63, which was done numerically by computer. The numerical computations were carried out using MATLAB. Here we examine the conditions and the results obtained for this exercise. Before attempting to evaluate the integral, we first tested for convergence, that is, that the integral is finite, and also that when evaluating it we are using a sufficiently fine interval of integration to provide accurate results. To do this, we evaluated the integral for a small region around the point where the measured reference reading, $E_r + \Delta E_r$, equals 0, using different values of the integration interval. The integration range was −0.01 to +0.01. Integration intervals ranged from $10^{-2}$ to $10^{-7}$. The standard deviation of the Normal distribution was set to unity (note that we will eventually investigate the behavior of equation 43-63 for various values of the standard deviation, so that at this point setting it equal to unity is convenient for pedagogical purposes, and also for a quick “ballpark” evaluation of equation 43-63 for other values of this parameter), and the mean of the Normal distribution was also set equal to unity. Since the section of the Normal distribution that is 1 standard deviation away from the mean is the region that has the maximum slope, these conditions gave the maximum weight to the region around the infinity of $f(\Delta E_r)$; thus if the integral did not diverge here it would not diverge at any other point of the Normal distribution. The results are in Table 43-1.

Table 43-1 Values of the integral between −0.01 and 0.01, for various values of the integration interval

Integration interval    Value of integral
$10^{-2}$               0.012130208846544832
$10^{-3}$               0.012130457519586295
$10^{-4}$               0.012130476382397228
$10^{-5}$               0.012130478208633820
$10^{-6}$               0.012130478390650151
$10^{-7}$               0.012130478408845785

Since the value of the integration interval also determines how close to the point of infinity any contribution may be, presumably, if the integral were to diverge, what we would see around the point of infinity would be contributions to the integral increasing faster and faster as the computation included points closer to the infinity. Under those circumstances, we would observe an increasing value of the integral as we used finer and finer intervals of integration. What we see in Table 43-1, on the other hand, is that as we use smaller intervals of integration, more digits of the integral remain stable; thus we conclude that the integral does indeed seem to be converging on a finite value. We also observe that using an integration interval of $10^{-4}$ provides precision on the order of one part in $10^7$, which is more than sufficient for our purposes.

For the full evaluation, the range of integration was set to be wide enough (10 standard deviations) that, at the number of iterations we used, there is no further appreciable contribution to the integral from values beyond that range; the value of the Normal distribution at 10 standard deviations is approximately $2 \times 10^{-22}$. The integral is computed for various values of $E_r$, each set of such integrals forming one curve that we will plot. The family of curves is generated by using various values of sigma ($\sigma$, the standard deviation of the readings due to detector noise).
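The convergence test described above can be sketched as follows. This is not the authors' MATLAB code; it is a minimal midpoint-rule illustration in Python. Two assumptions are ours: the Normal weight is left unnormalized (no $1/(\sigma(2\pi)^{1/2})$ factor), which gives values of the same magnitude as those in Table 43-1, and the grid is placed symmetrically so that the pole at a reading of zero is never sampled. Because the exact grid used for the table is not stated, individual digits need not match the table.

```python
import math

def integrand(x, mean=1.0, sigma=1.0):
    """1/x times an unnormalized Normal weight centered at `mean` (assumed form)."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / x

def integral_around_pole(step, half_range=0.01):
    """Midpoint-rule sum over [-half_range, +half_range].

    The midpoints straddle x = 0, so the pole itself is never evaluated and
    the large positive and negative terms cancel in pairs.
    """
    n = round(2 * half_range / step)
    total = 0.0
    for k in range(n):
        x = -half_range + (k + 0.5) * step
        total += integrand(x) * step
    return total

for exponent in range(2, 6):
    step = 10.0 ** -exponent
    print(f"step=1e-{exponent}  integral={integral_around_pole(step):.12f}")
```

As the step shrinks, the leading digits stabilize near 0.0121305 rather than growing without bound, which is the behavior the text uses as evidence of convergence.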
For our demonstration, we compute the curve of multiplication factor versus $E_r$ for values of sigma of 0.1, 0.2, …, 1.0. The point at $\Delta E_r = -E_r$, with the infinite value, was deleted from the set before adding the terms of the integral. Since we are using the Normal distribution, we take this opportunity to point out some of the other characteristics of the error, in particular the fact that the errors have a mean value of zero.

The multiplication factor according to the integral of equation 43-63 was computed, and the family of curves is presented in Figure 43-6. Interestingly, while the values of individual computations of the multiplication factor for a finite number of discrete points can reach large values, as we saw above, we find that the expected value of the multiplication factor reaches a maximum value at a modest level, and then approaches zero as $E_r$ approaches zero. The explanation is that at large values of the reference signal strength, $E_r$, where the noise becomes small compared to the signal, the multiplication factor approaches unity, so that the computed value of $\bar{T}_W$ approaches $E_s/E_r$, as we would expect. As the reference signal strength decreases so that it becomes comparable to the noise level, occasional individual data points will be measured in the regime where the nonlinearity of the division process becomes important; this nonlinearity then causes the computed value of $T$ to be higher than the value computed under strong-signal (i.e., low-noise) conditions.

Figure 43-6 Family of curves of multiplication factor as a function of $E_r$, for different values of the parameter sigma (the noise standard deviation), for Normally distributed error. Values of sigma range from 0.1 to 1.0 for the ten curves shown. (Plot title “Multiplication factor for T as a function of $E_r$”; horizontal axis $E_r$, vertical axis “Multiplication factor”.) (see Color Plate 5)

When $E_r$ approaches zero, however, the Normal curve then allows occasional negative values to be included in the integral, more and more often as the reference signal strength decreases further. In reality, noise can indeed cause an apparent negative value of $E_r$, which would result in a negative value for the computed quantity $T$, even though it is a mathematical artifact and cannot correspond to an actual negative value for the physical property, $T$. In the limit of the reference signal strength approaching zero, there will be equal contributions of negative and positive excursions from zero, so that the average value will be zero. Since the sample signal strength must be less than the reference signal strength, the same thing is happening to $E_s$, the sample signal, so that in fact the computation would assume the undefined form of 0/0. Examining Figure 43-6, however, shows that the limiting value of $T$ as $E_r$ approaches zero is also zero.

The family of curves obtained, and presented in Figure 43-6, shows that, not surprisingly, the controlling parameter of the family of curves is the standard deviation of the noise; the maximum value of the multiplication factor occurs at a fixed multiple of the standard deviation of the energy readings. Successive approximations show that the maximum multiplier of approximately 1.28 occurs when $E_r$ is approximately 2.11 times sigma, the standard deviation of $\Delta E_r$.

Some miscellaneous questions arise, which we address here.

First of all, since the value of a reading can become infinite, why is the integral finite and well-behaved? The answer is that while a single reading can indeed become large beyond all bounds as $\Delta E_r$ approaches $-E_r$, the probability of obtaining a value closer and closer to exactly $-E_r$ becomes smaller and smaller, and the probability of being exactly $-E_r$ is exactly zero; therefore in reality an infinity will not occur. Hence the integral, representing the average of what will actually occur, remains finite. There are other factors, also. One factor is that, as we consider two values of $\Delta E_r$ at equal magnitude and opposite directions from the asymptote at $-E_r$, we realize that as the two values get closer to $-E_r$ there is less room for the nonlinearities to act; therefore the magnitudes of the two values of $f(\Delta E_r)$ become more and more nearly the same, and since they have opposite sign they cancel each other more and more exactly.

Secondly, why do the curves pass through a maximum and then go to zero as $E_r$ approaches −1? If we look at Figure 43-5, and particularly at the expanded plot, we see that the asymmetry of the Normal curve with respect to the function $f(\Delta E_r)$ causes the cross-product of the two curves (which, after all, is what is being integrated) to exhibit a fairly large area, between the peak of the Normal curve and where the curve of $f(\Delta E_r)$ really “takes off”, that has no counterpart in the region where $f(\Delta E_r)$ is negative. This creates a net positive contribution to the integral. As $E_r$ approaches −1, the Normal curve “slides under” $f(\Delta E_r)$, and there is an increasing contribution from the negative portion of $f(\Delta E_r)$, until symmetry assures us that when $E_r = -1$ there is always a negative contribution of $f(\Delta E_r)$ to cancel each positive contribution, so that $\bar{T}_W = 0$ at that point.

Thirdly, when we separated equation 43-51 into two terms, we only worked with the first term. The second term, which we presented in equation 43-52b, was neglected. Is it possible that the nonlinear effects observed for equation 43-52a will also operate on equation 43-52b? The answer is yes, it will, but… And the “but…” is this: $\Delta E_s$ is a random variable, just as $\Delta E_r$ is. Furthermore, it is uncorrelated with $\Delta E_r$.
Therefore, in order to evaluate the integral representing the variation of both $\Delta E_s$ and $\Delta E_r$, it would be necessary to perform a double integration over both variables. Now, for each value of $\Delta E_s$, the nonlinearity caused by the presence of $\Delta E_r$ in the denominator would apply. However, $\Delta E_s$ is symmetrically distributed around zero; therefore for every positive value of $\Delta E_s$ there is an equal but negative value that is subject to exactly the same nonlinear effect. The net result is that these pairs always form equal and opposite contributions to the integral, which therefore cancel, leaving no effect due to $\Delta E_s$.

We have analyzed the effect that noise has on the computed transmittance, just as we previously analyzed the effect that the sample transmittance has on the computed noise value. We can experimentally measure the variation in noise level due to the sample transmittance. On the other hand, we will not be able to realize the effect of noise on the computed transmittance, for reasons we will discover in our next chapter, which will deal with the noise of the transmittance when the energy is low, or the noise is high, so that again we cannot make the “low noise” approximation we made previously.
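The multiplication factor of equation 43-63 discussed in this chapter can be explored with a short numerical sketch. It is not the authors' MATLAB procedure. It assumes zero-mean Normal noise and evaluates the integral as a principal value by pairing reference readings of +x and −x, so that the pole at a reading of zero cancels instead of being deleted from the sum; the function name, step count, and scan grid are our own choices.

```python
import math

def multiplication_factor(er, sigma, n_steps=2000, n_sigmas=10.0):
    """Principal-value estimate of the integral in equation 43-63.

    x is the noisy reference reading (er plus noise). Readings +x and -x are
    paired, so the integrand er * (w(x) - w(-x)) / x stays finite as x -> 0.
    """
    upper = er + n_sigmas * sigma           # truncate 10 sigma above the mean
    step = upper / n_steps
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    total = 0.0
    for k in range(n_steps):
        x = (k + 0.5) * step                # midpoint rule, never exactly zero
        w_plus = math.exp(-0.5 * ((x - er) / sigma) ** 2)
        w_minus = math.exp(-0.5 * ((-x - er) / sigma) ** 2)
        total += er * (w_plus - w_minus) / x * step
    return norm * total

sigma = 1.0
grid = [1.6 + 0.02 * i for i in range(51)]  # scan er from 1.6 to 2.6
best_er = max(grid, key=lambda er: multiplication_factor(er, sigma))
best_value = multiplication_factor(best_er, sigma)
print(f"maximum factor ~ {best_value:.3f} near er/sigma ~ {best_er / sigma:.2f}")
print(f"strong signal (er = 10 sigma): {multiplication_factor(10.0, sigma):.4f}")
print(f"weak signal (er = 0.01 sigma): {multiplication_factor(0.01, sigma):.6f}")
```

Under these assumptions the factor peaks a little above 1.28 when the reference energy is slightly more than twice sigma, tends to unity for strong signals, and falls toward zero as the reference energy vanishes, in line with the behavior described for Figure 43-6. Because the factor depends only on the ratio of the reference energy to sigma, other values of sigma simply rescale the horizontal axis.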

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 15(10), 24–25 (2000).
2. Mark, H. and Workman, J., Spectroscopy 15(11), 20–23 (2000).
3. Mark, H. and Workman, J., Spectroscopy 15(12), 14–17 (2000).
4. Mark, H. and Workman, J., Spectroscopy 3(1), 44–48 (1988).

44

Analysis of Noise: Part 5

This chapter is the continuation of Chapters 40–43, from a set of articles [1–4] dealing with the rigorous derivation of the expressions relating the effect of instrument (and other) noise to its effects on the spectra we observe. Chapter 40 in this set was an overview; since then we have been analyzing the effect of noise on spectra when the noise is constant detector noise, that is, noise that is independent of the strength of the optical signal. Inasmuch as we are dealing with a continuous set of chapters (40 through 53) on the same subject, we continue our discussion by serially numbering our equations, figures, use of symbols, and so on, as though there were no break across these chapters.

It seems we said something wrong. When we first began this series of chapters (starting at 40) dealing with the effects of various kinds of noise on spectra [1, 2], we said that there does not seem to have been any recent attention paid to the question of noise in spectra. It turns out that that is not quite true. Edward Voigtman pointed out that, in fact, he had performed and published computer simulation studies of just this subject [5, 6]. His studies were based on computer simulations of the behavior of various analytical instruments in various situations, using a simulation engine described in an Analytical Chemistry Report [7]. In addition to the simulations of spectrometers, he also published simulations of polarimeters [8, 9], with results that are interesting, if not of direct application to our current study. The diagrams he published [5] clearly show the difference in the optimum absorbance values (i.e., minimum relative absorbance error) between these simulations and the conventional theory in use previously. Unfortunately the noise levels of the simulations were too high to precisely determine the actual minimum.

When Dr. Voigtman contacted us to inform us of these papers, we discussed the results he obtained, and he revealed that, due to the limitations of the computer hardware available at the time the simulations were performed, he could not use more than a few hundred repeats of the Monte-Carlo experiments, resulting in the high noise levels observed. Having seen our early Chapters 40 and 41 dealing with this topic, from the papers first published [1, 2], he reprogrammed his simulation engine to perform new simulations and compared the results with the exact solution we derived (see equation 41-19 [2]); with new hardware allowing use of much more extensive Monte-Carlo calculations, he found excellent agreement (E. Voigtman, 2001, personal communication). We are grateful to Dr. Voigtman for pointing out the previous literature that we had missed, as well as for sharing the results of his new simulations with us.

Now let us recap where we came from in our discussion, in this mini-series-within-a-book, and where we are going. In Chapter 41, referenced in [2], we demonstrated that, because previous treatments of this topic failed to take into account the effect of the noise of the reference reading, they did not come up with the rigorously correct formula to describe the effect of transmittance on the computed value of the noise. The rigorously exact solution to this situation shows that the noise level of a transmittance spectrum increases with the transmittance of the sample, rather than being independent of the sample characteristics, as previously thought.

We then continued the development of those equations in Chapter 42 [3] to show the effect of the random noise on absorbance spectra and on the relative precision, SD(A)/A, in both cases comparing the result of the rigorous treatment of the topic to the previous mathematical analysis, and showing that in both cases the results from the rigorous treatment differ slightly but noticeably from the previous results. Finally we developed and solved the equations for the minimum in the curve of SD(A)/A, this being the generally accepted criterion for determining the best value of transmittance (or absorbance) that a sample should have in order to obtain the most accurate results from this form of spectroscopic chemical analysis. Our conclusion here was that the optimum value of transmittance under these conditions, that is, constant detector noise, is approximately 33 %T rather than the previously accepted 36.8 %T.

We next noted in Chapter 43 [4] that all the results obtained up until that point were relevant only to the condition where the detector noise was small compared to the reference signal, and therefore the S/N ratio was high. We then noted that if that condition did not hold for any particular set of measurements, then other phenomena also come into action. We then pointed out that under low-noise conditions the signal can affect the noise level, but under conditions where the signal is weak or the noise excessive, the noise can affect the computed transmittance as well. The expressions we obtained showed that as the reference signal gets weaker and weaker (or the noise gets larger and larger), the system first reaches a point where the expected value of $T$ is larger than $E_s/E_r$; as the reference signal continues to decrease, the multiplying factor first goes through a maximum and then decreases, so that the expected value of $T$ approaches zero as the reference signal energy, $E_r$, approaches zero.

We are now ready in this chapter to consider the behavior of the noise under conditions where it is not small compared to the signal. We start with the definition of transmittance, as we pointed out previously, and we rewrite the equation here:

$$T = \frac{E_s}{E_r} \tag{44-6}$$

To put equation 44-6 into a usable form under the conditions we wish to consider, we could start from any of several points of view: the statistical approach of Hald (see [10], pp. 115–118), for example, which starts from fundamental probabilistic considerations and also derives confidence intervals (albeit for various special cases only); the mathematical approach (e.g., [11], pp. 550–554); or the Propagation of Uncertainties approach of Ingle and Crouch ([12], p. 548). Inasmuch as any of these starting points will arrive at the same result when done properly, the choice of how to attack an equation such as equation 44-6 is a matter of familiarity, simplicity and, to some extent, taste.

At this point, however, we again need to take cognizance of comments we received after the material of this chapter was published as a column. One of our respondents noted that the analysis performed could be done in a different way, a way which might be superior to the way we did it. Normally, if we agree with someone who takes issue with our work we would simply publish a correction or, when rewriting the material for this book, use the corrected form (as we have done in various places). In this case, however, that seems inappropriate, for several reasons. First, we are not convinced that our original approach is “wrong”, therefore we wish to retain it. Secondly, some of our readers may wish to refresh themselves about our original material. Thirdly, some of our readers may wish to compare the two approaches for themselves, to decide if the original one is “wrong” or simply “not as good”, or whether, in fact, the new analysis is better. Therefore we present, at this point, the original analysis of the situation, the same way it was presented in the original column except, perhaps, for some minor enhancements in the wording to improve the comprehensibility. Later on in this chapter, under the heading “Alternate Analysis”, we present the new analysis, as recommended.

Therefore, continuing as we originally did, we note that we, being chemists and spectroscopists, and writing for spectroscopists, will use the Propagation of Uncertainties approach of Ingle and Crouch:

$$dF(C, D) = \frac{\partial f(C, D)}{\partial C}\,dC + \frac{\partial f(C, D)}{\partial D}\,dD \tag{44-64}$$

Note that we use the letters C, D to represent the variables in equation 44-64 to avoid confusion with our usage of A to mean absorbance. Applying this to equation 44-6:

$$\Delta T = \frac{\partial (E_s/E_r)}{\partial E_s}\,\Delta E_s + \frac{\partial (E_s/E_r)}{\partial E_r}\,\Delta E_r \tag{44-65}$$

$$\Delta T = \frac{\Delta E_s}{E_r} + \frac{-E_s\,\Delta E_r}{E_r^2} \tag{44-66}$$

As usual, we take the variance of this:

$$\mathrm{Var}(\Delta T) = \mathrm{Var}\left(\frac{\Delta E_s}{E_r} + \frac{-E_s\,\Delta E_r}{E_r^2}\right) \tag{44-67}$$

and apply, first, the theorem that Var(A + B) = Var(A) + Var(B):

$$\mathrm{Var}(\Delta T) = \mathrm{Var}\left(\frac{\Delta E_s}{E_r}\right) + \mathrm{Var}\left(\frac{-E_s\,\Delta E_r}{E_r^2}\right) \tag{44-68}$$

and then the theorem that Var(aX) = a² Var(X):

$$\mathrm{Var}(\Delta T) = \frac{1}{E_r^2}\,\mathrm{Var}(\Delta E_s) + \left(\frac{-E_s}{E_r^2}\right)^2 \mathrm{Var}(\Delta E_r) \tag{44-69}$$

and continue as before by setting ΔEs = ΔEr = ΔE:

$$\mathrm{Var}(\Delta T) = \left(\frac{1}{E_r^2} + \frac{E_s^2}{E_r^4}\right) \mathrm{Var}(\Delta E) \tag{44-70}$$

and finally take square roots to obtain:

$$\mathrm{SD}(\Delta T) = \sqrt{\frac{1}{E_r^2} + \frac{E_s^2}{E_r^4}}\;\mathrm{SD}(\Delta E) \tag{44-71}$$
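As a small numerical illustration of equation 44-71, the following Python sketch evaluates SD(ΔT) for chosen values of Er, Es and the detector-noise standard deviation. The function name and the example values are ours, picked only for illustration; the formula is the first-order (small-noise) propagation result derived above, not the full high-noise behavior.

```python
import math

def sd_transmittance(e_r, e_s, sd_e=1.0):
    """First-order SD of the transmittance noise, per equation 44-71.

    e_r, e_s : reference and sample energies (same units as sd_e)
    sd_e     : standard deviation of the detector noise, assumed to be
               the same for the sample and reference readings
    """
    return math.sqrt(1.0 / e_r**2 + e_s**2 / e_r**4) * sd_e

if __name__ == "__main__":
    # Illustrative values only: unit detector noise and a 50 %T sample.
    for e_r in (100.0, 10.0, 5.0):
        e_s = 0.5 * e_r
        print(f"Er = {e_r:6.1f}  T = 0.5  SD(dT) = {sd_transmittance(e_r, e_s):.5f}")
```

Because Es = T·Er, the same quantity can also be written as √(1 + T²)·SD(ΔE)/Er, which makes the inverse dependence on the reference energy explicit.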


Equation 44-71 is clearly a function of both Er and Es; in the regime we are concerned with in this chapter, however, as Er approaches 0, the second term under the radical dominates the expression, although clearly the point at which its numerical value becomes large compared to 1/Er² will depend on the value of Es as well or, equivalently, on the transmittance of the sample. Here again, therefore, the behavior of the noise of the transmittance must be expressed as a family of curves. Figures 44-7 and 44-8 present the behavior of this family of curves.

Note that equation 44-71 can be reduced to equation 41-19 [2], which is appropriate when the signal-to-noise ratio is high and may be considered constant. Under these conditions Er is large, the second term under the radical is small, and the first term under the radical, which is independent of Es, dominates; then the noise of the transmittance increases with T as √(1 + T²) and inversely with the reference energy. Here, however, under low-signal/high-noise conditions, where the variation of Er cannot be ignored and therefore the S/N ratio varies, we must use the full expression of equation 44-71. Note further that when Er is small enough that, as we noted above, the second term under the radical dominates, then

$$\mathrm{SD}(\Delta T) = \sqrt{\frac{T^2}{E_r^2}}\;\mathrm{SD}(\Delta E) = \frac{T}{E_r}\,\mathrm{SD}(\Delta E) \tag{44-72}$$

The noise of the transmittance thus becomes directly proportional to T and inversely proportional to Er. Under these conditions, the noise of the transmittance approaches infinite values as Er approaches zero, even as the expected value of the transmittance approaches zero, as we saw in Chapter 43 [4]. To summarize the effects at low signal-to-noise, for comparison with the high signal-to-noise case summarized above: here the noise of the transmittance increases directly with T and still inversely with the reference energy.

We now wish to follow through, as we did before, on finding the “optimum” value for sample transmittance under these conditions.
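To do this, we start with equation 44-24 (reference [3]):

$$\Delta A = \frac{-0.4343\,E_r}{E_s}\left(\frac{E_r\,\Delta E_s - E_s\,\Delta E_r}{E_r\,(E_r + \Delta E_r)}\right) \tag{44-24}$$

This is the point at which, in the previous development, we considered the effect of letting ΔEr become negligible, but of course in this case we wish to investigate the small-signal/large-noise behavior. We now, therefore, go directly to dividing ΔA by A (from equation 44-20b):

$$\frac{\Delta A}{A} = \frac{\dfrac{-0.4343\,E_r}{E_s}\left(\dfrac{E_r\,\Delta E_s - E_s\,\Delta E_r}{E_r\,(E_r + \Delta E_r)}\right)}{-0.4343\,\ln T} \tag{44-73}$$

$$\frac{\Delta A}{A} = \frac{E_r}{E_s\,E_r \ln T}\left(\frac{E_r\,\Delta E_s - E_s\,\Delta E_r}{E_r + \Delta E_r}\right) \tag{44-74}$$

$$\frac{\Delta A}{A} = \frac{1}{\ln T}\left(\frac{E_r}{E_s}\,\frac{\Delta E_s}{E_r + \Delta E_r} + \frac{-\Delta E_r}{E_r + \Delta E_r}\right) \tag{44-75}$$

$$\frac{\Delta A}{A} = \frac{1}{T \ln T}\,\frac{\Delta E_s}{E_r + \Delta E_r} + \frac{1}{\ln T}\,\frac{-\Delta E_r}{E_r + \Delta E_r} \tag{44-76}$$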

To determine the variance of ΔA/A we perform our usual exercise of taking the variance of both sides of equation 44-76 and applying our two favorite theorems; the result is

$$\mathrm{Var}\left(\frac{\Delta A}{A}\right) = \left(\frac{1}{T \ln T}\right)^2 \mathrm{Var}\left(\frac{\Delta E_s}{E_r + \Delta E_r}\right) + \left(\frac{1}{\ln T}\right)^2 \mathrm{Var}\left(\frac{-\Delta E_r}{E_r + \Delta E_r}\right) \tag{44-77}$$

We cannot simplify this equation further; in particular, we cannot separate out the variances of ΔEs and ΔEr in order to replace them with the same generic value. To determine the variance of ΔA/A, that is, the relative precision (in chemists' terms), we need to evaluate the variances of the two terms in equation 44-77. As we had observed previously, as the value of ΔEr approaches −Er, the values of the expressions become infinite. However, a difference here is that when computing the variance, these values are squared, and hence the computations are always done using positive values. This differs from our previous case, where the presence of both positive and negative values afforded the opportunity for cancellation of near-infinite contributions; we do not have that situation here. Therefore we are faced with the possibility that the variance will be infinite.

An empirical test of this possibility was performed by computing values of the variance of the two terms in equation 44-77. The Normal random number generator of MATLAB was used to create multiple values of Normally distributed random numbers for ΔEr and ΔEs; these were plugged into the two expressions of equation 44-77 and the variance computed. Between 100 and 10⁶ values were used in each computation of the variance. When Er was more than five standard deviations away from the center of the Normal distribution representing ΔEr, the computed variance was fairly small and reasonably stable, and decreased as Er was moved further away from the center of ΔEr. This might be considered an empirical determination of the point of demarcation of the “small-signal” case.
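The empirical test just described can be sketched in Python, used here in place of MATLAB. This is a minimal illustration under our own assumptions, not the authors' original script: the function name, the seeds and the sample sizes are hypothetical, and the standard-library Normal generator stands in for the MATLAB one.

```python
import random
import statistics

def term_variances(e_r, n_values, seed=0):
    """Sample variances of dEs/(Er + dEr) and -dEr/(Er + dEr).

    dEs and dEr are independent Normal deviates with zero mean and unit
    standard deviation, so e_r is also the reference S/N ratio.
    """
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n_values):
        d_es = rng.gauss(0.0, 1.0)
        d_er = rng.gauss(0.0, 1.0)
        denom = e_r + d_er  # values near zero produce the huge outliers
        first.append(d_es / denom)
        second.append(-d_er / denom)
    return statistics.variance(first), statistics.variance(second)

if __name__ == "__main__":
    # Repeat with different seeds to see how stable each estimate is.
    for e_r in (20.0, 5.0, 2.0):
        for seed in (1, 2, 3):
            v1, v2 = term_variances(e_r, 10_000, seed)
            print(f"Er = {e_r:4.1f}  seed = {seed}  {v1:12.4g}  {v2:12.4g}")
```

Running the loop with several seeds gives a quick feel for the effect under discussion: well away from zero the estimates should agree closely from run to run, while for small Er they can be expected to scatter widely.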
When Er was moved below five standard deviations, the computed value of the variance became very unstable; computed values of the variance would differ by as much as four orders of magnitude. The closer Er came to the center of the ΔEr distribution, the more erratic the computed variance became. It was clear that bringing Er close to the center of ΔEr afforded more opportunity for a given reading of the noise to come close to −Er, thus giving a value approaching infinity that would be included in the calculation. Furthermore, for a given relationship between Er and ΔEr, the more readings that were included in the computation, the higher the values of variance that would be calculated. For example, with 100 readings, values of variance might fall between 10¹ and 10⁴, while with 10,000 readings calculated variance values would fall in the range of approximately 10³ to 10⁶. This is attributed to the increased likelihood of more data points being close to −Er, and also of at least a few points being closer to −Er than with fewer data.

Another test of whether the variance actually diverges and becomes infinite is the same as the test we applied in the previous chapter: to integrate the expressions in equation 44-77 in a small region around the point ΔEr = −Er using different intervals of integration and see if the values converge or diverge. Basically, except for a multiplying factor these are both the same expression, so evaluating the expression once suffices to settle the question for both of them. Furthermore, since we are integrating over values of variance, the expression that needs to be integrated is 1/Er². The result of performing this test is presented in Table 44-2. In contrast to the previous test results, the values are clearly growing larger without bound as the integration interval is reduced.

Table 44-2 Value of integral of 1/Er² over range −0.01 to +0.01

Integration interval    Value of integral
10⁻²                    2.0000000000000000e+002
10⁻³                    3.0995354623330845e+003
10⁻⁴                    3.2699678003698089e+004
10⁻⁵                    3.2878691333625099e+005
10⁻⁶                    3.2896681436917488e+006
10⁻⁷                    3.2898481337470137e+007

The conclusion from all this is that the variance, and therefore the standard deviation, attains infinite values when the reference energy is so low that the noisy reference reading can include the value zero. However, in a probabilistic way it is still possible to perform computations in this regime and obtain at least some rough idea of how the various quantities involved will change as the reference energy approaches zero; after all, real data are obtained with a finite number of readings, each of which is finite, and will give some finite answer. What we can do for the rest of this current analysis is perform empirical computations to find out what the expectation for that behavior is; we will do that in the next chapter.
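The divergence test of Table 44-2 can be sketched as follows. The text does not state the integration scheme, so this sketch assumes a simple rectangle-rule sum on a uniform grid that skips the singular point at zero; the function name is ours. Under that assumption the first table entries can be checked by hand: with a step of 10⁻² only the points ±0.01 contribute, giving 2 × 0.01 × 10⁴ = 200.

```python
import math

def grid_sum_inverse_square(half_range, step):
    """Rectangle-rule sum of 1/x**2 over [-half_range, +half_range].

    Grid points are x = k*step for nonzero integer k; the point x = 0,
    where the integrand is undefined, is skipped. The integrand is even,
    so the sum over positive x is simply doubled.
    """
    n = round(half_range / step)
    return 2.0 * step * math.fsum(1.0 / (k * step) ** 2 for k in range(1, n + 1))

if __name__ == "__main__":
    for exponent in range(2, 8):
        step = 10.0 ** -exponent
        print(f"step = 1e-{exponent}  sum = {grid_sum_inverse_square(0.01, step):.16e}")
```

With this scheme the sum grows roughly in inverse proportion to the step size (it tends toward (π²/3)/step), so each tenfold refinement of the grid multiplies the result by about ten instead of settling on a limit.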

ALTERNATE ANALYSIS

Here we present the revised analysis of the effect on the expected noise level of noise that is not small compared to the signal level Er. Before we proceed, however, there is a technical point we need to clear up. This is the numbering of the equations, figures, etc. The previous column/chapter ended with equation 43-63. Therefore it is appropriate to begin the analysis with equation 44-64, as we did above in this chapter, and in the original analysis published in the columns. For obvious reasons, however, we cannot simply repeat the same equation/figure/etc. numbers that we used above. Neither can we simply continue from the last number used in the first analysis, above, because then we would have to renumber all equations, figures, etc., for the rest of this series of chapters. While laborious, that could be done, but it would raise another, insoluble, problem: it would put the numbering of the equations, etc., out of synchronization with the numbering of the original columns. Anybody reading the later chapters and wishing to compare them with the original columns would then find that task well-nigh impossible. Fortunately, none of the equations developed in this chapter, nor the figures, used any suffix, as was occasionally done in previous chapters. (We do refer to equation 44-20b above, but that equation is from a previous chapter and we will not repeat the use of equation 44-20 here. We will also copy equation 44-52b from the previous column, but the b suffix does not signify a new equation, since it is the equation used previously; also, a b suffix is not indicative of a copy of an equation number in this section.) Therefore, we can distinguish the numbering of any equations or other numbered entities in this section by appending the suffix “a” to the number, without causing confusion with other corresponding entities. Now we are ready to proceed.

We reached this point from the discussion just prior to equation 44-64, and there we noted that a reader of the original column felt that equation 44-64 was being incorrectly used. Equation 44-64, of course, is a fundamental equation of elementary calculus and is itself correct. The problem pointed out was that the use of the derivative terms in equation 44-64 implicitly states that we are using the small-noise model, which, especially when changing the differentials to finite differences in equation 44-65, results in incorrect equations.

In our previous column [4] we had created an expression for T + ΔT (as equation 44-51) and separated out an expression for ΔT (as equation 44-52b). We present these two equations here:

$$T + \Delta T = \frac{E_s}{E_r + \Delta E_r} + \frac{\Delta E_s}{E_r + \Delta E_r} \tag{44-51}$$

from which we concluded that:

$$\Delta T = \frac{\Delta E_s}{E_r + \Delta E_r} \tag{44-52b}$$

At this point we would like to compute the variance of ΔT, but simply computing the variance of ΔEs/(Er + ΔEr) would also not be correct, since it would ignore the influence of the variability of the first term in equation 44-51 [4], and not take its contribution to the variance into proper account. Therefore the expression for ΔT in equation 44-52b is not correct, even though it is the result of the formal breakup of equation 44-51 [4]. We should be using a formula such as:

$$\Delta T = \frac{E_s}{E_r + \Delta E_r} + \frac{\Delta E_s}{E_r + \Delta E_r} \tag{44-64a}$$

in order to include the variability of the first term, also.

This, however, leads to another problem: subtracting equation 44-64a from equation 44-51 leaves us with the result that T = 0. Furthermore, the definition of T then gives us the result that Es is zero, and that therefore ΔT is in fact equal to the expression given by equation 44-52b anyway, despite our efforts to include the contribution to the variance of the first term in equation 44-51.

Our conclusion is that the original separation of equation 44-51 into two equations, while it served us well for computing TM and TA, fails us here. This is because ΔEs and ΔEr are random variables and we cannot treat their influences separately; we have no expectation that they will either cancel or reinforce each other, wholly or partially, in any particular measurement. Therefore when we compute the variance of ΔT we wish to retain the contribution from both terms.

This also raises a further question: the analysis of equation 44-52a by itself served us well, as we noted; but was it proper, or should we have maintained all of equation 44-51, as we find we must do here? The answer is yes, it was correct, and the justification is given toward the end of the previous column [4]. The symmetry of the expression when


averaged over values of ΔEs means that the average will be zero for each value of ΔEr, and therefore the average of the entire second term will always be zero.

Therefore, the best way to maintain the entire expression is to go back still a further step, and note that the ultimate source of equation 44-51 was equation 44-5 [2]:

$$T + \Delta T = \frac{E_s + \Delta E_s}{E_r + \Delta E_r} \tag{44-5}$$

Therefore we solve equation 44-5 for ΔT and, noting the definition of T, we find:

$$\Delta T = \frac{E_s + \Delta E_s}{E_r + \Delta E_r} - \frac{E_s}{E_r} \tag{44-65a}$$

Then we take the variance of both sides:

$$\mathrm{Var}(\Delta T) = \mathrm{Var}\left(\frac{E_s + \Delta E_s}{E_r + \Delta E_r} - \frac{E_s}{E_r}\right) \tag{44-66a}$$

Once again applying the rule that the variance of a sum is the sum of the variances, we obtain:

$$\mathrm{Var}(\Delta T) = \mathrm{Var}\left(\frac{E_s + \Delta E_s}{E_r + \Delta E_r}\right) + \mathrm{Var}\left(\frac{E_s}{E_r}\right) \tag{44-67a}$$

Since Es/Er is the true transmittance of the sample, the value of T for a given sample is constant, and therefore the variance of that term is zero, resulting in:

$$\mathrm{Var}(\Delta T) = \mathrm{Var}\left(\frac{E_s + \Delta E_s}{E_r + \Delta E_r}\right) \tag{44-68a}$$

The variables in equation 44-68a are again not separable. While we could formally split equation 44-68a into the sum of two variances:

$$\mathrm{Var}(\Delta T) = \mathrm{Var}\left(\frac{E_s}{E_r + \Delta E_r}\right) + \mathrm{Var}\left(\frac{\Delta E_s}{E_r + \Delta E_r}\right) \tag{44-69a}$$

that would not be correct because the two variances that we wish to add have a common term, Er + ΔEr, and therefore are not independent of each other, as application of the rule for adding variances requires [2]. Also, evaluation of a variance by integration requires the integral of the square of the varying term, which as we have seen previously [13] is always positive, and therefore the integrals of both terms of equation 44-69a diverge.

Thus we conclude that we must compute the variance of ΔT directly from equation 44-68a and the definition of variance:

$$\mathrm{Var}(\Delta T) = \frac{\displaystyle\sum_{i=1}^{n}\left[\left(\frac{E_s + \Delta E_s}{E_r + \Delta E_r}\right) - \overline{\left(\frac{E_s + \Delta E_s}{E_r + \Delta E_r}\right)}\right]^2}{n - 1} \tag{44-70a}$$


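We can learn something interesting by again noting, as we did previously [4], that ΔEs has a mean of zero; therefore equation 44-70a becomes:

$$\mathrm{Var}(\Delta T) = \frac{\displaystyle\sum_{i=1}^{n}\left[\left(\frac{E_s + \Delta E_s}{E_r + \Delta E_r}\right) - \overline{\left(\frac{E_s}{E_r + \Delta E_r}\right)}\right]^2}{n - 1} \tag{44-71a}$$

and by splitting up the first term in the numerator of equation 44-71a into its two parts:

$$\mathrm{Var}(\Delta T) = \frac{\displaystyle\sum_{i=1}^{n}\left[\left(\frac{E_s}{E_r + \Delta E_r}\right) + \left(\frac{\Delta E_s}{E_r + \Delta E_r}\right) - \overline{\left(\frac{E_s}{E_r + \Delta E_r}\right)}\right]^2}{n - 1} \tag{44-72a}$$

and rearranging the terms:

$$\mathrm{Var}(\Delta T) = \frac{\displaystyle\sum_{i=1}^{n}\left[\left(\frac{E_s}{E_r + \Delta E_r}\right) - \overline{\left(\frac{E_s}{E_r + \Delta E_r}\right)} + \left(\frac{\Delta E_s}{E_r + \Delta E_r}\right)\right]^2}{n - 1} \tag{44-73a}$$

and again using the definition of variance:

$$\mathrm{Var}(\Delta T) = \mathrm{Var}\left(\frac{E_s}{E_r + \Delta E_r}\right) + \frac{\displaystyle\sum_{i=1}^{n}\left(\frac{\Delta E_s}{E_r + \Delta E_r}\right)^2}{n - 1} \tag{44-74a}$$

and then the definition of the average value:

$$\mathrm{Var}(\Delta T) = \mathrm{Var}\left(\frac{E_s}{E_r + \Delta E_r}\right) + \frac{n}{n - 1}\,\overline{\left(\frac{\Delta E_s}{E_r + \Delta E_r}\right)^2} \tag{44-75a}$$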

where we note that n/(n − 1) → 1 as n becomes indefinitely large. Of course, the noise level we want will be the square root of equation 44-75a. We have previously seen, in equation 44-77 [13], that the variance term in equation 44-75a diverges, and clearly, as ΔEr → −Er the second term in equation 44-75a also becomes infinitely large. However, as we discussed at the conclusion of the original analysis, using finite differences means that the probability of a given data point having ΔEr close enough to −Er to cause a problem is small, especially as Er increases. This allows for the possibility that a finite value for an integral can be computed. To recapitulate some of that here, it was a matter of noting two points. First, as Er gets further and further away from zero (in terms of SD) it becomes increasingly unlikely that any given value of ΔEr will be close enough to −Er to cause trouble. Second, in a real instrument there is, of necessity, some maximum limit on the value that 1/(Er + ΔEr) can attain, due to the inability to contain an actually infinite number. Therefore it is not unreasonable to impose a corresponding limit on our calculations, to correspond to that physical limit.

We now consider how to compute the variance of ΔT, according to equation 44-68a. Ordinarily we would first discuss converting the summations of finite differences to integrals, as we did previously, but we will forbear that, leaving it as an exercise for the reader. Instead we will go directly to consideration of the numerical evaluation of equation 44-68a, since a conversion to an integral would require a back-conversion to finite differences in order to perform the calculations. We wish to evaluate equation 44-68a for different values of Es and Er, when each is subject to random variation. Note that although Var(ΔEs) = Var(ΔEr), we cannot simply set the two terms equal to a common generic value of ΔE as we did previously, since that would imply that the instantaneous values of ΔEs and ΔEr were the same; of course they are not, since we assume that they are independent noise contributions, although they have the same variance. Under these conditions it is simplest to work with equation 44-68a itself, rather than any of the other forms we found it convenient to convert equation 44-68a into for the illustrations of the various points we presented and discussed.

There are still a variety of ways we can approach the calculations. We could assume that Es or Er were constant and examine how the noise varies as the other was changed. We could also hold the transmittance constant and examine how the transmittance noise varies as both Es and Er are changed proportionately. What we will actually do here, however, is all of these. First we will assume that the ratio Es/Er, representing T, the true transmittance of the sample, is constant, and examine how the noise varies as the S/N ratio is changed by varying the value of Er, for a constant noise contribution to both Es and Er.
The noise level itself, of course, is the square root of the expression in equation 44-67a:

$$\mathrm{SD}(\Delta T) = \sqrt{\mathrm{Var}\left(\frac{E_s + \Delta E_s}{E_r + \Delta E_r}\right)} \tag{44-76a}$$

To do the computations, we again use the random number generator of MATLAB to produce Normally distributed random numbers with unity variance to represent the noise; values of Er will then directly represent the S/N ratio of the data being evaluated. For the computations reported here, we calculate the variance from 100,000 synthetic values of the expression on the RHS of equation 44-76a for each combination of conditions we investigate. A graph of the transmittance noise as a function of the reference S/N ratio is presented in Figure 44-7a-1, and an expanded portion of Figure 44-7a-1 is shown in Figure 44-7a-2. The “true” transmittance Es/Er was set to unity (i.e., 100 %T).

The inevitable existence of a limit on the value of TM, as described in the section following equation 44-75a, was examined in Figure 44-7a-1 by performing the computations for two values of that limit, setting the limit value (somewhat arbitrarily, to be sure) to 1,000 and 10,000, corresponding to the lower and upper curves, respectively.

Note that there are effectively two regimes in Figure 44-7a-1, with the transition between regimes occurring when the value of the S/N ratio equals approximately 4. When the value of Er was greater than approximately four, i.e., the S/N ratio was greater than four, the curves are smooth and appear to be well-behaved. When Er was below an S/N of four, the graph entered a regime of behavior that shows an appreciable random component. The transition point between these two regimes would seem to represent an implicit definition of the “low noise” versus the “high noise” conditions of measurement. In the low-noise regime the transmittance noise decreases smoothly and continuously as

Figure 44-7a-1 Transmittance noise as a function of reference S/N ratio, for alternate analysis (equation 44-68a). The sample transmittance was set to unity. The limit for the value of (Es + ΔEs)/(Er + ΔEr) was set to 10,000 for the upper curve and to 1000 for the lower curve. Axes: transmittance noise versus S/N (Er/ΔEr). (see Color Plate 6)

Figure 44-7a-2 Expansion of Figure 44-7a-1. Axes: transmittance noise versus S/N (Er/ΔEr). (see Color Plate 7)

the S/N ratio increases. This was verified by other graphs (not shown) that extended the value of S/N ratio beyond what is shown here. The “high-noise” regime seen in Figure 44-7a-1 is the range of values of S/N ratio where the computed standard deviation is grossly affected by the closeness of the approach of individual values of ΔEr to −Er. This is, in fact, a probabilistic effect, since it depends not only on how closely the two numbers approach each other, but also on how often that occurs; a single or only a few “close approaches” will be lost in a large number of readings where that does not happen. As we will see below, there is indeed a regime where the theoretical “low-noise” approximation differs from the results we find here, without becoming randomized. Changing the number of values of (Es + ΔEs)/(Er + ΔEr) used for the computation of the variance made no difference in the nature of the graph. As is the case in Figure 44-7a-1, the transition between the low- and high-noise regimes continues to occur at a value between 4 and 5.

Figure 44-8a Comparison of empirically determined transmittance noise value with those determined according to the low-noise approximations of equation 44-19 and equation 44-52b. Curves: Monte Carlo (equation 44-76a), Theory (equation 44-19), Approx (equation 44-52b). Axes: transmittance noise versus S/N (Er/ΔEr). (see Color Plate 8)

Figure 44-8a shows the graph of transmittance noise computed empirically from equation 44-76a, compared to the transmittance noise computed from the theory of the low-noise approximation, as per equation 44-19 [2], and from the approach, under question, of using equation 44-52b. We see that there is a third regime, where the difference between the actual noise level and the low-noise approximation is noticeable, but the computed noise has not yet become subject to the extreme fluctuations engendered by the too-close approach of ΔEr to −Er. Since the empirically determined curve approaches the theoretical curve asymptotically as the S/N increases, where the separation becomes “noticeable” will depend on how hard you look, but there is certainly a region in which this occurs, in any case. This is the situation we alluded to above, representing the “middle ground” of the transmittance noise.

Figure 44-9a-1 shows what happens to the noise level, for the same condition of constant “sample transmittance”, as a function of S/N, for different values of sample transmittance. As we see, in the “low noise” regime the noise has the behavior we have derived for it. However, the effect of the exaggeration of the random variations very quickly takes over, and in the “high noise” regime there is virtually no difference in the

Figure 44-9a-1 Transmittance noise as a function of reference S/N ratio, at various values of sample transmittance. Blue curve: T = 1. Green curve: T = 0.5. Red curve: T = 0.1. Axes: transmittance noise versus S/N (Er/ΔEr). (see Color Plate 9)

Figure 44-9a-2 Expansion of Figure 44-9a-1, with curves labeled T = 1, T = 0.5 and T = 0.1. Axes: transmittance noise versus S/N (Er/ΔEr). (see Color Plate 10)

noise behavior at different values of transmittance, since that is now dominated by the divergence of the integrals involved. A verification of the effects seen in Figures 44-9a-1 and 44-9a-2, which is also an investigation that was part of our original plan, is presented in Figure 44-10a, a graph showing the transmittance noise as a function of the sample transmittance Es/Er. As we see, except for the occasional spike, when the S/N ratio is

Figure 44-10a Transmittance noise as a function of transmittance, for different values of reference energy S/N ratio (recall that, since the standard deviation of the noise equals unity, the set value of the reference energy equals the S/N ratio). Axes: transmittance noise versus transmittance; curve labels include S/N = 4 and S/N = 4.5. (see Color Plate 11)

5 and even when it is only 4.5, the transmittance noise varies essentially as we saw in working out the exact solution for transmittance noise in the low-noise case. Naturally, the underlying transmittance noise value is higher when the reference S/N ratio is lower. When the S/N ratio decreases to 4, then “spikes” happen frequently enough that it becomes almost impossible to tell where the “underlying” transmittance noise level is, since the computed values are again dominated by the divergent integrals.
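The Monte Carlo evaluation of equation 44-76a discussed in this section can be sketched in Python as shown below. This is only an illustration under stated assumptions: the function names, seed and sample size are ours, the ratio is clipped to ±limit as a stand-in for the instrument limit (the text does not say exactly how the limit was applied), and the comparison curve uses the low-noise form √(1 + T²)/Er for unit detector noise.

```python
import math
import random
import statistics

def transmittance_noise(e_r, t_true, n_values=50_000, limit=1000.0, seed=0):
    """Monte Carlo SD of (Es + dEs)/(Er + dEr), following equation 44-76a.

    The noise terms are independent Normal deviates with unit standard
    deviation, so e_r is the reference S/N ratio. Each computed ratio is
    clipped to +/-limit (an assumed way of imposing the physical limit).
    """
    rng = random.Random(seed)
    e_s = t_true * e_r
    ratios = []
    for _ in range(n_values):
        denom = e_r + rng.gauss(0.0, 1.0)
        numer = e_s + rng.gauss(0.0, 1.0)
        ratio = numer / denom if denom != 0.0 else math.copysign(limit, numer)
        ratios.append(max(-limit, min(limit, ratio)))
    return statistics.stdev(ratios)

def low_noise_theory(e_r, t_true):
    """Low-noise approximation sqrt(1 + T^2)/Er for unit detector noise."""
    return math.sqrt(1.0 + t_true**2) / e_r

if __name__ == "__main__":
    for e_r in (50.0, 10.0, 6.0, 4.0, 2.0):
        mc = transmittance_noise(e_r, 1.0)
        print(f"S/N = {e_r:5.1f}  Monte Carlo = {mc:10.4f}  theory = {low_noise_theory(e_r, 1.0):.4f}")
```

Changing `limit`, `seed` or `n_values` is a simple way to probe how sensitive the computed noise is to rare close approaches of ΔEr to −Er at low S/N.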

ABSORBANCE NOISE IN THE “HIGH NOISE” REGIME

Just as equation 44-5, which led to equation 44-76a, was the starting point for investigating the behavior of transmittance noise in the high noise regime, so too is equation 44-24 the starting point for investigating the behavior of absorbance noise in the high noise regime. While we presented equation 44-24 above, in the original analysis, we did not follow through to investigate its behavior, since we went directly to the analysis of the behavior of Var(ΔA/A) instead. Therefore we present equation 44-24 again, and take this opportunity to investigate it:

$$\Delta A = \frac{-0.4343\,E_r}{E_s}\left(\frac{E_r\,\Delta E_s - E_s\,\Delta E_r}{E_r\,(E_r + \Delta E_r)}\right) \tag{44-24}$$

We therefore take the variance of ΔA:

$$\mathrm{Var}(\Delta A) = \mathrm{Var}\left(\frac{-0.4343\,E_r}{E_s}\left(\frac{E_r\,\Delta E_s - E_s\,\Delta E_r}{E_r\,(E_r + \Delta E_r)}\right)\right) \tag{44-77a}$$

Then we multiply through:

$$\mathrm{Var}(\Delta A) = \mathrm{Var}\left(\frac{-0.4343\,(E_r^2\,\Delta E_s - E_r E_s\,\Delta E_r)}{E_s E_r\,(E_r + \Delta E_r)}\right) \tag{44-78a}$$

Using the definition of variance, we get:

$$\mathrm{Var}(\Delta A) = \frac{\displaystyle\sum_{i=1}^{n}\left[\left(\frac{-0.4343\,(E_r^2\,\Delta E_s - E_r E_s\,\Delta E_r)}{E_s E_r\,(E_r + \Delta E_r)}\right) - \overline{\left(\frac{-0.4343\,(E_r^2\,\Delta E_s - E_r E_s\,\Delta E_r)}{E_s E_r\,(E_r + \Delta E_r)}\right)}\right]^2}{n - 1} \tag{44-79a}$$

Again, the mean values of ΔEs and ΔEr are both zero; therefore the mean term of equation 44-79a vanishes, leaving us with:

$$\mathrm{Var}(\Delta A) = \frac{\displaystyle\sum_{i=1}^{n}\left(\frac{-0.4343\,(E_r^2\,\Delta E_s - E_r E_s\,\Delta E_r)}{E_s E_r\,(E_r + \Delta E_r)}\right)^2}{n - 1} \tag{44-80a}$$

Again we see that the variance of the absorbance equals n/(n − 1) times the mean value of the summand of equation 44-80a, and also that we can ignore the premultiplier term n/(n − 1) for large values of n. We begin our investigation of the behavior of the absorbance noise by comparing it to the theoretical expectation from the low-noise condition according to equation 44-32 [3]. This comparison is shown in Figures 44-11a-1 and 44-11a-2. These figures show what we might expect: that as the S/N increases the computed value approaches the theoretical

Figure 44-11a-1 Comparison of computed absorbance noise to the theoretical value (according to equation 44-32), as a function of S/N ratio, for constant transmittance (set to unity). Curves: Computed, Theory. Axes: absorbance noise versus S/N (Er/ΔEr). (see Color Plate 12)

Figure 44-11a-2 Expansion of Figure 44-11a-1. Curves: Computed, Theory. Axes: absorbance noise versus S/N (Er/ΔEr). (see Color Plate 13)

value for the low-noise approximation, and also an excessive bulge at very low values of S/N, apparently similar to the abnormally large values observed in the behavior of the transmittance at very low values of S/N. After performing this comparison, we will not pursue the analysis any further, since we will obtain the results we would expect to get from the analysis of the transmission behavior. There is, however, something unexpected about Figure 44-11a-1. That is the decrease in absorbance noise at the very lowest values of S/N, i.e., those lower than approximately Er = 1. This decrease is not a glitch or an artifact, nor a result of the random effects of divergence of the integral of the data such as we saw when performing a similar computation on the simulated transmission values. The effect is consistent and reproducible. In fact, it appears to be somewhat similar in character to the decrease in computed transmittance we observed at very low values of S/N for the low-noise case, e.g., that shown in Figure 43-6.
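A comparable sketch for the absorbance noise evaluates equation 44-24 for many simulated noise pairs and takes the standard deviation. The names, seed and sample size below are our own assumptions. The comparison value is a first-order estimate that we obtain by dropping ΔEr from the denominator of equation 44-24; it is offered only as a sanity check at high S/N and is not claimed to be the expression of equation 44-32. Note also that `statistics.stdev` subtracts the sample mean, whereas equation 44-80a drops the mean term.

```python
import math
import random
import statistics

LOG10_E = 0.4343  # log10(e), rounded as in the text

def delta_absorbance(e_r, e_s, d_er, d_es):
    """Absorbance error for one pair of noise readings, per equation 44-24."""
    return -LOG10_E * e_r * (e_r * d_es - e_s * d_er) / (e_s * e_r * (e_r + d_er))

def absorbance_noise(e_r, t_true, n_values=50_000, seed=0):
    """Monte Carlo SD of the absorbance error for unit detector noise."""
    rng = random.Random(seed)
    e_s = t_true * e_r
    values = []
    for _ in range(n_values):
        d_er = rng.gauss(0.0, 1.0)
        d_es = rng.gauss(0.0, 1.0)
        values.append(delta_absorbance(e_r, e_s, d_er, d_es))
    return statistics.stdev(values)

def first_order_sd(e_r, t_true):
    """First-order SD: equation 44-24 with dEr neglected in the denominator."""
    e_s = t_true * e_r
    return LOG10_E * math.sqrt(1.0 / e_s**2 + 1.0 / e_r**2)

if __name__ == "__main__":
    for e_r in (50.0, 20.0, 10.0, 6.0):
        mc = absorbance_noise(e_r, 1.0)
        print(f"S/N = {e_r:5.1f}  Monte Carlo = {mc:.5f}  first order = {first_order_sd(e_r, 1.0):.5f}")
```

No clipping is applied here, so at low S/N the results will depend strongly on the seed and sample size; reproducing the low-S/N portion of the published figure would require knowing how the original computation handled extreme values.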

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 15(10), 24–25 (2000).
2. Mark, H. and Workman, J., Spectroscopy 15(11), 20–23 (2000).
3. Mark, H. and Workman, J., Spectroscopy 15(12), 14–17 (2000).
4. Mark, H. and Workman, J., Spectroscopy 16(2), 44–52 (2001).
5. Voigtman, E., Analytical Instrumentation 21(1&2), 43–62 (1993).
6. Voigtman, E., Analytical Chemistry 69(2), 226–234 (1997).
7. Voigtman, E., Analytical Chemistry 65, 1029A–1035A (1993).
8. Voigtman, E., Analytical Chemistry 64, 2590–2598 (1992).
9. Voigtman, E., Analyst 120(February), 325–330 (1995).
10. Hald, A., Statistical Theory with Engineering Applications (John Wiley & Sons, Inc., New York, 1952).
11. Korn, G.A. and Korn, T.M., Mathematical Handbook for Scientists and Engineers, 1st ed. (McGraw-Hill Book Company, New York, 1961).
12. Ingle, J.D. and Crouch, S.R., Spectrochemical Analysis (Prentice-Hall, Upper Saddle River, NJ, 1988).
13. Mark, H. and Workman, J., Spectroscopy 16(4), 34–37 (2001).


45

Analysis of Noise: Part 6

This chapter is the continuation of Chapters 40–44, referenced from their original papers [1–5], dealing with the rigorous derivation of the expressions relating instrument (and other) noise to its effects on the spectra we observe. Chapter 40 in this noise series was an overview; since then we have been analyzing the effect of noise on spectra when the noise is constant detector noise, that is, noise that is independent of the strength of the optical signal. Inasmuch as we are dealing with a continuous set of chapters, we again continue our discussion by serially numbering our equations, figures, use of symbols, and so on as though there were no break.

We left off in our previous Chapter 44 having concluded that the noise level becomes infinite, both for individual noise pulses and for the variance of the noise, when the value of the reference signal actually crosses zero and becomes negative; we learned this from the following equation, which we reproduce from our previous chapter:

$$\mathrm{Var}\left(\frac{\Delta A}{A}\right) = \left(\frac{1}{T \ln T}\right)^2 \mathrm{Var}\left(\frac{\Delta E_s}{E_r + \Delta E_r}\right) + \left(\frac{1}{\ln T}\right)^2 \mathrm{Var}\left(\frac{-\Delta E_r}{E_r + \Delta E_r}\right) \tag{45-77}$$

and we showed that both variance terms become infinite at sufficiently small values of Er. However, that still leaves open the question of the behavior of the noise when the reference signal is not quite low enough for the noise to become infinite, but is still small enough that the noise level cannot be considered completely negligible.

First of all, we must note that the two terms of equation 45-77 are not exactly the same. While we tested the behavior of the expressions using a random number generator that produces a Normal distribution of numbers with unity variance, the variance of the entire term is not necessarily unity, especially when, as in the second term of equation 45-77, the same random variable appears in both the numerator and the denominator. The first task, then, is to compare the behavior of those two terms.

It was necessary to determine the variances of the two terms in equation 45-77 empirically for comparison. To do this, 10,000 random values for ΔEr, created by the MATLAB random number generator to be Normally distributed with variance = 1, were used for each of the two terms in equation 45-77, and the variance was computed for various values of Er between 3 and 20. A different set of 10,000 random numbers was used for each value of Er. Figure 45-9 presents the two curves obtained. It is clear that, while the variance of ΔEr/(Er − ΔEr) is larger than that of ΔEs/(Er − ΔEr) when Er is small, the two curves converge for values of Er above approximately 8 times the variance of the noise. From this it would seem, then, that when the reference signal is at least approximately 3 times its noise level as measured by its standard deviation, we are entering the “low-noise” regime that we discussed previously in Chapters 41 and 42, where the approximations made there apply [2, 3].
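The comparison of the two terms can be sketched in a few lines of code. What follows is a minimal Monte Carlo sketch rather than the authors' MATLAB program: it uses only the Python standard library, and the function name `term_variances`, the seed, and the choice to draw ΔEs and ΔEr as independent unit-variance Normal values are assumptions made for illustration.

```python
import random
import statistics


def term_variances(e_r, n=10_000, seed=0):
    """Estimate the variances of the two ratio terms in equation 45-77.

    e_r is the noise-free reference energy in units of the noise standard
    deviation. Assumption: dEs and dEr are independent N(0, 1) values.
    """
    rng = random.Random(seed)
    es_term, er_term = [], []
    for _ in range(n):
        d_es = rng.gauss(0.0, 1.0)  # noise on the sample reading
        d_er = rng.gauss(0.0, 1.0)  # noise on the reference reading
        denom = e_r + d_er          # noisy reference reading
        es_term.append(d_es / denom)
        er_term.append(-d_er / denom)
    return statistics.pvariance(es_term), statistics.pvariance(er_term)


if __name__ == "__main__":
    for e_r in (3, 5, 8, 12, 20):
        var_es, var_er = term_variances(e_r)
        print(f"Er = {e_r:2d}  Var(Es term) = {var_es:.5f}  Var(Er term) = {var_er:.5f}")
```

At large Er both estimates should lie close to 1/Er² (for unit-variance noise), which is one way to check that the sketch behaves sensibly before looking at the small-Er region, where the results depend strongly on the particular random values drawn.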

[Figure 45-9: two panels, “Variances of the two terms in equation 45-77” and “Expansion of plot of terms in equation 45-77”. Abscissa: Er, from 3 to 20; ordinate: variance.]

Figure 45-9 Values of the variance of ΔEr/(Er − ΔEr) and ΔEs/(Er − ΔEr) for various values of Er, with a Normal distribution of values for the errors.

Now, in this regime, where the two variances become equal, we can again equate ΔEs and ΔEr and replace them both with a generic term, ΔE; the variance can then be factored from equation 45-77:

$$\operatorname{Var}\left(\frac{\Delta A}{A}\right) = \left[\left(\frac{1}{T \ln T}\right)^{2} + \left(\frac{1}{\ln T}\right)^{2}\right] \operatorname{Var}\left(\frac{-\Delta E}{E_r + \Delta E}\right) \tag{45-78}$$

so that now, when standard deviations are taken, it can be put into terms of the standard deviation of the expression involving the generic ΔE. However, that only addresses the limiting case. We are interested in the behavior of the standard deviation of ΔA/A in this whole intermediate regime, so that we can determine the optimum sample transmittance, just as we did before for data measured in the regime where the signal is always much greater than the noise. This also assumes that we can assign a meaning to the word “optimum” in a situation where the noise is comparable to or even greater than the signal. But that is a philosophical question, which we will not attempt to address here; we simply want to follow where the mathematics lead us.

Since we can, however, compute the variances corresponding to the two terms in equation 45-77 for various values of Er, we can plot the family of curves of SD(ΔA)/A, with Er as the parameter of the family. Since the two variances are, in the regime of interest, unequal and are multiplied by different functions of T, it is not unreasonable to expect that the minima of the curves corresponding to different members of the family will occur at different values of T. Figure 45-10 presents this family, for values of Er between 3 and 10, and for %T between 0.1 and 0.9. It is clear that there is indeed a family of curves. However, the variation on the ordinate is due mainly to the changes in signal-to-noise ratio as Er decreases. What is of more concern to us here is whether the value of %T at which the curve passes through a minimum changes, and if so how, as Er changes.

To this end, the program that computed the curves in Figure 45-10 was modified: instead of simply computing the values of variance, it also computed the derivative (estimated as the first difference) of those curves, and then solved for the value at which the derivative was zero, for the various values of Er. The results are shown in Figure 45-11. It is obvious that for values of Er greater than five (standard deviations of the noise), the optimum transmittance remains at the level we noted previously, 33 %T. When the reference energy level falls below five standard deviations, however, the “optimum” transmittance starts to decrease.
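The search for the minimum can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the program used for Figure 45-11: the two variances of equation 45-77 are passed in as plain numbers (`var_es`, `var_er`), the grid limits and step are arbitrary choices, and the function names are hypothetical.

```python
import math


def relative_sd(t, var_es, var_er):
    """SD of dA/A from equation 45-77 for transmittance t, with 0 < t < 1."""
    ln_t = math.log(t)
    return math.sqrt(var_es / (t * ln_t) ** 2 + var_er / ln_t ** 2)


def optimum_transmittance(var_es, var_er, t_min=0.01, t_max=0.99, step=0.001):
    """Return the grid value of T at which the first difference changes sign.

    Returns None if the curve has no interior minimum on the grid.
    """
    n = int(round((t_max - t_min) / step)) + 1
    ts = [t_min + i * step for i in range(n)]
    sd = [relative_sd(t, var_es, var_er) for t in ts]
    for i in range(1, n - 1):
        falling = sd[i] - sd[i - 1] < 0.0   # curve still descending into point i
        rising = sd[i + 1] - sd[i] >= 0.0   # curve ascending out of point i
        if falling and rising:
            return ts[i]
    return None


if __name__ == "__main__":
    print("equal variances      :", optimum_transmittance(1.0, 1.0))
    print("larger Er-term value :", optimum_transmittance(1.0, 2.0))
```

With equal variances the minimum falls near T = 0.33, the low-noise value quoted in the text; making the variance of the ΔEr term larger than that of the ΔEs term moves the minimum to lower T, which is the direction of the shift described for small Er.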
The erratic nature of the variance at these low values of Er, however, makes it difficult to ascertain the exact amount of falloff with any degree of precision; nevertheless it is clear that, inasmuch as we can talk about an optimum transmittance level under these conditions, where the variance can become infinite and the actual transmittance value itself is affected, it decreases at such low values of Er. Nevertheless, a close look reveals that when Er has dropped to five standard deviations, the optimum transmittance has dropped to 0.32, and then drops off quickly below that value.

[Figure 45-10: family of curves, labeled from Er = 3 to Er = 10. Abscissa: %T, from 0.1 to 0.9; ordinate: SD(ΔA)/A.]

Figure 45-10 Family of curves for SD(ΔA)/A for different values of Er. (see Color Plate 14)

[Figure 45-11: two panels, “Optimum transmittance using 5,000 values in variance computation” and “Optimum transmittance using 100,000 values in variance computation”. Abscissa: Er, from 0 to 10; ordinate: optimum %T.]

Figure 45-11 Optimum transmittance as a function of Er.

Surprisingly, the optimum value of transmittance appears to reach a minimum value, and then increase again as Er continues to decrease. It is not entirely clear whether this is simply appearance or actually reflects the correct description of the behavior of the noise in this regime, given the unstable nature of the variance values upon which it is based. In fact, originally these curves were computed only for values of Er equal to or greater than three, due to the expectation that no reasonable results could be obtained at lower values of Er. However, when the unexpectedly smooth decrease in the optimum value of %T was observed down to that level, it seemed prudent to extend the calculations to still lower values, whereupon the results in Figure 45-11 were obtained.

Verifying the nature of the curve for at least two sets of variances, calculated from different numbers of random values, was necessary in light of the larger values of variance for the two terms of equation 45-77 encountered when more values were included in the calculation, as described above. However, as Figure 45-12 shows, at even moderate values of Er, all the calculated values of the variance converge.

[Figure 45-12: two panels, “Variances using 5,000 and 100,000 values” and “Expansion of plot”. Abscissa: Er, from 3 to 9.65; ordinate: variance. Labeled curves: Er term, 100,000 values; Es term, 100,000 values; 5,000 values.]

Figure 45-12 Values of the variances in the two terms of equation 45-77, using different numbers of values. (see Color Plate 15)

From Figure 45-12, it appears that once the signal level has fallen low enough to include zero with non-negligible probability, the optimum transmittance varies randomly between zero and a well-defined upper limiting value. This upper limit varies in a well-defined manner, from 0.3 at large values of signal, as we saw previously, through a minimum at roughly 2.5 standard deviations above zero. While it does not seem possible to observe this directly, comparing Figure 45-12 with the results we found for the maximum value of the computed transmittance under high-noise conditions (see Figure 45-6 and the discussion of it) suggests that it would not be surprising if the minimum actually occurred when the signal was 2.11 standard deviations above zero.


The overall conclusion of all this work is that it is surely unfortunate that the effect of noise in the reference reading was not considered for lo these many a year, since that is where all the action seems to be. We continue in our next chapter by considering a special case of constant noise, with characteristics that give somewhat different results than the ones we have obtained here.

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 15(10), 24–25 (2000).
2. Mark, H. and Workman, J., Spectroscopy 15(11), 20–23 (2000).
3. Mark, H. and Workman, J., Spectroscopy 15(12), 14–17 (2000).
4. Mark, H. and Workman, J., Spectroscopy 16(2), 44–52 (2001).
5. Mark, H. and Workman, J., Spectroscopy 16(4), 34–37 (2001).

46 Analysis of Noise: Part 7

This chapter is the continuation of Chapters 40–45, first published as papers [1–6], dealing with the rigorous derivation of the expressions relating instrument (and other) noise to its effects on the spectra we observe. Our first chapter in this set was an overview; since then we have been analyzing the effect of noise on spectra when the noise is constant detector noise, that is, noise that is independent of the strength of the optical signal. As we do in each chapter in this section of the book, we take this opportunity to note that we are dealing with a continuous set of chapters, and so we again continue our equation numbering, figure numbering, use of symbols, and so on as though there were no break.

We left off in Chapter 45 having found an expression for the optimum value of transmittance in situations where the noise is large compared to the signal (or, alternatively, where the signal is small enough to be comparable to the noise), a regime we have investigated for the previous three chapters. Most of the derivations and mathematical analyses we have done so far have been very general, applying to any and all types of noise that might be superimposed on the spectral signal, as long as the noise level was constant and independent of the signal level. Stating it somewhat more rigorously, we assumed that regardless of the signal level, the noise contribution to each measured value represented a random sample taken from a fixed population of such values. In particular, for the most part we made no assumptions about the distribution of the values in the population of the noise readings.

In Chapters 43–45 [4–6], however, we found it necessary to introduce the assumption that the noise was Normally distributed, in order to be able to determine the expected value of the average transmittance and the expected standard deviation of the noise level in the case where the signal level was small enough to be comparable to the noise. The Normal distribution is, of course, an important and common distribution to solve for in this development, but there is another important case where a noise contribution also has a constant standard deviation (i.e., independent of the signal level) but does not have a Normal distribution. These days, this contribution is probably almost as common as the ones having the Normal distribution, although it is not as obvious. Also, it is arguably less important than the other contributions, one reason being that it usually (at least in well-designed instruments) will be swamped out by the other noise sources, and is therefore rarely observed. Nevertheless, this contribution does exist and therefore is worthy of being treated in this compilation of the effects of noise, if only for the purpose of completeness. This source of noise is not usually called noise; in most technical contexts it is more commonly called “error”, but that is just a label; since it is a random contribution to the measured signal, it qualifies as noise just as much as any other noise source.

So what is this mystery phenomenon? It is the quantization noise introduced by the analog-to-digital (A/D) conversion process, and is engendered by the fact that for any analog signal with a value between two adjacent levels that the A/D converter can assign, the difference between the actual value of the electrical voltage and the value represented by the assigned digital value is an error, or noise, and the distribution of this error is uniform. In the past, when instruments were not computer-controlled and all signal processing was done using analog circuits, digitization was not an important consideration. Nowadays, however, since almost all instruments use computerized data collection, this noise source is much more important, since it is so much more common than it used to be.

The situation is illustrated in Figure 46-13. The actual voltage is a continuous, linear physical phenomenon. The values represented by the output of the A/D converter, however, can only take discrete levels, as illustrated. The double-headed arrows represent the error introduced by digitizing the continuous physical voltage at various points. The error cannot be greater than 1/2 the difference between adjacent levels of the converter; if the voltage increases beyond 1/2 the difference between levels, then the conversion will provide the next step’s representation of the value. Furthermore, if the sampling point is random with respect to the A/D conversion levels, as happens, for example, with any varying signal, then the actual voltage at the sampling point can be anywhere between two adjacent levels with equal probability, and therefore the error (or noise) introduced will be uniformly distributed between +1/2 and −1/2 of the step size. This can happen even in the absence of other noise sources, as long as the signal varies, as it would, say, when a source is modulated. In that case the measurement points will have a random relationship to the digitization levels. This effect could conceivably even become observable as the dominant error source, if the instrument has an extremely low noise level (a favorable case) or too-large differences between A/D levels due to the A/D converter having too few bits (an unfavorable case).
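The uniform character of this error is easy to check numerically. The sketch below is an illustration only: it models an idealized converter that rounds to the nearest level, and the function names, the voltage span, and the number of samples are assumptions chosen for the example.

```python
import math
import random
import statistics


def quantize(voltage, step=1.0):
    """Return the A/D level nearest to voltage; levels are integer multiples of step."""
    return step * round(voltage / step)


def quantization_errors(n=50_000, step=1.0, span=1000.0, seed=0):
    """Digitize n voltages drawn at random over many A/D steps and return the errors."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n):
        v = rng.uniform(0.0, span)  # sampling point unrelated to the A/D levels
        errors.append(quantize(v, step) - v)
    return errors


if __name__ == "__main__":
    err = quantization_errors()
    print("largest |error| :", max(abs(e) for e in err))
    print("SD of error     :", statistics.pstdev(err))
    print("step / sqrt(12) :", 1.0 / math.sqrt(12.0))
```

The largest error should not exceed half a step, and the standard deviation should come out close to the step size divided by √12, the standard result for a uniform distribution.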

[Figure 46-13. Abscissa: applied voltage; ordinate: measured value. Labeled features: actual voltage, A/D step, error.]

Figure 46-13 The actual voltage is a continuous, linear function. The values represented by the output of the A/D converter, however, can only take discrete levels. The double-headed arrows represent the error introduced by digitizing the continuous physical voltage at various points.


EFFECT OF NOISE ON COMPUTED TRANSMITTANCE

Therefore it is necessary at this point to repeat the investigations we did for Normally distributed noise, but to consider the effect of range-limited, uniformly distributed noise. We will find that investigating this special case is relatively simple compared to the previous derivations, both because the expressions we find are much simpler than the previous ones and also because we have previously derived much of what we need here, and so can simply start at an appropriate point and continue along the appropriate path. The point in our previous discussions where the distribution of the noise was found to matter was the point at which we had to introduce the distribution of the errors in the first place; all discussion, derivations, and so on prior to that were independent of the distribution of the errors. That point was equation 43-60 in Chapter 43 (first published as [4]), where we introduced the weighted average in order to be able to compute the expected value for the measured transmittance, under conditions where the signal was small enough to be comparable to the noise. So let us repeat our previous work, starting at the appropriate point, and investigate both the computed transmittance and the noise of the transmittance, when the noise and signal have comparable magnitudes, but the noise is now uniformly distributed:

$$X_W = \frac{\int W(x)\, f(x)\, dx}{\int W(x)\, dx} \tag{46-60}$$

In the case we investigated there, we had previously derived that the calculated transmittance for an individual reading was

$$f(x) = \frac{E_s}{E_r + \Delta E_r} \tag{46-52a}$$

and in that case, we set the weighting function W(x) to be the Normal distribution. We are now interested in what happens when the weighting function is a uniform distribution. Therefore the formula for the expected value of the mean transmittance, found by using equation 46-52a for f(x) and (1) for W(x) in the interval from −1/2 to +1/2 (and zero outside that interval), becomes

$$T_{WU} = \frac{\int_{-1/2}^{1/2} (1)\, \dfrac{E_s}{E_r + \Delta E_r}\, d(\Delta E_r)}{\int_{-1/2}^{1/2} (1)\, d(\Delta E_r)} \tag{46-79}$$

In equation 46-79, TWU represents the mean computed transmittance for Uniformly distributed noise, and the parenthesized (1) in both the numerator and the denominator is a surrogate for the actual voltage difference between successive values represented by the A/D steps: essentially a normalization factor for the actual physical voltages involved. In any case, if the actual voltage difference were used in equation 46-79, it would be factored out of both the numerator and the denominator integrals, and the two would then cancel. Since the denominator is unity in either case, equation 46-79 now simplifies to

$$T_{WU} = \int_{-1/2}^{1/2} \frac{E_s}{E_r + \Delta E_r}\, d(\Delta E_r) \tag{46-80}$$

Equation 46-80 is of reasonably simple form; indeed, the evaluation of this integral is considerably simpler than when the noise was Normally distributed. Not only is it possible to evaluate equation 46-80 analytically, it is one of the Standard Forms for indefinite integrals and can be found in integral tables in elementary calculus texts, in handbooks such as the Handbook of Chemistry and Physics, and in other reference books. The standard form for this integral is

$$\int \frac{1}{a + bx}\, dx = \frac{1}{b} \ln(a + bx)$$

To convert equation 46-80 to its Standard Form, we simply move Es outside the integral, whereupon equation 46-80 becomes

$$T_{WU} = E_s \int_{-1/2}^{1/2} \frac{1}{E_r + \Delta E_r}\, d(\Delta E_r) \tag{46-81}$$

By setting a = Er and b = 1, the integral of equation 46-81 is

$$T_{WU} = E_s \ln(E_r + \Delta E_r)\,\Big|_{-1/2}^{1/2} \tag{46-82}$$

On setting Es = TEr and expanding equation 46-82 by substituting the limits of integration:

$$T_{WU} = T E_r \ln\left|E_r + \tfrac{1}{2}\right| - T E_r \ln\left|E_r - \tfrac{1}{2}\right| \tag{46-83}$$

From equation 46-83 we see that the expectation for the measured value of TWU is proportional to the true value of T (i.e., Es/Er), multiplied by a multiplier that is a function of Er. Figure 46-14 presents this function. Just as the expected value for transmittance (TW) in the case of Normally distributed noise went through a maximum, so too does the expected value for uniformly distributed noise, and the multiplier approaches unity at large values of Er, as it should. We note, however, that the value of the function at Er = 0.5 is not a valid value. When Er = 0.5, the argument of the logarithm in the second term of equation 46-83 is zero, and the value of the log becomes undefined. The function approaches an asymptote at Er = 0.5, indicating the mathematical undecidability of the value of the function, even though an actual physical A/D converter will indeed produce one or the other value at that point.

[Figure 46-14. Abscissa: Er, from 0 to 2.4; ordinate: multiplication factor, from 0 to 2.5.]

Figure 46-14 Plot of the multiplication factor of equation 46-83 as a function of Er. Abscissa unit is the difference between digitization levels.
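The multiplication factor of equation 46-83 can be evaluated directly, and for Er above 0.5 it can be cross-checked against a brute-force average of Er/(Er + ΔEr) over the uniform interval. The following is a minimal sketch with hypothetical function names; the midpoint-rule check is an illustrative choice, not part of the original derivation.

```python
import math


def multiplier(e_r):
    """Multiplication factor of equation 46-83; e_r is in units of the A/D step."""
    if abs(e_r) == 0.5:
        raise ValueError("the factor is undefined at |Er| = 0.5")
    return e_r * (math.log(abs(e_r + 0.5)) - math.log(abs(e_r - 0.5)))


def multiplier_numeric(e_r, n=20_000):
    """Midpoint-rule average of Er / (Er + d) for d uniform on [-1/2, +1/2].

    Only meaningful for e_r > 0.5, where the denominator never reaches zero.
    """
    total = 0.0
    for i in range(n):
        d = -0.5 + (i + 0.5) / n
        total += e_r / (e_r + d)
    return total / n


if __name__ == "__main__":
    for e_r in (0.6, 1.0, 2.0, 10.0):
        print(f"Er = {e_r:5.2f}  closed form = {multiplier(e_r):.5f}  "
              f"numeric = {multiplier_numeric(e_r):.5f}")
```

The closed form and the numerical average should agree closely wherever both are defined, and the factor should fall toward unity as Er grows, in line with the behavior described for Figure 46-14.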

COMPUTED TRANSMITTANCE NOISE

Here again, our task is simplified by the two facts we have mentioned above: first, that we can reuse many of the results we obtained previously for the case of Normally distributed noise, and second, that the characteristics of uniformly distributed noise simplify the mathematical analysis. Our first step in this analysis starts with equation 44-71, which we derived previously in Chapter 44 (referenced as [5]) as a general description of noise behavior:

$$\mathrm{SD}(\Delta T) = \sqrt{\frac{1}{E_r^{2}} + \frac{E_s^{2}}{E_r^{4}}}\; \mathrm{SD}(\Delta E) \tag{44-71, from Chapter 44}$$

In our previous development, we presented a family of curves, corresponding to different values of SD(ΔE). In the case of uniformly distributed noise, which is of necessity contained within a limited range of values, the well-known fact that the standard deviation of the noise equals the range/√12 ([7], p. 146) helps us, in that it requires only one curve to display, rather than a family of curves. For this case, then, equation 44-71 becomes equation 46-84:

$$\mathrm{SD}(\Delta T) = \frac{1}{\sqrt{12}} \sqrt{\frac{1}{E_r^{2}} + \frac{E_s^{2}}{E_r^{4}}} \tag{46-84}$$

where the unit of measure for Es and Er is the digitization interval of the A/D converter. We forbear plotting this function since it is simply one of the family we have presented previously in Chapter 44, as Figures 44-1 and 44-3 (referenced in [5]). Similarly, in Chapter 44, we have previously derived the absorbance noise and relative absorbance noise, and presented those as equations 44-24 and 44-77, respectively.

$$\operatorname{Var}\left(\frac{\Delta A}{A}\right) = \left(\frac{1}{T \ln T}\right)^{2} \operatorname{Var}\left(\frac{\Delta E_s}{E_r + \Delta E_r}\right) + \left(\frac{1}{\ln T}\right)^{2} \operatorname{Var}\left(\frac{-\Delta E_r}{E_r + \Delta E_r}\right) \tag{44-77}$$
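Equation 46-84 lends itself to a quick numerical check. The sketch below compares it with a direct simulation of (Es + ΔEs)/(Er + ΔEr) in which both errors are uniform over one digitization interval; the function names and the example energies are assumptions, and agreement should only be expected when Er is many A/D steps, since equation 44-71 describes the behavior when the noise is small compared to the reference signal.

```python
import math
import random
import statistics


def sd_transmittance_uniform(e_s, e_r):
    """Equation 46-84: SD of the computed transmittance, energies in A/D steps."""
    return math.sqrt(1.0 / e_r ** 2 + e_s ** 2 / e_r ** 4) / math.sqrt(12.0)


def sd_transmittance_simulated(e_s, e_r, n=50_000, seed=0):
    """Monte Carlo SD of (Es + dEs) / (Er + dEr) with uniform errors of range 1."""
    rng = random.Random(seed)
    values = []
    for _ in range(n):
        d_es = rng.uniform(-0.5, 0.5)  # quantization error on the sample reading
        d_er = rng.uniform(-0.5, 0.5)  # quantization error on the reference reading
        values.append((e_s + d_es) / (e_r + d_er))
    return statistics.pstdev(values)


if __name__ == "__main__":
    e_s, e_r = 50.0, 100.0  # illustrative values: T = 0.5, reference = 100 steps
    print("equation 46-84 :", sd_transmittance_uniform(e_s, e_r))
    print("simulation     :", sd_transmittance_simulated(e_s, e_r))
```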

In order to evaluate equation 44-77 it is necessary to assume a distribution for the variability of Es and Er, and in the earlier chapter the distribution used was the Normal distribution; here, therefore, we want to evaluate this function for the case of a uniform distribution. We note here that much of the discussion in the earlier chapter concerning the evaluation of equation 44-77 applies now as well, so it behooves the reader to review the procedures used there, and also in Chapter 45, immediately preceding this one (first published as [6]), since we will apply those procedures again, with the difference that we will use a uniform distribution for the variability of the noise terms.

Figures 44-6 and 44-1 from Chapter 44 (referenced as [5]) are unchanged, since they do not depend on the distribution of the errors. The counterpart of Figure 45-9 (which appeared in Chapter 45 [6] and was calculated for Normally distributed noise) is Figure 46-15, which presents the results of calculating the variance of the two terms of equation 44-77 for uniformly distributed noise instead. We note that while these terms follow the same trends as for Normally distributed errors, the variances do not become appreciable until Er has fallen below 0.6, which corresponds to the point where values of the denominator occur close to or less than zero. For values of Er below 0.6 the values of both terms of equation 44-77 become very large and erratic.

[Figure 46-15: “Variances for uniformly distributed noise”. Abscissa: Er, from 0 to 9.9; ordinate: variance, from 0 to 2.0.]

Figure 46-15 Values of the variance of ΔEr/(Er − ΔEr) and ΔEs/(Er − ΔEr) for various values of Er, with a uniform distribution of values for the errors.

Following along the developments in Chapter 45, we find that the plot of ΔA/A depends on T, but the variance terms that depend on Er as the parameter are essentially independent of T. Therefore we expect that the plots of ΔA/A as a function of T will result in a family of curves similar to what we found in Figure 45-11, but different in the values of ΔA/A. However, Figure 45-11 shows only the net result of seeking the minimum of the function; it does not reveal the nature of the curves contributing to the erratic behavior of the minimum. Therefore, we now present a set of the curves for which the minimum can be found, in Figure 46-16. We see in Figure 46-16a that the behavior of the curve of ΔA/A is systematic when Er is large enough for the variance to remain small, while Figure 46-16b shows how the erratic behavior of the two standard deviation terms in equation 44-77 results in a set of curves that form a family, but an erratic family rather than a well-ordered and well-behaved one.

At this point we have completed our analysis of spectral noise for the case where the noise is constant (or at least independent of the signal level). Having completed this part of the analyses originally proposed in Chapter 40 (referenced as [1]), we will continue by doing a similar analysis for a more complicated case.
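The variance calculation behind Figure 46-15 can be sketched in the same way as for the Normal case, with the random values drawn from a uniform distribution of unit range instead. As before, this is an illustration under stated assumptions (independent ΔEs and ΔEr, a hypothetical function name, an arbitrary seed), not the original program.

```python
import random
import statistics


def term_variances_uniform(e_r, n=10_000, seed=0):
    """Variances of the two ratio terms of equation 44-77 for uniform noise.

    e_r is in units of the A/D step; dEs and dEr are assumed independent and
    uniformly distributed on [-1/2, +1/2].
    """
    rng = random.Random(seed)
    es_term, er_term = [], []
    for _ in range(n):
        d_es = rng.uniform(-0.5, 0.5)
        d_er = rng.uniform(-0.5, 0.5)
        denom = e_r + d_er  # stays at or above e_r - 0.5, so positive if e_r > 0.5
        es_term.append(d_es / denom)
        er_term.append(-d_er / denom)
    return statistics.pvariance(es_term), statistics.pvariance(er_term)


if __name__ == "__main__":
    for e_r in (0.6, 1.0, 2.0, 5.0, 10.0):
        var_es, var_er = term_variances_uniform(e_r)
        print(f"Er = {e_r:4.1f}  Var(Es term) = {var_es:.5f}  Var(Er term) = {var_er:.5f}")
```

Because the uniform error is range-limited, the denominator cannot reach zero while Er exceeds 0.5, so the estimates remain bounded there; this is the main practical difference from the Normal case, where arbitrarily small denominators can always occur.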


[Figure 46-16, panels (a) and (b). Panel (b) abscissa: T, from 0.001 to 0.973; ordinate: Δ(A)/A, from 0 to 50000. Tabulated values recovered from the figure (rows: T; columns: Er):]

T       Er = 0.1      Er = 0.2      Er = 0.3      Er = 0.4      Er = 0.5      Er = 0.6
0.001   1.07737E+08   2.68976E+03   9.07148E+02   4.86867E+02   2.99824E+02   1.96293E+02
0.002   3.32775E+07   8.30808E+02   2.80198E+02   1.50383E+02   9.26091E+01   6.06308E+01
0.003   1.69267E+07   4.22594E+02   1.42524E+02   7.64928E+01   4.71060E+01   3.08401E+01
0.004   1.05393E+07   2.63126E+02   8.87422E+01   4.76280E+01   2.93304E+01   1.92025E+01
0.005   7.32527E+06   1.82886E+02   6.16802E+01   3.31038E+01   2.03861E+01   1.33467E+01
0.006   5.45604E+06   1.36219E+02   4.59413E+01   2.46567E+01   1.51842E+01   9.94105E+00
0.007   4.26147E+06   1.06395E+02   3.58831E+01   1.92585E+01   1.18598E+01   7.76459E+00
0.008   3.44565E+06   8.60277E+01   2.90140E+01   1.55718E+01   9.58951E+00   6.27823E+00
0.009   2.86035E+06   7.14152E+01   2.40858E+01   1.29268E+01   7.96068E+00   5.21184E+00
0.010   2.42412E+06   6.05245E+01   2.04128E+01   1.09555E+01   6.74670E+00   4.41706E+00
0.011   2.08898E+06   5.21577E+01   1.75910E+01   9.44109E+00   5.81408E+00   3.80647E+00
0.012   1.82508E+06   4.55692E+01   1.53690E+01   8.24853E+00   5.07967E+00   3.32566E+00
0.013   1.61296E+06   4.02735E+01   1.35830E+01   7.28997E+00   4.48937E+00   2.93919E+00
0.014   1.43948E+06   3.59426E+01   1.21224E+01   6.50605E+00   4.00662E+00   2.62314E+00
0.015   1.29549E+06   3.23479E+01   1.09100E+01   5.85540E+00   3.60593E+00   2.36081E+00

Figure 46-16 The behavior of the family of curves of ΔA/A. Figure 46-16a shows the systematic behavior obtained when Er is greater than 0.2 (in this case 0.2 < Er < 1). Figure 46-16b shows the erratic behavior obtained when Er is less than 0.2, in this case 0.06 < Er < 0.2.

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 15(10), 24–25 (2000).
2. Mark, H. and Workman, J., Spectroscopy 15(11), 20–23 (2000).
3. Mark, H. and Workman, J., Spectroscopy 15(12), 14–17 (2000).
4. Mark, H. and Workman, J., Spectroscopy 16(2), 44–52 (2001).
5. Mark, H. and Workman, J., Spectroscopy 16(4), 34–37 (2001).
6. Mark, H. and Workman, J., Spectroscopy 16(5), 20–24 (2001).
7. Ingle, J. D. and Crouch, S. R., Spectrochemical Analysis (Prentice-Hall, Upper Saddle River, NJ, 1988).


47

Analysis of Noise: Part 8

This chapter further continues the set of chapters 40 through 46 first published as [1–7] dealing with the rigorous derivation of the expressions relating the effect of instrument (and other) noise to their effects to the spectra we observe. Our Chapter 40 was an overview; since then we have been analyzing the effect of noise on spectra by considering the case when the noise is constant detector noise, that is noise that is independent of the strength of the optical signal, which is the typical behavior of detectors for the IR and near-IR. As we do in each chapter in this section of the book we take this opportunity to note that we are dealing with a continuous set of chapters, and so we again continue our discussion by continuing our equation numbering, figure numbering, use of symbols and so on as though there was no break in the chapters. However, this chapter differs somewhat from the previous seven chapters in that, as we will see shortly, we will be performing parts of the same derivations all over again. Therefore, when we re-use previously derived equations, we will use the same equation numbers as we did for the original derivation. When we change course from the previous derivation, then we will number the equations starting with the next higher equation number from the last one we used (which we will note was equation 46-84 [7]). This procedure will also allow us to use some of our previous results to save time and space, allowing us to move along somewhat faster without sacrificing either rigor or detail. We left off in Chapter 46 by noting that we had just about exhausted the topic of the constant-noise (and by implication, a relatively “simple”) case (although not completely, in fact: there is still more to be said about the constant noise case, but that is for the future, right now it is time to move on), with the threat to begin discussion of a complicated case. 
Whether in fact it is more complicated than what we have been discussing remains to be seen; the question of whether something is “complicated” and “difficult” is partially subjective, since it depends on the perceptions of the person doing the evaluating. Something that is “difficult” for one may be “easy” for another because of a better background or more familiarity with the topic. Be that as it may, having decided to move on from the constant-detector-noise case, there remained the question of what to move on TO, that is which of the ten or so types of noise we originally brought up [1] should be tackled next. Tossing a mental coin, the decision was to analyze the case of noise proportional to the square root of the signal. This, as you will recall, is Poisson-distributed noise, characteristic of the noise encountered when the limiting noise source is the shot noise that occurs when individual photons are detected and represent the ultimate sensitivity of the measurement. This is a situation that is fairly commonly encountered, since it occurs, as mentioned previously, in UV-Vis instrumentation as well as in X-ray and gamma-ray measurements. This noise source may also enter into readings made in mass spectrometers, if the detection method includes counting individual ions. We have, in

286

Chemometrics in Spectroscopy

fact, discussed some general properties of this distribution quite a long time ago (see [8] or p. 175 in [9]). Now, we are not particular experts in X-ray and gamma-ray spectroscopy (nor mass spectroscopy, for that matter), but our understanding of those technologies is that they are used mainly in emission mode. Even when the exciting source is a continuum source, such as is found when an X-ray tube is used to produce the exciting X-rays for an X-ray Fluorescence (XRF) measurement, the measurement itself consists of counting the Xrays emitted from the sample after the sample absorbs an X-ray from the source. These measurements are themselves the equivalent of single-beam measurements and will thus also be Poisson-distributed in accordance with the basic physics of the phenomenon. The interesting parts occur when we calculate the transmittance (or reflectance) or absorbance of the sample under consideration, and therefore we must take a dual-beam measurement (or, at least the logically equivalent measurement of sample and reference readings) and compute the transmittance/reflectance or absorbance from those readings. Therefore, while the underlying physics results in the same form of noise characteristic in all those technologies, our results will be applicable mainly to UV-Vis measurements, where the quantity actually of interest is the amount of energy removed from the optical beam by absorption in the sample. Therefore, for the mathematical development we wish to pursue, we will again assume (as we did for the constant-noise case) that we are measuring transmittance through a clear (non-scattering) solution, and that Beer’s law applies. Examining Ingle and Crouch ([10], p. 152) we find the same situation as we found for constant detector noise: the computed noise of absorbance values does not take into account the effect of the noise of the reference reading. 
Hence, we can expect the results of our derivations to differ from the classic values for this situation, as they did for the constant-detector-noise case. We have recently found, however, that in a much more obscure part of the book [10], in Table 6-2, there are expressions for absorbance noise that include terms for the noise of both sample and reference beam readings. The expressions given there are very complicated, since they include the combined effect of several different noise sources. However, the main discussion in that book does not deal with the broader picture, and the full expression is relegated to an obscure part of the book with no pointer to it in the text, so that it is easily missed. We are therefore forced to treat Poisson noise as though it, too, had not been derived for the full situation, despite our finding it in that table. Indeed, the main discussion in Chapter 5 gives expressions and results that, as we shall see, conform to the expressions obtained when the reference noise is neglected. Also, we just received a last-minute bulletin: one of the authors of [10] has kindly pointed out a typographical error in Table 6-2, so that we might put the matter right. The T within the parentheses in the first expression for $\sigma_T$ should be squared; this will correct an otherwise erroneous result that might be derived from that expression (J.D. Ingle, 2001, personal communication). With this correction, the expression in Table 6-2 results in exactly the same expression we obtained in our own derivation for the constant-noise case [2]. We begin, as we did before, with the basic expression for the transmittance of a sample; since this is a repeat of previous equations we use the same numbers instead of starting with new numbering for the same equations:

$$T = \frac{E_s - E_{0s}}{E_r - E_{0r}} \tag{47-1}$$


and, with the addition of noise affecting the computation of T:

$$T + \Delta T = \frac{(E_s + \Delta E_s) - (E_{0s} + \Delta E_{0s})}{(E_r + \Delta E_r) - (E_{0r} + \Delta E_{0r})} \tag{47-2}$$

At this point we make a slight alteration to what we did previously. Strictly speaking we are being slightly premature here, but the gain in simplification of the equations more than compensates for the slight departure from complete rigor. Since the noise for the pure Poisson case is related to the signal, the noise at zero signal is zero; that is, $\Delta E_{0s}$ and $\Delta E_{0r}$ are both zero. Therefore, for this case $\Delta E_s = \Delta E'_s$ and $\Delta E_r = \Delta E'_r$. With this substitution, we can write equation 47-4 unchanged; however, we must keep in mind the difference in the meaning of these two terms ($\Delta E_s$ and $\Delta E_r$) compared to the meaning in the previous chapters. Hence,

$$T + \Delta T = \frac{E_s - E_{0s} + \Delta E_s}{E_r - E_{0r} + \Delta E_r} \tag{47-4}$$

From this point, up to and including equation 47-17, the derivation is identical to what we did previously. To save time, space, forests and our readers’ patience we forbear to repeat all that here and refer the interested reader to Chapter 41, referenced as [2], for the details of those intermediate steps. Here we present only equation 47-17, which serves as the starting point for the departure to work out the noise behavior for the case of Poisson-distributed detector noise:

$$\mathrm{Var}(\Delta T) = \left(\frac{1}{E_r}\right)^2 \mathrm{Var}(\Delta E_s) + \left(\frac{-T}{E_r}\right)^2 \mathrm{Var}(\Delta E_r) \tag{47-17}$$

This is the point at which we must depart from the previous work. At this point in the previous (constant-noise) case we noted that $\mathrm{SD}(\Delta E_s) = \mathrm{SD}(\Delta E_r)$, and therefore we set both of those quantities equal to $\mathrm{SD}(\Delta E)$. We cannot make this equivalency in this case, since the noise values (or, at least, the expected noise values) will in general NOT be equal except when $E_s = E_r$, that is, when the transmittance (or reflectance) of the sample is unity. Poisson-distributed noise, however, has an interesting characteristic: for Poisson-distributed noise, the expected standard deviation of the data is equal to the square root of the expected mean of the data ([11], p. 714), and therefore the variance of the data is equal (and note, that is equal, not merely proportional) to the mean of the data. Therefore we can replace $\mathrm{Var}(\Delta E_s)$ with $E_s$ in equation 47-17 and $\mathrm{Var}(\Delta E_r)$ with $E_r$:

$$\mathrm{Var}(\Delta T) = \left(\frac{1}{E_r}\right)^2 E_s + \left(\frac{-T}{E_r}\right)^2 E_r \tag{47-85}$$

The next transformation we are going to have to do in really tiny little baby steps, lest we be accused of doing something illegal to equation 47-85:

$$\mathrm{Var}(\Delta T) = \frac{E_s}{E_r^2} + \frac{E_r T^2}{E_r^2} \tag{47-86}$$

$$\mathrm{Var}(\Delta T) = \frac{T}{E_r} + \frac{T^2}{E_r} \tag{47-87}$$


And upon converting variance to standard deviation:

$$\mathrm{SD}(\Delta T) = \sqrt{\frac{T + T^2}{E_r}} \tag{47-88}$$
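As a rough numerical illustration of equation 47-88, the following sketch evaluates the Poisson-noise expression next to the constant-noise form that follows from equation 47-17 when both variances are set equal. The function names and the sample value of the reference reading are our own choices for illustration; they are assumptions, not anything prescribed by the derivation.

```python
import math


def sd_t_poisson(t, e_r):
    """Equation 47-88: SD of transmittance when Var(Es) = Es and Var(Er) = Er."""
    return math.sqrt((t + t * t) / e_r)


def sd_t_constant(t, sd_e, e_r):
    """Constant-noise form obtained from 47-17 with Var(Es) = Var(Er) = sd_e**2."""
    return (sd_e / e_r) * math.sqrt(1.0 + t * t)


if __name__ == "__main__":
    e_r = 10_000.0  # hypothetical reference reading, in counts
    for t in (0.0, 0.25, 0.5, 0.75, 1.0):
        # Multiplying by sqrt(Er) removes the scale factor and leaves the curve shape.
        shape = sd_t_poisson(t, e_r) * math.sqrt(e_r)
        print(f"T = {t:4.2f}   SD(T) = {sd_t_poisson(t, e_r):.5f}   shape = {shape:.4f}")
```

Quadrupling the reference reading halves the computed noise at any given T, which is the inverse-square-root scaling discussed in the text.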

Compare equation 47-87 for Poisson noise with equation 47-18, or equation 47-88 with equation 47-19, as we derived for constant detector noise [2]. Equation 47-88 has also been previously derived by Voigtman, it turns out [12], in the course of his simulation studies. We note that now, instead of varying over a relative range of 1 to $\sqrt{2}$, the noise will vary over a range of zero to $\sqrt{2}$ as the sample transmittance varies from zero to unity. What is even more interesting is that nowhere in equation 47-88 is there a term representing the S/N (or N/S) ratio, as we found in equation 47-19. This is because the noise level of a detector with Poisson-distributed noise is predetermined by the signal level, and was implicitly introduced when we substituted $E_s$ and $E_r$ for $\mathrm{Var}(\Delta E_s)$ and $\mathrm{Var}(\Delta E_r)$ in equation 47-85. Therefore the shape of the transmittance noise curve as a function of sample transmittance is constant (as it was for the case of constant noise). However, as equation 47-88 shows, the value of the noise is scaled by the reference signal, and varies inversely with the square root of the reference signal. We present the curve of $\mathrm{SD}(\Delta T)$ as a function of T in Figure 47-17. From Figure 47-17 we note several ways in which the behavior of the transmittance noise for the Poisson-distributed detector noise case differs from the behavior of the constant-noise case. First we note, as we did above, that at T = 0 the noise is zero, rather than unity. This justifies our earlier step of taking the zero-signal noise terms to be zero for both the sample and the reference readings. Second, we note that the curve is convex upward rather than concave upward. Third, we note that for values of T greater than roughly 0.25, the curve appears almost linear, at least to the eye.
This is a consequence of the fact that, at small values of T, the square of T inside the radical becomes negligible compared to T, causing the overall value of the curve to be roughly proportional to $\sqrt{T}$, while at large values of T, the square term dominates, causing the overall value of the curve to be roughly proportional to $\sqrt{T^2}$, or, in other words, roughly proportional to T.

Figure 47-17 Standard deviation of T as a function of T. (Plot title: Poisson-distributed transmittance noise. Vertical axis: relative noise. Horizontal axis: %T, running from 0 to 0.99.)

Another issue to bring up is the question of units. In the case of constant noise, as expressed by equation 47-19, T was dimensionless, being a ratio of two numbers ($E_s$ and $E_r$) with the same units, whatever those units might be, and the other term in equation 47-19, $\mathrm{SD}(\Delta E_r)/E_r$, is also a ratio of two numbers with the same units. In equation 47-88, on the other hand, T is still dimensionless, but $E_r$ is not dimensionless; since it is a measurement, it must have units. The question of the units of $E_r$ brings us to an important caveat concerning the interpretation of equation 47-88 and Figure 47-17. First, to answer the question of units, we recall that the Poisson distribution applies to measurements for X-ray, UV, and visible detectors, and the reason that distribution applies is that it is the distribution describing the behavior of the number of discrete events occurring in a given time interval; the actual data, then, is the number of counts occurring during the measurement time. The unit of $E_r$, then, is the absolute number of counts, and this brings us to our caveat. Equation 47-88 and Figure 47-17 are presented as describing a continuous series of values, and if $E_r$ is sufficiently large (large enough that a change of 1 count is small compared to the total number of counts), these equations and figures are a good approximation to a continuum. However, suppose $E_r$ is small. Let us pick a small number and see what happens: let us say $E_r$ is five. That means that the reference reading is five counts. Now it is immediately clear that we simply cannot have just any value of T along the X-axis of Figure 47-17. Since $E_s$ can take only integer values (0, 1, 2, 3, …), T can take only the discrete values 0, 0.2, 0.4, 0.6, 0.8, and unity (that is, multiples of $1/E_r$), since you cannot have a fraction of a count as data.
For those values of T, Figure 47-17 will provide an accurate measure of the expected value for $\mathrm{SD}(\Delta T)$, but not necessarily the actual value you will measure in any particular measurement. This is a result of the randomness inherent in the measurement and the discreteness of the measurement of $E_s$ as well as $E_r$. We discussed these issues a long time ago, when our series was still called “Statistics in Spectroscopy” rather than its current appellation of “Chemometrics in Spectroscopy”; we recommend that our readers go back and reread those columns, or the book into which they were collected [9], or any good book about elementary statistics. Another consequence of the behavior of the Poisson distribution is that for small values of $E_r$, the N/S ratio becomes large, to the point where values of T appreciably greater than unity may be measured. For example, if $E_r = 5$ as we presented just above, the standard deviation of $E_r$ can be calculated as $\mathrm{SD}(\Delta E_r) = \sqrt{5} \approx 2.23$. Given a ±2 standard deviation range, we can expect (truncating to the nearest integer) that values of $E_s$ (when T = 1) as high as 5 + 2 × 2.23 ≈ 5 + 4 = 9 counts will be observed, corresponding to a calculated value of T = 9/5 = 1.8. Furthermore, one of the steps taken during the omitted sequence between equation 47-4 and equation 47-17 was to neglect $\Delta E_r$ compared to $E_r$. Clearly this step is also valid only for large values of $E_r$, both for the case of constant detector noise and for the current case of Poisson-distributed detector noise. Therefore, from both of these considerations, it is clear that equation 47-88 and Figure 47-17 should be used only when $E_r$ is sufficiently large for the approximation to apply. Therefore our caveats. Equation 47-88 and Figure 47-17 are best reserved for cases of high signal, where the continuum approximation will be valid.
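The small-count example can be checked with a few lines of arithmetic. The sketch below is only an illustration under the assumptions stated in the text (integer counts, SD equal to the square root of the reading, a +2 SD excursion truncated to a whole count); the helper names are ours.

```python
import math


def allowed_transmittances(e_r):
    """Transmittance values Es/Er reachable when Es is a whole count from 0 to Er."""
    return [e_s / e_r for e_s in range(e_r + 1)]


def two_sigma_upper_count(e_r):
    """Largest whole count within +2 SD of a reading of e_r, taking SD = sqrt(e_r)."""
    return math.floor(e_r + 2.0 * math.sqrt(e_r))


if __name__ == "__main__":
    e_r = 5  # the five-count reference reading used in the text
    print(allowed_transmittances(e_r))      # steps of 1/Er
    top = two_sigma_upper_count(e_r)
    print(top, top / e_r)                   # highest expected count and the T it implies
```

With a reference reading of five counts the ratio can move only in steps of 1/5, and a sample reading two standard deviations high gives a computed transmittance well above unity.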


Now that we have completed our expository interlude, we continue our derivation along the same lines we did previously. The next step, as it was for the constant-noise case, is to derive the absorbance noise for Poisson-distributed detector noise as we previously did for constant detector noise. As we did above in the derivation of transmittance noise, we start by repeating the definition and the previously derived expressions for absorbance [3].

$$A = -\log(T) \tag{47-20a}$$

$$A = -0.4343 \ln(T) \tag{47-20b}$$

We take the derivative

$$dA = -0.4343\,\frac{dT}{T} \tag{47-21}$$

and substitute the expressions for T (47-6) and dT, replacing the differentials by finite differences, so that we can use the expression for $\Delta T$ found previously (J.D. Ingle, 2001, personal communication):

$$\Delta A = -0.4343\left[\frac{\dfrac{E_r\,\Delta E_s}{E_r(E_r + \Delta E_r)} - \dfrac{E_s\,\Delta E_r}{E_r(E_r + \Delta E_r)}}{\dfrac{E_s}{E_r}}\right] \tag{47-22}$$

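Again in the interests of saving time, space, and so on, we skip over the repetition of the intermediate steps between equation 47-22 and equation 47-29:

$$\mathrm{Var}(\Delta A) = \left(\frac{-0.4343}{E_s}\right)^2 \mathrm{Var}(\Delta E_s) + \left(\frac{0.4343}{E_r}\right)^2 \mathrm{Var}(\Delta E_r) \tag{47-29}$$

And again our departure from the derivation for the constant detector noise case is to note and use the fact that for Poisson-distributed noise, $\mathrm{Var}(\Delta E_r) = E_r$ and $\mathrm{Var}(\Delta E_s) = E_s$:

$$\mathrm{Var}(\Delta A) = \left(\frac{-0.4343}{E_s}\right)^2 E_s + \left(\frac{0.4343}{E_r}\right)^2 E_r \tag{47-89}$$

And simplifying as we did above:

$$\mathrm{Var}(\Delta A) = \frac{0.4343^2\,E_s}{E_s^2} + \frac{0.4343^2\,E_r}{E_r^2} \tag{47-90}$$

$$\mathrm{Var}(\Delta A) = \frac{0.4343^2}{E_s} + \frac{0.4343^2}{E_r} \tag{47-91}$$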


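and since $T = E_s/E_r$, we solve for $E_s = T E_r$ and substitute this into equation 47-91:

$$\mathrm{Var}(\Delta A) = \frac{0.4343^2}{T E_r} + \frac{0.4343^2}{E_r} \tag{47-92}$$

and factor out $0.4343^2/E_r$:

$$\mathrm{Var}(\Delta A) = \frac{0.4343^2}{E_r}\left(\frac{1}{T} + 1\right) \tag{47-93}$$

and upon taking square roots:

$$\mathrm{SD}(\Delta A) = \frac{0.4343}{\sqrt{E_r}}\sqrt{\frac{1}{T} + 1} \tag{47-94 – for Poisson noise}$$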

Again we can compare the expression in equation 47-94 with the equivalent expression for the constant detector noise case, which starts with equation 42-32, also equation 47-32 [3].

$$\mathrm{SD}(\Delta A) = 0.4343\,\mathrm{SD}(\Delta E)\sqrt{\frac{1}{E_r^2} + \frac{1}{E_s^2}} \tag{47-32 – for constant noise}$$

It is instructive to put equation 47-32 into a form similar to that of equation 47-94 (for Poisson noise) by replacing $E_s$ with $T E_r$:

$$\mathrm{SD}(\Delta A) = 0.4343\,\mathrm{SD}(\Delta E)\sqrt{\frac{1}{T^2 E_r^2} + \frac{1}{E_r^2}} \tag{47-95 – for constant noise}$$

$$\mathrm{SD}(\Delta A) = 0.4343\,\frac{\mathrm{SD}(\Delta E)}{E_r}\sqrt{\frac{1}{T^2} + 1} \tag{47-96 – for constant noise}$$

Thus, in the constant-noise case the absorbance noise is again proportional to the N/S ratio, although this is clearer now than it was in the earlier chapter; there, however, we were interested in making a different comparison. The comparison of interest here, of course, is the way the noise varies as T varies, which is immediately seen by comparing the expressions in the radicals in equations 47-94 (for Poisson noise) and 47-96 (for constant noise). Also, as equation 47-94 shows, the absorbance noise is again inversely proportional to the square root of the reference signal, as was the transmittance noise. And once again we remind our readers of the caveats under which equation 47-94 is valid. We present the variation of absorbance noise for the two cases (equations 47-94 and 47-96, corresponding to the Poisson-noise and constant-noise cases) in Figure 47-18. While both curves diverge to infinity as the transmittance → 0 (and the absorbance → ∞), the curve for constant detector noise clearly does so more rapidly, at all transmittance levels. Again, we continue our derivations in our next chapter.
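The comparison of the two radicals can be made concrete with a short sketch. It evaluates equations 47-94 and 47-96 and, separately, just the T-dependent radicals, which is what the comparison of shapes depends on. The function names are ours and the chosen transmittance values are arbitrary illustrations.

```python
import math


def sd_a_poisson(t, e_r):
    """Equation 47-94: absorbance noise for Poisson-distributed detector noise."""
    return 0.4343 / math.sqrt(e_r) * math.sqrt(1.0 / t + 1.0)


def sd_a_constant(t, sd_e, e_r):
    """Equation 47-96: absorbance noise for constant detector noise."""
    return 0.4343 * (sd_e / e_r) * math.sqrt(1.0 / (t * t) + 1.0)


def radical_poisson(t):
    """T-dependent factor of equation 47-94."""
    return math.sqrt(1.0 / t + 1.0)


def radical_constant(t):
    """T-dependent factor of equation 47-96."""
    return math.sqrt(1.0 / (t * t) + 1.0)


if __name__ == "__main__":
    for t in (1.0, 0.5, 0.25, 0.1):
        print(f"T = {t:4.2f}   Poisson radical = {radical_poisson(t):7.3f}   "
              f"constant-noise radical = {radical_constant(t):7.3f}")
```

The two radicals coincide at T = 1 and separate as T falls, with the constant-noise factor growing roughly as 1/T and the Poisson factor only as the square root of 1/T.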

Figure 47-18 Comparison between absorbance noise for the constant-detector noise case and the Poisson-distributed detector noise case. Note that we present the curves only down to T = 0.1, since they both asymptotically → ∞ as T → 0, as per equations 47-94 and 47-96. (Plot title: Absorbance noise. Vertical axis: relative absorbance noise. Horizontal axis: %T, running from 1 down to 0. Curves: constant noise, Poisson noise.)

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 15(10), 24–25 (2000).
2. Mark, H. and Workman, J., Spectroscopy 15(11), 20–23 (2000).
3. Mark, H. and Workman, J., Spectroscopy 15(12), 14–17 (2000).
4. Mark, H. and Workman, J., Spectroscopy 16(2), 44–52 (2001).
5. Mark, H. and Workman, J., Spectroscopy 16(4), 34–37 (2001).
6. Mark, H. and Workman, J., Spectroscopy 16(5), 20–24 (2001).
7. Mark, H. and Workman, J., Spectroscopy 16(7), 36–40 (2001).
8. Mark, H. and Workman, J., Spectroscopy 5(3), 55–56 (1990).
9. Mark, H. and Workman, J., Statistics in Spectroscopy, 1st ed. (Academic Press, New York, 1991).
10. Ingle, J.D. and Crouch, S.R., Spectrochemical Analysis (Prentice-Hall, Upper Saddle River, NJ, 1988).
11. Hald, A., Statistical Theory with Engineering Applications (John Wiley & Sons, Inc., New York, 1952).
12. Voigtman, E., Analytical Instrumentation 21(1&2), 43–62 (1993).

48

Analysis of Noise: Part 9

We keep learning more about the history of noise calculations. It seems that the topic of the noise of a spectrum in the constant-detector-noise case was addressed more than 50 years ago [1]. Not only that, but it was done while taking into account the noise of the reference readings. The calculation of the optimum absorbance value was performed using several different criteria for “optimum”. One of these criteria, which Cole called the Probable Error Method, gives the same results that we obtained for the optimum transmittance value of 32.99%T [2]. Cole’s approach, however, had several limitations. The main one, from our point of view, is the fact that he directed his equations to represent the absorbance noise as soon as possible in his derivation. Thus his derivation, as well as virtually all the ones since then, bypassed consideration of the behavior of noise of transmittance spectra. This, coupled with the fact that the only place we have found that presented an expression for transmittance noise had a typographical error as we reported in our previous column [3], means that as far as we know, the correct expression for the behavior of transmittance noise has still never been previously reported in the literature. On the other hand, we do have to draw back a bit and admit that the correct expression for the optimum transmittance has been reported. Not only that, but Cole points out and laments that, at that time, other scientists were already using the incorrect formulas for noise behavior. That means that the same situation that exists now, existed over 50 years ago, and in all the intervening time has not been corrected. This, perhaps, explains why the incorrect theory is still being used today. We can only hope that our efforts are more successful in persuading both the practitioners and teachers of spectroscopic theory to use the more exact formulations we have developed. 
Getting back to the current state of the columns, this column is one more in the set [2–9] dealing with the rigorous derivation of the expressions relating the effect of instrument (and other) noise to their effects to the spectra we observe. The impetus for this was the realization that the previously existing theory was deficient in that the derivations extant ignored the effect of noise in the reference reading, which turns out to have appreciable effects on the nature of the derived noise behavior. Our first chapter in this set [4] was an overview; the next six examined the effects of noise when the noise was due to constant detector noise, and the last one on the list is the first of the chapters dealing with the effects of noise when the noise is due to detectors, such as photomultipliers, that are shot-noise-limited, so that the detector noise is Poisson-distributed and therefore the standard deviation of the noise equals the square root of the signal level. We continue along this line in the same manner we did previously: by finding the proper expression to describe the relative error of the absorbance, which by virtue of Beer’s law also describes the relative error of the concentration as determined by the spectrometric readings, and from that determine the


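value of transmittance a sample should have in order to optimize the analysis, in the sense that the relative error of the concentration is minimized. As we do in each chapter in this section of the book, we take this opportunity to note that we are dealing with a continuous set of chapters, and so we again continue our discussion by continuing our equation numbering, figure numbering, use of symbols, and so on as though there were no break, except that when we repeat an equation or series of equations that were derived and presented previously, we retain the original number(s) for those equation(s). So let us continue. We now wish to generate the expression for the relative error of the absorbance, $\Delta A/A$, which we again obtain by using the expression in equation 48-25

$$\Delta A = \frac{-0.4343\,(E_r\,\Delta E_s - E_s\,\Delta E_r)}{E_s E_r} \tag{48-25}$$

for $\Delta A$, and the expression in equation 42-20b, $A = -0.4343 \ln(T)$, for A. This results in the same expression we obtained previously, which we present, as usual, without repeating all the intermediate steps:

$$\frac{\Delta A}{A} = \frac{1}{\ln T}\left(\frac{\Delta E_s}{E_s} - \frac{\Delta E_r}{E_r}\right) \tag{48-36}$$

We again go through the usual sequence of steps needed to pass to the statistical domain, which we do in detail here since, looking back, we find that we had neglected to present them previously due to somewhat of a feeling of being rushed. First we take the variance of both sides of equation 48-36:

$$\mathrm{Var}\left(\frac{\Delta A}{A}\right) = \mathrm{Var}\left[\frac{1}{\ln T}\left(\frac{\Delta E_s}{E_s} - \frac{\Delta E_r}{E_r}\right)\right] \tag{48-97}$$

$$\mathrm{Var}\left(\frac{\Delta A}{A}\right) = \mathrm{Var}\left[\frac{1}{\ln T}\,\frac{\Delta E_s}{E_s} - \frac{1}{\ln T}\,\frac{\Delta E_r}{E_r}\right] \tag{48-98}$$

Then we apply the theorem that Var(A + B) = Var(A) + Var(B) (which holds when A and B are uncorrelated, as is implicitly assumed here for the sample and reference noise):

$$\mathrm{Var}\left(\frac{\Delta A}{A}\right) = \mathrm{Var}\left[\frac{1}{\ln T}\,\frac{\Delta E_s}{E_s}\right] + \mathrm{Var}\left[\frac{-1}{\ln T}\,\frac{\Delta E_r}{E_r}\right] \tag{48-99}$$

And then we apply the theorem that, if a is a constant, then $\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)$:

$$\mathrm{Var}\left(\frac{\Delta A}{A}\right) = \frac{1}{(E_s \ln T)^2}\,\mathrm{Var}(\Delta E_s) + \frac{1}{(E_r \ln T)^2}\,\mathrm{Var}(\Delta E_r) \tag{48-100}$$

Again we use the property of the Poisson distribution that the variance of a value is equal to the value, so that $\mathrm{Var}(\Delta E_s) = E_s$ and $\mathrm{Var}(\Delta E_r) = E_r$:

$$\mathrm{Var}\left(\frac{\Delta A}{A}\right) = \frac{E_s}{(E_s \ln T)^2} + \frac{E_r}{(E_r \ln T)^2} \tag{48-101}$$

$$\mathrm{Var}\left(\frac{\Delta A}{A}\right) = \frac{1}{(\ln T)^2}\left(\frac{1}{E_s} + \frac{1}{E_r}\right) \tag{48-102}$$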


and finally:

$$\mathrm{SD}\left(\frac{\Delta A}{A}\right) = \frac{1}{\ln T}\sqrt{\frac{1}{E_s} + \frac{1}{E_r}} \tag{48-103}$$

Interestingly, in Voigtman’s development of these equations, his expression corresponding to equation 48-103 is missing the $1/E_r$ term inside the radical, even though he arrived at the correct equation corresponding to equation 47-88, as we noted in Chapter 47 (referenced as the paper [3]). There are now two ways to proceed with equation 48-103. One way is to replace T in the denominator with $E_s/E_r$, which makes it easier to compare with equation 42-37, the corresponding equation describing the constant-noise case. Alternatively, we can replace $E_s$ in the denominator of equation 48-103 with $T E_r$, which is more convenient for plotting the expression. Since we wish to explore both phenomena, we will do both transformations of equation 48-103. First we replace T in the denominator with $E_s/E_r$:

$$\mathrm{SD}\left(\frac{\Delta A}{A}\right) = \frac{1}{\ln(E_s/E_r)}\sqrt{\frac{E_r}{E_s E_r} + \frac{E_s}{E_s E_r}} \tag{48-104}$$

$$\mathrm{SD}\left(\frac{\Delta A}{A}\right) = \frac{1}{\ln(E_s/E_r)}\sqrt{\frac{E_s + E_r}{E_s E_r}} \tag{48-105}$$

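Equation 48-105 is the closest we can come to the form of equation 42-37, so that we can compare the functions describing the relative precision for the constant-noise case to that of the Poisson-noise case. To put equation 48-103 into a form easier to plot, we now replace $E_s$ in the denominator of equation 48-103 with $T E_r$:

$$\mathrm{SD}\left(\frac{\Delta A}{A}\right) = \frac{1}{\ln T}\sqrt{\frac{1}{T E_r} + \frac{1}{E_r}} \tag{48-106}$$

$$\mathrm{SD}\left(\frac{\Delta A}{A}\right) = \frac{1}{\ln T}\sqrt{\frac{1}{E_r}\left(\frac{1}{T} + 1\right)} \tag{48-107}$$

$$\mathrm{SD}\left(\frac{\Delta A}{A}\right) = \frac{1}{\sqrt{E_r}\,\ln T}\sqrt{\frac{1}{T} + 1} \tag{48-108}$$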

Qualitatively we can note that equation 48-108 also passes through a minimum, since it will diverge as T → 0 (in the denominator of the radical) and also as T → 1, which causes ln T → 0. Again, we see that the actual value of the relative error is scaled inversely with the square root of the reference reading, as it was for both transmittance and absorbance noise. We verify the behavior of equation 48-108 by plotting $\frac{1}{\ln T}\sqrt{\frac{1}{T} + 1}$ versus T in Figure 48-19 (actually, we plot $\left|\frac{1}{\ln T}\sqrt{\frac{1}{T} + 1}\right|$, for reasons that will be discussed below).

Figure 48-19 Relative absorbance precision for Poisson-distributed detector noise. (Vertical axis: SD(Δ(A))/A. Horizontal axis: %T, running from 0 to 0.53.)

Unsurprisingly, the optimum transmittance (roughly T = 0.11 from the data table used to plot Figure 48-19) differs appreciably from what was found for the corresponding situation when the detector noise was constant. The more interesting and important question, however, is how the value we arrived at compares with the “optimum” obtained from the previously derived expression, which neglected the effect of the noise in the reference reading. To continue, therefore, we proceed in the usual manner for finding a minimum: we take the derivative of equation 48-108 and then set the derivative equal to zero. Since equation 48-108 is complicated, and the derivative more so, we will generate the derivative in several steps:

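$$\frac{d}{dT}\,\mathrm{SD}\left(\frac{\Delta A}{A}\right) = \frac{1}{\sqrt{E_r}\,\ln T}\,\frac{d}{dT}\sqrt{\frac{1}{T} + 1} + \sqrt{\frac{1}{T} + 1}\;\frac{d}{dT}\left(\frac{1}{\sqrt{E_r}\,\ln T}\right) \tag{48-109}$$

$$\frac{d}{dT}\,\mathrm{SD}\left(\frac{\Delta A}{A}\right) = \frac{1}{\sqrt{E_r}\,\ln T}\cdot\frac{1}{2}\cdot\frac{1}{\sqrt{\frac{1}{T} + 1}}\;\frac{d}{dT}\left(\frac{1}{T} + 1\right) + \sqrt{\frac{1}{T} + 1}\cdot\frac{1}{\sqrt{E_r}}\;\frac{d}{dT}\left(\frac{1}{\ln T}\right) \tag{48-110}$$

$$\frac{d}{dT}\,\mathrm{SD}\left(\frac{\Delta A}{A}\right) = \frac{1}{2\sqrt{E_r}\,\ln T\,\sqrt{\frac{1}{T} + 1}}\;\frac{d}{dT}\left(\frac{1}{T} + 1\right) + \sqrt{\frac{1}{T} + 1}\cdot\frac{-1}{\sqrt{E_r}\,(\ln T)^2}\;\frac{d}{dT}(\ln T) \tag{48-111}$$

$$\frac{d}{dT}\,\mathrm{SD}\left(\frac{\Delta A}{A}\right) = \frac{-1}{2\sqrt{E_r}\,T^2\,\ln T\,\sqrt{\frac{1}{T} + 1}} + \frac{-\sqrt{\frac{1}{T} + 1}}{\sqrt{E_r}\,(\ln T)^2}\cdot\frac{1}{T} \tag{48-112}$$

$$\frac{d}{dT}\,\mathrm{SD}\left(\frac{\Delta A}{A}\right) = \frac{-1}{2T^2\sqrt{E_r}\,\ln T\,\sqrt{\frac{1}{T} + 1}} + \frac{-\sqrt{\frac{1}{T} + 1}}{T\sqrt{E_r}\,(\ln T)^2} \tag{48-113}$$

It will help our cause to factor out from equation 48-113 what we can

$$\frac{d}{dT}\,\mathrm{SD}\left(\frac{\Delta A}{A}\right) = \frac{1}{T\sqrt{E_r}\,\ln T}\left[\frac{-1}{2T\sqrt{\frac{1}{T} + 1}} + \frac{-\sqrt{\frac{1}{T} + 1}}{\ln T}\right] \tag{48-114}$$

and then combine the terms:

$$\frac{d}{dT}\,\mathrm{SD}\left(\frac{\Delta A}{A}\right) = \frac{1}{T\sqrt{E_r}\,\ln T}\left[\frac{-\ln T}{2T\,\ln T\,\sqrt{\frac{1}{T} + 1}} + \frac{-2T\left(\frac{1}{T} + 1\right)}{2T\,\ln T\,\sqrt{\frac{1}{T} + 1}}\right] \tag{48-115}$$

$$\frac{d}{dT}\,\mathrm{SD}\left(\frac{\Delta A}{A}\right) = \frac{1}{T\sqrt{E_r}\,\ln T}\left[\frac{-\ln T - 2T\left(\frac{1}{T} + 1\right)}{2T\,\ln T\,\sqrt{\frac{1}{T} + 1}}\right] \tag{48-116}$$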

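Now we can set the derivative equal to zero:

$$0 = \frac{1}{T\sqrt{E_r}\,\ln T}\left[\frac{-\ln T - 2T\left(\frac{1}{T} + 1\right)}{2T\,\ln T\,\sqrt{\frac{1}{T} + 1}}\right] \tag{48-117}$$

and simplify the expression:

$$0 = -\ln T - 2T\left(\frac{1}{T} + 1\right) \tag{48-118}$$

$$0 = \ln T + 2T + 2 \tag{48-119}$$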

Equation 48-119 is a much simpler equation than most of the ones we have had to deal with before, including equation 42-50 (which is the corresponding equation for the constant-detector-noise case [2]); nevertheless, it is still a transcendental equation and is best solved by successive approximations. The solution to 5 decimal places is 0.10886, or 10.886%T. The solution given by Ingle and Crouch for this case, which again does not take into account the variation of the reference channel, is 13.5%T ([10], p. 153).
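One way to carry out the successive approximations is Newton's method, sketched below. The choice of Newton's method, the starting guess, and the tolerance are our own assumptions; any root finder would serve. For comparison the sketch also evaluates $e^{-2}$, which is the optimum one obtains by minimizing the same expression with the reference-noise term dropped from the radical, and which is consistent with the 13.5%T figure quoted above.

```python
import math


def optimum_transmittance(t0=0.1, tol=1e-12, max_iter=50):
    """Solve ln(T) + 2T + 2 = 0 (equation 48-119) by Newton's method."""
    t = t0
    for _ in range(max_iter):
        f = math.log(t) + 2.0 * t + 2.0
        step = f / (1.0 / t + 2.0)   # derivative of f is 1/T + 2
        t -= step
        if abs(step) < tol:
            return t
    raise RuntimeError("Newton iteration did not converge")


def relative_error_shape(t):
    """Absolute value of the T-dependent part of equation 48-108."""
    return abs(1.0 / math.log(t)) * math.sqrt(1.0 / t + 1.0)


if __name__ == "__main__":
    t_opt = optimum_transmittance()
    print(f"optimum T with reference noise included: {t_opt:.5f}")
    print(f"optimum T with reference noise neglected: {math.exp(-2.0):.5f}")
    # Cross-check: the curve is higher on either side of the computed optimum.
    print(relative_error_shape(t_opt - 0.02),
          relative_error_shape(t_opt),
          relative_error_shape(t_opt + 0.02))
```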


We therefore see that in this case also, neglecting the reference channel error causes a noticeable change in the answer from the correct one. To finish up this chapter, we discuss the use of $\left|\frac{1}{\ln T}\sqrt{\frac{1}{T} + 1}\right|$ as the expression we plotted in Figure 48-19. In passing from equation 48-102 to 48-103, we took the usual and intuitive step of using the positive square root of the expression in equation 48-102, which seems reasonable, since we are working with variances, which must always be positive, and standard deviations, which we also want to have positive values. However, when we come to plot the expression in equation 48-108, we find that since T is always less than unity, ln T is negative, and therefore the entire expression is negative. Thus, plotting this expression directly results in the curve having a maximum rather than a minimum at the point where the derivative is zero. Since this does not conform to reality, where we obtain the best precision rather than the worst, it is clear that this is an artifact of our choice of sign for the square root; the way we obtain a unique answer, and one that is in conformance with the real world, is to use the absolute value of the expression. Again, we continue our derivations in our next chapter.

REFERENCES

1. Cole, R., Journal of the Optical Society of America 41, 38–40 (1951).
2. Mark, H. and Workman, J., Spectroscopy 15(12), 14–17 (2000).
3. Mark, H. and Workman, J., Spectroscopy 16(11), 36–40 (2001).
4. Mark, H. and Workman, J., Spectroscopy 15(10), 24–25 (2000).
5. Mark, H. and Workman, J., Spectroscopy 15(11), 20–23 (2000).
6. Mark, H. and Workman, J., Spectroscopy 16(2), 44–52 (2001).
7. Mark, H. and Workman, J., Spectroscopy 16(4), 34–37 (2001).
8. Mark, H. and Workman, J., Spectroscopy 16(5), 20–24 (2001).
9. Mark, H. and Workman, J., Spectroscopy 16(7), 36–40 (2001).
10. Ingle, J.D. and Crouch, S.R., Spectrochemical Analysis (Prentice-Hall, Upper Saddle River, NJ, 1988).

49 Analysis of Noise: Part 10

This chapter is one more in the set of chapters starting at Chapter 40 and first published as [1–9], dealing with the rigorous derivation of the expressions relating instrument (and other) noise to its effects on the spectra we observe. The impetus for this was the realization that the previously existing theory was deficient in that the derivations extant ignored the effect of noise in the reference reading, which turns out to have appreciable effects on the nature of the derived noise behavior. Chapter 40 in this set, referenced as [1], was an overview; Chapters 41–46 examined the effects of noise when the noise was due to constant detector noise (e.g., IR/NIR spectroscopy), and the last two chapters (47 and 48) began by considering the effects of noise when the noise is due to detectors, such as photomultipliers, that are shot-noise-limited, so that the detector noise is Poisson-distributed and therefore the standard deviation of the noise equals the square root of the signal level. The path we are taking pretty well follows the one we used for the constant-detector-noise case, and those two chapters derived the effects when the noise is small compared to the measured signal. Since we wish to continue following that same path, we now need to consider what happens when the optical signal falls to the point where the noise becomes an appreciable fraction of the measured signal, and the effects of the noise, such as induced nonlinearities, can no longer be neglected. And as we do in each chapter in this section of the book, we once more take this opportunity to remind our readers that we are dealing with a continuous series of chapters, and so we again continue our discussion by continuing our equation numbering, figure numbering, use of symbols, and so on as though there were no break, except that when we reuse an equation or series of equations that were derived and presented previously, we retain the original number(s) for those equation(s).
So let us continue. In Chapter 43 [4], to which the interested reader may wish to return for a refresher, we discussed in general terms how and why the equations came about, and we noted that the point of departure for investigating what happens when the noise level becomes large enough that it can no longer be ignored was equation 49-5:

$$T + \Delta T = \frac{E_s + \Delta E_s}{E_r + \Delta E_r} \tag{49-5}$$

and we noted that in that case, that of Normally distributed noise, the expected computed value of T was

$$T = \frac{E_s}{E_r + \Delta E_r} \tag{49-52a}$$


the reason being, as we pointed out, that the other term that arose, $\Delta E_s/(E_r + \Delta E_r)$, would vanish from the expression for the expected value of T because of symmetry. In the current case, however, we cannot rely on that argument. The Poisson distribution is not symmetric around any particular value, as we will observe shortly when we present a graph of the members of the family of Poisson distributions, despite the fact that this distribution approaches the Normal distribution in the limit as the parameter λ → ∞. However, in addition to the fact that the distribution never becomes exactly Normal, our interest in this chapter is specifically to examine the effects occurring at small values of λ. Hence, in this case we must work with equation 49-5, rather than the simpler equation 49-52a:

$$T + \Delta T = \frac{E_s + \Delta E_s}{E_r + \Delta E_r} \tag{49-5}$$

We next noted that the expected value of T is computed from the general equation for an expected value:

$$\bar{T}_W = \frac{\sum_i W_i\,F(X_i)}{\sum_i W_i} \tag{49-59}$$

F(X), here, is $(E_s + \Delta E_s)/(E_r + \Delta E_r)$, as we just noted. In the previous case, the weighting function was the Normal distribution. Our interest in the current development is to find out what happens when the noise is Poisson-distributed rather than Normally distributed, since that is the distribution that applies to data whose noise is shot-noise-limited; the Poisson distribution is therefore the one we need to use for the weighting factor. Using P to represent the Poisson distribution, equation 49-59 now becomes

$$\bar{X}_{WP} = \frac{\sum_i P(X_i)\,F(X_i)}{\sum_i P(X_i)} \tag{49-120}$$

and since probability distributions have integrals that always equal unity (reflecting the reality that the argument must have SOME value every time it is evaluated, so that it is certain that some value will be obtained over the entire range of summation; certainty of obtaining the value a means that P(a) = 1), the denominator of equation 49-120 is unity and drops out of the expression, and equation 49-120 reduces to

$$\bar{X}_{WP} = \sum_i P(X_i)\,F(X_i) \tag{49-121}$$

The Poisson distribution is actually a special case of the binomial distribution, a fact that is only of mild peripheral interest here, as we will not be using that fact. The formula for the Poisson distribution is

$$P(X) = \frac{e^{-\lambda}\,\lambda^{X}}{X!} \tag{49-122}$$
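The properties of equation 49-122 that the argument relies on (probabilities summing to unity, variance equal to the mean, and a lack of symmetry) can be checked numerically. The sketch below is illustrative only: the function names, the truncation of the sum at 200 counts, and the use of logarithms to avoid overflow in the factorial are our own assumptions.

```python
import math


def poisson_pmf(x, lam):
    """Equation 49-122, P(X) = exp(-lam) * lam**x / x!, evaluated via logarithms."""
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))


def poisson_moments(lam, x_max=200):
    """Return (sum of P, mean, variance, third central moment) over 0..x_max."""
    xs = range(x_max + 1)
    p = [poisson_pmf(x, lam) for x in xs]
    total = sum(p)
    mean = sum(x * px for x, px in zip(xs, p))
    var = sum((x - mean) ** 2 * px for x, px in zip(xs, p))
    third = sum((x - mean) ** 3 * px for x, px in zip(xs, p))
    return total, mean, var, third


if __name__ == "__main__":
    for lam in (0.2, 2.0, 11.0):
        total, mean, var, third = poisson_moments(lam)
        print(f"lambda = {lam:5.1f}  sum = {total:.6f}  mean = {mean:.6f}  "
              f"variance = {var:.6f}  third central moment = {third:.6f}")
```

A symmetric distribution would have a third central moment of zero; here it comes out equal to λ, so the asymmetry persists at every λ even though it becomes small relative to the spread as λ grows.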


In our terminology, the parameter λ corresponds to $E_r$ or $E_s$, the (fixed) value of the energy to be measured, and X corresponds to the value of $E_r$ or $E_s$ actually observed, as appropriate. Therefore, for the reference reading, equation 49-122 becomes

$$P(X) = \frac{e^{-E_r}\,E_r^{X}}{X!} \tag{49-123}$$

Figure 49-20 presents the Poisson distribution; Figure 49-20a shows the distribution for integer values of λ up to λ = 11, and Figure 49-20b shows this distribution for fractional values of λ up to λ = 2.

Figure 49-20 Poisson distribution for several values of λ. Figure 49-20b is an expansion of Figure 49-20a, for values of λ between 0.2 and 2. (see Color Plate 16) [Plots of P(X) versus X: (a) 1 ≤ λ ≤ 11; (b) 0 < λ ≤ 2.]

Now, one point in which the Poisson distribution differs from the Normal distribution is the presence of the parameter λ. As we show in Figure 49-20, different values of λ give rise to different curves. While they all share the property that their integral is unity, they also differ in several respects. The characteristic that we draw attention to first is that the curves have different shapes. This is a key difference from the Normal distribution; as we will recall, when we integrated equation 49-58, where the weighting factor was the Normal distribution, the resulting family of curves had similar shapes and differed only in their expansion along the abscissa, which then allowed describing their behavior as the same basic curve, but scaled by the standard deviation of the underlying distribution. In the case of the Poisson distribution we would not expect that to happen, since the different curves in the family of the Poisson distribution have different shapes to start with. This might be expected to give rise to a double family of curves, corresponding to different values of standard deviation and different values of λ. However, this is obviated by the fact that for the Poisson distribution the standard deviation is "locked" to the underlying value of λ and cannot vary independently. We also note that, as we see in Figure 49-20, to the eye the Poisson distribution resembles the Normal distribution very closely at large values of λ, so the differences in the integral may not be easily seen by the eye, either, except at the very lowest values of λ.

There are some other characteristics of the Poisson distribution that differ from the Normal distribution in ways that are of importance to us here. The chief one is that the Poisson distribution does not admit of negative values.
This makes intuitive sense; since the Poisson distribution is a distribution that results from a counting operation, the smallest number that you can achieve when counting objects is zero. We will be using this fact during the course of our derivations. Another point to be made is that a value of zero is indeed a legitimate value for X. This comes from the generation of the distribution as the result of a counting operation: when counting photons in X-ray analysis, for example, if the average count in any given time interval is a small number, less than five, say, then there is a reasonable probability that in some of those time intervals there will in fact be no counts at all. Lambda (λ), however, is not restricted to integer values. Since λ represents the mean value of the data, and in fact is equal to both the mean and the variance of the distribution, there is no reason this mean value has to be restricted to integer values, even though the data itself is. We have already used this property of the Poisson distribution in plotting the curves in Figure 49-20b. To start our current derivation, we substitute the appropriate expressions for P(X) and F(X) into equation 49-121, and letting Es = TEr we obtain the following:

$$X_{WP} = \sum_{\Delta E_r} \left( \frac{e^{-E_r} E_r^{\Delta E_r}}{\Delta E_r!} \right) \left( \frac{T E_r + \Delta E_s}{E_r + \Delta E_r} \right) \tag{49-124}$$

However, equation 49-124 is incomplete, the cause of the incompleteness being the presence of ΔEs in the formula. As mentioned above, ΔEs is also a random variable, is independent of ΔEr, and we do not expect its effect to cancel as it did with the Normal distribution. Therefore we must also compute the weighted sum over the (also Poisson-distributed) values of ΔEs, which, added to the expression for the first term of equation 49-124, gives

$$X_{WP} = \sum_{\Delta E_r} \left( \frac{e^{-E_r} E_r^{\Delta E_r}}{\Delta E_r!} \right) \left( \frac{T E_r}{E_r + \Delta E_r} \right) + \sum_{\Delta E_r} \left( \frac{e^{-E_r} E_r^{\Delta E_r}}{\Delta E_r!} \right) \left[ \sum_{\Delta E_s} \left( \frac{e^{-T E_r} (T E_r)^{\Delta E_s}}{\Delta E_s!} \right) \left( \frac{\Delta E_s}{E_r + \Delta E_r} \right) \right] \tag{49-125}$$

To investigate the behavior of equation 49-125, we start by investigating the properties and behavior of the inner summation alone. Therefore let us break out that part of the equation and see what we have:

$$S_{WP} = \frac{1}{E_r + \Delta E_r} \sum_{\Delta E_s} \left( \frac{e^{-T E_r} (T E_r)^{\Delta E_s}}{\Delta E_s!} \right) \Delta E_s \tag{49-126}$$

where we have taken 1/(Er + ΔEr) outside the summation (since it does not depend on the summation variable and is therefore a constant for the summation), and we are now using the symbol S_WP to indicate the weighted averaging to be done over the sample noise term alone. As we see, the summation is in itself the expected value of ΔEs, that is, the sum of the products of the Poisson probabilities of the sample readings and the values of those readings. The values of Er and ΔEr are constant for the summation over ΔEs and therefore mainly act as a scaling factor; however, they do also affect the values and distribution achievable by the expected value, since the value of Es is limited to be no larger than Er, or equivalently, 0 ≤ T ≤ 1. The summation over ΔEs, therefore, is still subject to the values of two parameters, Er and T.

Let us take a look at the behavior of this system. Figure 49-21 shows, corresponding to Figure 43-5 [4], the Poisson distribution (for two values of λ, in Figures 49-21a and 49-21b respectively) overlaid with P_T(ΔEs) and F_T(ΔEs), and also, as we showed in Figure 43-5, the cross-product of the two functions. The factor 1/(Er + ΔEr) was set at unity. Since ΔEs appears in the numerator, F_T(ΔEs) is linear with T. These figures show how each curve increases in magnitude and is shifted toward larger values of X. Figure 49-22 shows more members of the family of curves described by this function. It may be compared to Figure 49-20 to see how they relate to the original Poisson distributions.
Integrating the curves in Figure 49-22 (by performing the indicated summation) reveals that those integrals equal λ. In the limit of large values of λ this behavior is obvious, for the following reason: since the standard deviation equals the square root of λ, at large enough values the distribution becomes (relative to its mean) essentially a "spike" of unit integral at λ, and effectively zero elsewhere; when this is multiplied by the function F = λ, the unit value is multiplied by λ, giving λ as the value of the integral. From Figures 49-20 and 49-21 it is not at all obvious that this same result is obtained at small values of λ. However, neither is it very surprising that the expected value of ΔEs equals Es, and it is gratifying to find that it is so. Given this result, we may now replace the entire inner summation in equation 49-126 by λ, which as we have seen is Es, and in equation 49-125 we therefore set it equal to TEr. Therefore,

$$S_{WP} = \frac{T E_r}{E_r + \Delta E_r} \tag{49-127}$$
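The claim that the inner summation of equation 49-126 equals λ = TEr even at small λ is easy to check numerically. The sketch below is only an illustration: the function name, the use of 100 terms, and the sample values of T and Er are assumptions, and the terms are built up by recurrence to avoid large factorials.

```python
import math

def inner_summation(T, Er, n_terms=100):
    """Sum over dEs of P(dEs) * dEs, with P Poisson with parameter lam = T * Er.

    This is the summation in equation 49-126 without the 1/(Er + dEr) factor.
    """
    lam = T * Er
    term = math.exp(-lam)          # P(0)
    total = 0.0
    for k in range(n_terms):
        total += k * term          # reading value times its probability
        term *= lam / (k + 1)      # P(k+1) = P(k) * lam / (k+1)
    return total

for T, Er in ((0.5, 0.4), (0.1, 1.0), (1.0, 2.0), (0.8, 10.0)):
    print(f"T = {T}, Er = {Er}: summation = {inner_summation(T, Er):.10f}, "
          f"T*Er = {T * Er:.10f}")
```

In each case the summation agrees with TEr to within rounding error, which is the substitution used to obtain equation 49-127.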

Figure 49-21 Poisson distribution multiplied by ΔEs. [Weighted Poisson distribution plots versus ΔEs showing P(S), ΔEs (scaled), and P(S) × ΔEs: (a) λ = 1; (b) λ = 2.]

Figure 49-22 Family of functions of P(S) × ΔEs at various values of λ. [Plot of P(S) × ΔEs versus ΔEs for λ = 1 to λ = 11.]

and we may now also substitute this result in equation 49-125:

$$X_{WP} = \sum_{\Delta E_r} \left( \frac{e^{-E_r} E_r^{\Delta E_r}}{\Delta E_r!} \right) \left( \frac{T E_r}{E_r + \Delta E_r} \right) + \sum_{\Delta E_r} \left( \frac{e^{-E_r} E_r^{\Delta E_r}}{\Delta E_r!} \right) \left( \frac{T E_r}{E_r + \Delta E_r} \right) \tag{49-128}$$

which then simplifies to

$$X_{WP} = 2 \sum_{\Delta E_r} \left( \frac{e^{-E_r} E_r^{\Delta E_r}}{\Delta E_r!} \right) \left( \frac{T E_r}{E_r + \Delta E_r} \right) \tag{49-129}$$

This is a result we could have obtained directly (and much more simply) by setting ΔEs = TEr in equation 49-124, but at that point we had no justification to do so. We are now interested in integrating equation 49-129; in this equation Er corresponds to λ and ΔEr corresponds to X, the variable of integration (or summation, actually). Thus the equation has two parameters that can affect the result: Er and T. Our interest here is in the effect of Er on the nature of the computed transmittance at small values of Er; therefore we consider T to be a constant as we integrate (sum) over values of ΔEr, and for the integration we take T outside the summation:

$$X_{WP} = 2T \sum_{\Delta E_r} \left( \frac{e^{-E_r} E_r^{\Delta E_r}}{\Delta E_r!} \right) \left( \frac{E_r}{E_r + \Delta E_r} \right) \tag{49-130}$$

Equation 49-130 is now exactly in the form X_WP = Σ P(X) × F(X) (times a scaling factor) that we started with in equation 49-121, and is now in a form that can be more easily worked with. More importantly, it is also in a form that is useful and convenient: it is in the form of T times a multiplying factor. It now remains to find out the nature and behavior of the multiplying factor. We will therefore now investigate the behavior of equation 49-130, similarly to the way we investigated equation 49-126, and for that matter, the corresponding equation 43-62 for the case of Normally distributed noise [4]. We start by plotting the term Er/(Er + ΔEr) (which we call F(ΔEr)) against ΔEr in Figure 49-23, with Er as the parameter distinguishing the curves. While Er can in fact take non-integer values as described above, for our current discussion we will consider it as having integer values for the sake of convenience, although toward the end we will plot it using non-integer values when this serves our purpose. Therefore in Figure 49-23 we plot the values of F(ΔEr) corresponding to integer values of the parameter Er. One point we note is what we might expect from the nature of the term for F(ΔEr) in equation 49-130: as Er assumes larger values, the term Er/(Er + ΔEr) becomes less sensitive to the effect of ΔEr, becoming flatter and flatter as Er increases. This behavior is expected since, if we consider the behavior of F(ΔEr) as Er becomes indefinitely large, ΔEr will become negligible compared to Er, thus giving the results for the large-signal situation that we obtained in the previous two chapters. At that point, with ΔEr negligible compared to Er, the expression reduces to Er/Er which, of course, is unity. In Figure 49-24 we present the plots of P(ΔEr), F(ΔEr), and their cross-product (as we previously did for P(ΔEs), F(ΔEs), and their cross-product), as functions of ΔEr.

Figure 49-23 F(ΔEr) at various values of Er. [Plot of F(ΔEr) versus ΔEr for Er = 1 to Er = 11.]

Figure 49-24 Terms for P(ΔEr), F(ΔEr), and their product. [Plots versus ΔEr: (a) Er = 1 (λ = 1); (b) Er = 2 (λ = 2).]

Figure 49-25 Family of terms for P(ΔEr) × F(ΔEr). Figure 49-25b is an expansion of Figure 49-25a, for small values of λ. (see Color Plate 17) [Plots of the cross-product: (a) 1 ≤ λ ≤ 11; (b) 0.2 ≤ λ ≤ 2.]

In Figure 49-25 we present the family of cross-products, for various values of the parameter Er , again corresponding to our treatment of Es . Figure 49-25 presents this family in two parts: Figure 49-25a presents the family for integer values of Er up to 11, while Figure 49-25b concentrates on the family members corresponding to values of Er less than 2.0. It becomes clear that when Er becomes small enough, the inflation of the value of the function at small values of Er can become indefinitely large.

Figure 49-26 Multiplying factor for T from equation 49-130. [Plot of the multiplier versus Er for Er from 0.2 to 9.8.]

Finally, Figure 49-26 presents the multiplying factor of T from integrating the terms of equation 49-130:

$$\text{Multiplying factor} = 2 \sum_{\Delta E_r} \left( \frac{e^{-E_r} E_r^{\Delta E_r}}{\Delta E_r!} \right) \left( \frac{E_r}{E_r + \Delta E_r} \right)$$

as a function of Er. As we might have expected at this point, the multiplying factor takes values above unity as Er → 0, and approaches unity as Er grows large. The behavior of noise data following the Poisson distribution differs from the behavior of that following the Normal distribution that we observed previously, in that the multiplying factor obtained from the Poisson distribution does not go through a maximum and then approach zero as Er → 0, which was the behavior we observed for the Normal distribution. The reason for this difference is clear, and is due to one of the characteristics of the Poisson distribution we noted above: the Poisson distribution does not admit of negative values, while the Normal distribution does. Thus, when data following the Normal distribution is averaged, including these negative values in the averaging process reduces the average that is computed, and the computed mean therefore approaches zero as Er → 0, at which point the data contains as many negative values as positive values. Since data following the Poisson distribution has no negative values, this effect cannot occur, and therefore in this case the multiplying factor stays above unity as Er → 0. As we noted, as Er grows large, the multiplying factor approaches unity, as it must in order that T approach its defined value of Es/Er for the large-signal situation.

DISCUSSION

Equation 49-130, and the plot of the multiplying factor presented in Figure 49-26, seem pretty straightforward, but in fact there is a significant problem attendant on their application to the real world, that is, to actual measurements. We were able to generate that equation and figure based on the fact that Er can, in fact, take non-integer values. Since ΔEr can be zero, equation 49-130 is prevented from diverging only by the fact that Er is non-zero, even if non-integer. In a sense, however, that is only a mathematical fiction, since in a real-world measurement we do not know the value of Er. If we did, we would not need to make the measurement. The only quantities we know are the values of ΔEr, that is, the individual readings, for which the Poisson distribution effectively provides us with estimates of the probability of obtaining the various values of ΔEr (0, 1, 2, 3, and so on) from given values of Er (i.e., λ). This represents a key difference between the Poisson and the Normal distributions. As we discussed at the appropriate point in our derivations dealing with the Normal distribution, a value of exactly zero is never obtained in that case [4]. When we make an actual, real-world measurement from data following the Poisson distribution, the actual reading we obtain will be one of those values of ΔEr from the list 0, 1, 2, 3, and so on, each time we make the measurement. Some of the time, with a probability that depends on the value of Er, the reading will be exactly zero, a situation which would not actually occur when the Normal distribution was the operative distribution. For example, if λ = 0.5, we will never obtain an actual reading of 0.5 counts; what will actually happen is that roughly 6/10 of the measurements will contain zero counts, roughly 3/10 of the measurements will contain one count, and the remaining stragglers will contain more than one count: only the average number of counts from many measurements will be 0.5.
In this case, if even a single value of zero for ΔEr is put into equation 41-6 [2], the computed value of T for that reading will be infinite, unless the measurement of the corresponding ΔEs is also zero (in which case the computed value of T is undefined, in both the mathematical and the real-world senses). Clearly, averaging together an infinite value with any number of finite values will still result in a computed average whose value is also infinite. What can we make of this situation? If we knew Er we could deal with the real-world case. In principle we can find out Er by measuring sufficiently many times and averaging together many readings (some of which may still be zero, but that is acceptable in this case). To make those measurements, however, will take a longer time, and if we are willing to spend the time to make the measurements, we can simply do that at the start, and let the counts accumulate so that we can work in a regime farther removed from the Er → 0 situation that is causing all this trouble in the first place. That is certainly one solution. To measure for many short time intervals and average together the readings certainly is, in principle, another solution, but one that we cannot find a justification for. Perhaps some of our readers know of, or can do a thought experiment to come up with, a scenario that would require many separate short data collection sessions providing data that could be averaged as we describe, but that does not allow for a single protracted measurement. The bottom line is that the underlying reason for the problem we ran into is the fundamental difference between a continuous (the Normal case) and a discrete (Poisson) distribution.
In the first case, values of exactly zero will never be obtained, although a value may come arbitrarily close to zero and the difference from zero may be unmeasurable by a particular instrument, although we can argue that even in this case the measurement of an exact zero value is an artifact of the discrete measurement levels

inherent in the use of A/D converters. As we will see, however, the solution to this difficulty is the same as the solution to the creation of distortions we found when operating at low signal-to-noise levels when the Normal distribution is the operative one. In any case, using single readings of ΔEr when the individual values equal zero is not an option, due to the generation of the infinity. However, as long as no single reading comes up with a value of zero, there is nothing wrong with making the short-time measurements and averaging together the computed values of transmittance. We simply need to make sure that in any given series of readings the probability of obtaining a value of zero for any reading is small enough that it does not actually occur during our series of readings. Toward this end we present these probabilities in Table 49-3, which were simply computed directly from the formula for the Poisson distribution, for several values of λ, for X = 0. From this table, and some elementary probability theory which can be found in virtually any book on elementary Statistics, or in our early chapters (collected in [10]), the interested reader can pick a value for λ which will virtually always give high enough counts that no reading will ever be zero.

Table 49-3 Probability of obtaining a reading of zero, for various values of the parameter λ

Lambda    Poisson probability at X = 0
1         0.367879441
2         0.135335283
3         0.049787068
4         0.018315639
5         0.006737947
6         0.002478752
7         0.000911882
8         0.000335463
9         0.000123410
10        0.000045400
11        0.000016702
12        0.000006144
13        0.000002260
14        0.000000832
15        0.000000306
16        0.000000113
17        0.000000041
18        0.000000015
19        0.000000006
20        0.000000002

In the practical matter of performing the summations indicated for the various formulas that must be evaluated, the question arises as to how many terms need to be included; this question is analogous to the need to decide the limits of integration that was implicit in evaluating the analogous expressions for the Normal distribution. In the case of the Poisson distribution this is one decision that is actually easier to make. The reason is twofold. The first reason is that we are in fact doing a summation. In the case of the Normal distribution, the summation that was done was an approximation to an integral, and therefore engendered questions as to how closely the summation we performed approximated that integral, a question that was affected by the size of the interval used for the summations. The Poisson distribution, as equation 49-122 shows, is defined directly on discrete values, and the question of approximating an integral does not arise. The second reason is that the expression in the denominator of equation 49-122 contains the factorial of the term number. This factorial increases much faster than any of the expressions in the numerator, and therefore successive terms fairly quickly become very small once the term number exceeds the value of λ. Hence it requires only a relatively small number of terms for the summation to converge to a point such that the sum of the remaining terms is less than the precision of the computer; with standard double-precision number representation, this is approximately 10^-16. Inspection of the values of individual terms reveals that for values of λ up to λ = 10, this point is reached at the 46th term in the worst case. To gain some margin, however, all computations were done using 100 terms of the summation expressed in equation 49-122. Again in the worst case (i.e., λ = 10), the value of the 100th term is 4.86 × 10^-63.

From Table 49-3, we see that for values of λ greater than about 5, the probability of obtaining a value of zero becomes very small. From Figure 49-26 (or, strictly speaking, from the table of values from which Figure 49-26 was plotted) the value of the multiplying factor at λ = 5 is 1.052283, so that gives us an upper limit of approximately 5% as the amount of distortion that we can expect to be realized in an actual measurement situation. Again, we continue our derivations in our next chapter.
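The entries of Table 49-3 and the "elementary probability theory" alluded to above can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, the readings in a series are assumed independent, and the 10^-16 threshold is the double-precision figure quoted in the text.

```python
import math

def prob_zero(lam):
    """Poisson probability of a reading of exactly zero: P(0) = exp(-lam)."""
    return math.exp(-lam)

def prob_any_zero(lam, n_readings):
    """Probability that at least one of n independent readings is zero."""
    return 1.0 - (1.0 - prob_zero(lam)) ** n_readings

def first_negligible_term(lam, tol=1e-16):
    """Smallest X beyond lam at which the Poisson term P(X) drops below tol."""
    x = 0
    term = math.exp(-lam)              # P(0)
    while x <= lam or term >= tol:
        x += 1
        term *= lam / x                # P(x) = P(x-1) * lam / x
    return x

for lam in (1, 5, 10, 20):
    print(f"lambda = {lam:2d}: P(0) = {prob_zero(lam):.9f}, "
          f"P(any zero in 100 readings) = {prob_any_zero(lam, 100):.6f}")

print("lambda = 10: terms fall below 1e-16 at X =", first_negligible_term(10.0))
```

For example, at λ = 5 a single reading is zero less than 1% of the time, but in a series of 100 independent readings there is still roughly an even chance that at least one of them is zero, which is why the choice of λ has to take the length of the series into account.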

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 15(10), 24–25 (2000).
2. Mark, H. and Workman, J., Spectroscopy 15(11), 20–23 (2000).
3. Mark, H. and Workman, J., Spectroscopy 15(12), 14–17 (2000).
4. Mark, H. and Workman, J., Spectroscopy 16(2), 44–52 (2001).
5. Mark, H. and Workman, J., Spectroscopy 16(4), 34–37 (2001).
6. Mark, H. and Workman, J., Spectroscopy 16(5), 20–24 (2001).
7. Mark, H. and Workman, J., Spectroscopy 16(7), 36–40 (2001).
8. Mark, H. and Workman, J., Spectroscopy 16(11), 36–40 (2001).
9. Mark, H. and Workman, J., Spectroscopy 16(12), 23–26 (2001).
10. Mark, H. and Workman, J., Statistics in Spectroscopy, 1st ed. (Academic Press, New York, 1991).

50

Analysis of Noise: Part 11

This chapter continues the series of chapters, Chapters 40 through 49 [1–10], dealing with the rigorous derivation of the expressions relating instrument (and other) noise to its effects on the spectra we observe. As we do in each chapter in this section of the book, we again take this opportunity to remind our readers that we are dealing with a continuous series of chapters, and so we continue our equation numbering, figure numbering, use of symbols, and so on as though there were no break, except that when we repeat an equation or series of equations that were derived and presented previously, we retain the original number(s) for those equation(s). We forego summarizing all the previous work, except to note that so far we have treated similarly the cases of detector noise following both the Normal and the Poisson distributions: finding expressions for the noise of transmittance readings, then the noise of absorbance readings, followed by finding the transmittance value at which the optimum analytical accuracy can be obtained (defined as the transmittance corresponding to the minimum relative absorbance noise), and then deriving the expected value of transmittance for the case where the signal falls so low as to be comparable to the detector noise level. We are currently at the point in the treatment of Poisson-distributed noise where, to continue following the procedure set up for the case of Normally distributed noise, we wish to derive the value of the expected noise of transmittance readings when the signal falls so low that the optical signal level is comparable to the detector noise. So we are ready to continue.
Before doing so, however, let us remind ourselves of one of the key points we learned during our examination of the properties of the expression for the transmittance of samples when the reference energy is low: since the Poisson distribution is a discrete distribution, when the reference energy is low there is a reasonably high probability that a reading containing zero counts will be obtained. A reading of exactly zero will effectively never occur when a continuous distribution is the governing distribution, so we have a situation that we have not run into before: a high likelihood that a divide-by-zero computation will occur when computing the transmittance. This will give rise to a computed value of infinity for the transmittance. It remains to be seen whether there will also be a similar effect on the computed noise.

As we start on this next piece of the pie, we remind ourselves that we wish to build on the work we have done previously, so as to not have to repeat the derivations of already-derived expressions. Hence we will note the high points and provide the references to where the interested reader can review the pertinent mathematical steps. We start by following the derivation of the transmittance noise for the constant-noise situation, which we presented in [5]. We began with the definition of transmittance T according to equation 50-6:

$$T = \frac{E_s}{E_r} \tag{50-6}$$

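We then applied the Propagation of Uncertainties expression:

$$dF(C, D) = \frac{\partial f(C, D)}{\partial C}\, dC + \frac{\partial f(C, D)}{\partial D}\, dD \tag{50-64}$$

where C = Es and D = Er, to obtain

$$\Delta T = \frac{\Delta E_s}{E_r} + \frac{-E_s\, \Delta E_r}{E_r^2} \tag{50-66}$$

and, after taking the variance of equation 50-66 and applying the two statistical theorems that allow us to simplify the expressions, we obtained

$$\mathrm{Var}(\Delta T) = \left( \frac{1}{E_r} \right)^2 \mathrm{Var}(\Delta E_s) + \left( \frac{-E_s}{E_r^2} \right)^2 \mathrm{Var}(\Delta E_r) \tag{50-70}$$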

Previously, in the case of constant detector noise, we then set Var(ΔEs) and Var(ΔEr) equal to the same value. This is the point at which we must now depart from the previous derivation, since in the case of Poisson-distributed noise the sample and reference noise levels will rarely, if ever, be the same. However, we are fortunate in this case that Poisson-distributed noise has a unique and very useful property that we have indeed previously made use of: the variance of Poisson-distributed noise is equal to the mean signal value. Hence we can substitute Es for Var(ΔEs) and Er for Var(ΔEr):

$$\mathrm{Var}(\Delta T) = \left( \frac{1}{E_r} \right)^2 E_s + \left( \frac{-E_s}{E_r^2} \right)^2 E_r \tag{50-131}$$

$$\mathrm{Var}(\Delta T) = \frac{E_s}{E_r^2} + \frac{E_s^2}{E_r^3} \tag{50-132}$$

and setting Es = TEr:

$$\mathrm{Var}(\Delta T) = \frac{T}{E_r} + \frac{T^2}{E_r} \tag{50-133}$$

$$\mathrm{SD}(\Delta T) = \sqrt{\frac{T}{E_r} + \frac{T^2}{E_r}} \tag{50-134}$$
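Equation 50-134 is simple enough to evaluate directly. The sketch below is illustrative only (the function names are assumptions); it computes the transmittance noise from equation 50-134 and confirms that the same value follows from equation 50-132 before the substitution Es = TEr.

```python
import math

def sd_transmittance(T, Er):
    """Transmittance noise for Poisson-distributed data, equation 50-134."""
    return math.sqrt(T / Er + T ** 2 / Er)

def var_transmittance_from_energies(Es, Er):
    """Variance of the transmittance from equation 50-132, in terms of Es and Er."""
    return Es / Er ** 2 + Es ** 2 / Er ** 3

for T in (0.1, 0.5, 1.0):
    for Er in (1.0, 5.0, 100.0):
        via_134 = sd_transmittance(T, Er)
        via_132 = math.sqrt(var_transmittance_from_energies(T * Er, Er))
        print(f"T = {T:3.1f}, Er = {Er:6.1f}: SD = {via_134:.6f} "
              f"(from equation 50-132: {via_132:.6f})")
```

Factoring the expression as sqrt(T(1 + T)/Er) makes the two trends explicit: the noise grows with T and falls as the inverse square root of Er.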

Figure 50-27 plots the transmittance noise as a function of Er according to equation 50-134, for several values of the transmittance. As we can see from inspecting equation 50-134, at all values of Er the noise increases with T, while for all values of T the transmittance noise decreases inversely with the square root of the reference signal level. However, we remind our readers that, as we discussed in the previous chapter, values of Er less than 5 provide only a mathematical expectation, and a mathematical fiction, since any value of Er that is small enough to result in an actual zero reading will give an infinite value for the transmittance, and for the noise level. Nevertheless, equation 50-134 is valid for all values of Er, and therefore, while the plot we constructed includes values that cannot be achieved in reality, both the plot and the equation are valid in the range that can be actually measured.

Figure 50-27 Transmittance noise for Poisson-distributed data as a function of Er at different values of parameter T, from equation 50-134. [Plot of noise versus Er for Er from 0.05 to 4.85, with curves from T = 0.1 to T = 1.]

It is also interesting to compare equation 50-134 with equation 50-72, which is the corresponding equation that describes the transmittance noise when the detector noise level is constant [5]:

$$\mathrm{SD}(\Delta T) = \sqrt{1 + T^2}\; \frac{\mathrm{SD}(\Delta E)}{E_r} \tag{50-72}$$

As usual, we will continue in the next chapter, where we will discuss the various aspects of absorbance noise that are of concern.

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 15(10), 24–25 (2000).
2. Mark, H. and Workman, J., Spectroscopy 15(11), 20–23 (2000).
3. Mark, H. and Workman, J., Spectroscopy 15(12), 14–17 (2000).
4. Mark, H. and Workman, J., Spectroscopy 16(2), 44–52 (2001).
5. Mark, H. and Workman, J., Spectroscopy 16(4), 34–37 (2001).
6. Mark, H. and Workman, J., Spectroscopy 16(5), 20–24 (2001).
7. Mark, H. and Workman, J., Spectroscopy 16(7), 36–40 (2001).
8. Mark, H. and Workman, J., Spectroscopy 16(11), 36–40 (2001).
9. Mark, H. and Workman, J., Spectroscopy 16(12), 23–26 (2001).
10. Mark, H. and Workman, J., Spectroscopy 17(1), 42–49 (2001).

51

Analysis of Noise: Part 12

This chapter is one more in the set, Chapters 40 through 50 [1–11], dealing with the rigorous derivation of the expressions relating instrument (and other) noise to its effects on the spectra we observe. As we do in each chapter in this section of the book, we again take this opportunity to remind our readers that we are dealing with a continuous series of chapters, and so we continue our equation numbering, figure numbering, use of symbols, and so on as though there were no break, except that when we repeat an equation or series of equations that were derived and presented previously, we sometimes retain the original number(s) for those equation(s).

We can also report that work similar to that in the first few chapters of this "noise" subseries has been reported in Applied Spectroscopy [12]. That paper derives the expressions for the constant-detector-noise case using a calculus-based approach rather than the algebraic approach used in these chapters [2, 3]. It also includes experimental data that verify the correctness of the theoretical development reported in these chapters.

In the previous chapter, we examined the effect of noise on the computed transmittance. Now we wish to examine the behavior of the absorbance for Poisson-distributed noise when the reference signal is small. Our starting point for this is equation 51-24, which we derived previously [3] for the case of constant detector noise; at the point where we take it up, no approximations or special assumptions relating to the noise behavior have yet been made:

$$\Delta A = \frac{-0.4343\, E_r}{E_r} \left( \frac{E_r\, \Delta E_s - E_s\, \Delta E_r}{(E_r + \Delta E_r)\, E_s} \right) \tag{51-24}$$

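Our equation numbering system now causes us to jump from equation 51-24 to 51-135 for our next equation number:

$$\Delta A = \frac{-0.4343\, E_r}{E_r} \left( \frac{E_r\, \Delta E_s}{(E_r + \Delta E_r)\, E_s} \right) + \frac{0.4343\, E_r}{E_r} \left( \frac{E_s\, \Delta E_r}{(E_r + \Delta E_r)\, E_s} \right) \tag{51-135}$$

$$\Delta A = \frac{-0.4343\, E_r}{E_s} \cdot \frac{\Delta E_s}{E_r + \Delta E_r} + \frac{0.4343\, \Delta E_r}{E_r + \Delta E_r} \tag{51-136}$$

and upon taking the variance of ΔA and applying the theorem for the variance of a sum:

$$\mathrm{Var}(\Delta A) = \mathrm{Var}\left( \frac{-0.4343\, E_r}{E_s} \cdot \frac{\Delta E_s}{E_r + \Delta E_r} \right) + \mathrm{Var}\left( \frac{0.4343\, \Delta E_r}{E_r + \Delta E_r} \right) \tag{51-137}$$

and upon applying the theorem for the variance of a constant times a random variable:

$$\mathrm{Var}(\Delta A) = \left( \frac{-0.4343\, E_r}{E_s} \right)^2 \mathrm{Var}\left( \frac{\Delta E_s}{E_r + \Delta E_r} \right) + 0.4343^2\, \mathrm{Var}\left( \frac{\Delta E_r}{E_r + \Delta E_r} \right) \tag{51-138}$$

$$\mathrm{SD}(\Delta A) = \sqrt{ \left( \frac{-0.4343}{T} \right)^2 \mathrm{Var}\left( \frac{\Delta E_s}{E_r + \Delta E_r} \right) + 0.4343^2\, \mathrm{Var}\left( \frac{\Delta E_r}{E_r + \Delta E_r} \right) } \tag{51-139}$$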

Here we again have the problem we previously encountered [4], of not being able to separate the individual variances out of the formulas, because $\Delta E_r$ occurs in the denominator along with $E_r$. There is no help for it but to calculate the individual terms $\Delta E_s/(E_r + \Delta E_r)$ and $\Delta E_r/(E_r + \Delta E_r)$ for all meaningful values of the distributions of $\Delta E_s$ and $\Delta E_r$, and then compute their variances. There are several programming issues involved here, which we discuss a bit later in this chapter. The results are presented in Figure 51-28.

Continuing on to ascertain the value of transmittance corresponding to the optimum relative noise, we recall that in [5] we demonstrated that $\mathrm{Var}(\Delta A/A)$ is given by the following expression, which is still completely general:

$$\mathrm{Var}\!\left(\frac{\Delta A}{A}\right) = \left(\frac{1}{T \ln T}\right)^2 \mathrm{Var}\!\left(\frac{\Delta E_s}{E_r + \Delta E_r}\right) + \left(\frac{1}{\ln T}\right)^2 \mathrm{Var}\!\left(\frac{-\Delta E_r}{E_r + \Delta E_r}\right) \tag{51-77}$$

The problem we had then, which is the same problem we have now, is that due to the presence of $\Delta E_r$ in the denominator of each term in the variance calculations, we cannot further separate the terms to extract the variances of the sample and reference signals by mathematical analysis. Our previous solution to this problem was to use a Monte Carlo numerical computer simulation to examine the behavior of the noise described by these equations, since we could not do a numerical integration.

Figure 51-28 Absorbance noise for Poisson-distributed data at low values of the reference signal. [Plot titled "Absorbance noise": SD(ΔA) versus T.]

Analysis of Noise: Part 12


In the case of Poisson-distributed noise, we can do a systematic numerical calculation. The reasons we can do this now, when we could not do it for the Normal distribution, are the ones we have discussed previously:

1) The Poisson distribution is discrete, and so is more amenable to numerical computation.
2) $E_r$ is never negative.
3) $\Delta E_r$ occurs in the denominator always summed together with $E_r$. Together with point 2, this means that the denominator is never zero as long as the reference energy $E_r$ is non-zero.

Therefore all terms to be included in the computation are finite. Again we repeat our reminder that the results of these computations are mathematical expectations; in a real measurement situation, denominators of zero can be expected to occur when $E_r$ is less than approximately five.

Equation 51-77 was programmed in MATLAB, using the Poisson distribution for both $\Delta E_r$ and $\Delta E_s$; the actual distribution used corresponded to the value of $E_r$ and $E_s$, respectively. The computations were done for $0.01 \le T \le 0.99$, and for values of $1 \le E_r \le 10$.

The computation is not straightforward (neither was the one for evaluating equation 51-139). The terms whose variances are to be computed must themselves be computed. For the first term of equation 51-77, this means that all possible combinations of values of $E_r$ and $E_s$ have to be generated, the term corresponding to each combination computed, and then each term weighted by its frequency according to the Poisson distribution with appropriate arguments. Since the Poisson distribution gives fractional values of the probabilities of occurrence of each value in the distribution, these probability values have to be multiplied by a number that will then provide an integer number of occurrences for each of the terms whose variance is to be computed.

The first attempt created the actual full lists of the terms in equation 51-77. A few short runs, with small values of the multipliers (about 100 for each of $E_r$ and $E_s$, giving 10,000 terms total), were quickly found to be unsatisfactory: the resulting plots were very ragged and uneven. The number of terms was increased to $5 \times 10^5$, using a multiplier of 500 for $E_s$ and a multiplier of 1,000 for $E_r$. Using a larger number of terms than that, although desirable because it made the curve smoother, caused "out of memory" problems in MATLAB. At this number of computation points the curves, although smoother than with fewer points, were still visibly ragged to the eye.

The attempt to create the full lists of terms was therefore abandoned. Instead, one term for each combination of values of $E_s$ and $E_r$ was computed, and the program kept track of how many times that term would appear in the full list. This allowed the program to compute weighted averages and variances, the weighting factors being the number of times a given term would appear in the full list of terms. While more complicated to program, this scheme allowed the computation of the results for the equivalent of very large lists indeed. The actual results presented here are based on the use of 10,000 values to represent the Poisson distribution for $E_r$ and the same number for $E_s$, providing a result equivalent to a list of $10^8$ terms.

Another issue that must be kept in mind when setting up the program is that, despite appearances, the computation of the variance terms is not independent of the value of T that appears in the coefficients of the variance terms in equations 51-139 and 51-77.
The reason is that $E_r$ determines the distribution that must be used for $\Delta E_r$, and T and $E_r$ together determine $E_s$ and hence the distribution of $\Delta E_s$ that must be used in the variance computation.

The resulting plot is presented in Figure 51-29. From the plot, and from examining the list of values from which the plot was made, there appears to be no shift in the transmittance corresponding to the optimum value of relative absorbance noise as the reference reading varies.

As usual, we will continue in the next chapter, where we will start on the derivations of formulas relating to the effects of what we have previously called "scintillation noise", which is also called "flicker noise", "source noise", and other labels. Basically this refers to noise caused by effects that make the variations of the signal proportional to the signal. Are we having fun yet?

Figure 51-29 Relative absorbance noise for Poisson-distributed data, determined by numerical computation using equation 51-77. Figure 51-29b is an ordinate expansion of Figure 51-29a. (see Color Plate 18) [Panels (a) and (b), each titled "Relative absorbance noise": ΔA/A versus T. Curve labels run from Er = 1 to Er = 10 in panel (a) and from Er = 3 to Er = 10 in panel (b).]
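The weighted-term scheme described in this chapter can be sketched in a few lines. The Python sketch below is not the authors' MATLAB program; it is a minimal illustration, under stated assumptions, of evaluating equation 51-77 with each term weighted by its Poisson probability rather than expanded into a full list. All function names are hypothetical, the Poisson tails are cut off roughly ten standard deviations above the mean, and reference readings of zero are simply skipped (an assumption made here so that every term stays finite).

```python
import math


def poisson_pmf(n, mean):
    """Probability of a reading of n counts when the expected reading is `mean`."""
    return math.exp(n * math.log(mean) - mean - math.lgamma(n + 1))


def poisson_table(mean):
    """(reading, probability) pairs, cut off far out in the upper tail (assumption)."""
    upper = int(mean + 10.0 * math.sqrt(mean) + 20)
    return [(n, poisson_pmf(n, mean)) for n in range(upper + 1)]


def weighted_variance(terms):
    """Variance of (value, weight) pairs; the weights need not sum to one."""
    w_sum = wx_sum = wxx_sum = 0.0
    for x, w in terms:
        w_sum += w
        wx_sum += w * x
        wxx_sum += w * x * x
    mean = wx_sum / w_sum
    return wxx_sum / w_sum - mean * mean


def relative_absorbance_variance(T, Er):
    """Var(dA/A) from equation 51-77 for transmittance T and reference energy Er."""
    Es = T * Er  # T and Er together fix Es, and hence the sample distribution
    # Assumption: a reference reading of zero is skipped so no term is infinite.
    ref = [(nr, pr) for nr, pr in poisson_table(Er) if nr > 0]
    smp = poisson_table(Es)
    # One term per combination of readings, weighted by its joint probability.
    term_s = (((ns - Es) / nr, pr * ps) for nr, pr in ref for ns, ps in smp)
    term_r = ((-(nr - Er) / nr, pr) for nr, pr in ref)
    ln_t = math.log(T)
    return (weighted_variance(term_s) / (T * ln_t) ** 2
            + weighted_variance(term_r) / ln_t ** 2)
```

Because the weights are carried along with each term, the memory needed grows only with the number of distinct combinations, not with the length of the equivalent full list. As a sanity check, for a large reference energy the result should approach the low-noise shot-noise value $(1/E_s + 1/E_r)/(\ln T)^2$.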

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 15(10), 24–25 (2000).
2. Mark, H. and Workman, J., Spectroscopy 15(11), 20–23 (2000).
3. Mark, H. and Workman, J., Spectroscopy 15(12), 14–17 (2000).
4. Mark, H. and Workman, J., Spectroscopy 16(2), 44–52 (2001).
5. Mark, H. and Workman, J., Spectroscopy 16(4), 34–37 (2001).
6. Mark, H. and Workman, J., Spectroscopy 16(5), 20–24 (2001).
7. Mark, H. and Workman, J., Spectroscopy 16(7), 36–40 (2001).
8. Mark, H. and Workman, J., Spectroscopy 16(11), 36–40 (2001).
9. Mark, H. and Workman, J., Spectroscopy 16(12), 23–26 (2001).
10. Mark, H. and Workman, J., Spectroscopy 17(1), 42–49 (2001).
11. Mark, H. and Workman, J., Spectroscopy 17(6), 24–25 (2002).
12. Mark, H.L. and Griffiths, P.R., Applied Spectroscopy 56(5), 633–639 (2002).


52 Analysis of Noise: Part 13

This chapter is a continuation of the set of Chapters 40 to 51 [1–12] dealing with the rigorous derivation of the expressions relating instrument (and other) noise to its effects on the spectra we observe. As we do in each chapter in this section of the book, we again take this opportunity to remind our readers that we are dealing with a continuous series of chapters, and so we continue our equation numbering, figure numbering, use of symbols, and so on as though there were no break, except that when we repeat an equation or series of equations that were derived and presented previously, we retain the original number(s) for those equation(s).

We have now gone through the analysis of two cases pretty thoroughly. It should be apparent to the reader what our approach is, and how the analysis of these situations is attacked. Hopefully, therefore, we can now go a little faster than we have been.

In the previous chapter, we pretty much finished up our discussion of noise that was Poisson-distributed. In one sense Poisson-distributed noise is a special case since, for example, when we analyzed the effects of noise that was constant, we did not, until it became pertinent, consider the noise to have any particular distribution. However, since Poisson noise arises naturally out of a particular noise mechanism, and it is one that occurs in several different spectral regions and in conjunction with different technologies, it was appropriate to consider it as an entity unto itself.

The next noise source we consider is what we originally called "scintillation noise" and which, as we noted in the previous chapter, also goes by several other labels, such as source noise and flicker noise. The defining characteristic of this noise is that the variability is directly proportional to the intensity of the signal. One way this can arise is through a mechanical vignetting of an optical beam. If a piece of metal were to block, say, 1% of a homogeneous beam, then the intensity of the beam would be reduced by 1% of its total intensity, so that the absolute reduction of the signal from an intense beam would be greater than for a weak beam. Another way this can arise is through the use of, for example, a photoresistive detector: the detector current will be proportional to the detector voltage as well as to the intensity of radiation impinging on it. Then if the detector voltage varies, the change in detector current for a given voltage change will again be proportional to the optical intensity, but with a sensitivity proportional to the detector voltage. In either case, if the amount of beam blockage or the detector voltage is random, then this sensitivity change becomes a source of random variation proportional to the signal intensity.

Another characteristic of scintillation noise is that, since it represents the amount of energy in the optical beam, it can never attain a negative value. In this respect it is similar to the Poisson distribution, which also can never attain a negative value. On the other hand, since it is a continuous distribution it will behave the same way as the constant-noise case in regard to achieving an actual zero: any given reading can become
infinitesimally close to zero, but there is zero probability of actually achieving an exact value of zero, except in the case of a complete absence of signal.

It differs from the Poisson case in two respects, however. First, the distribution of the variations is not predetermined, but depends on the nature of the changes causing the signal variation. Secondly, the magnitude of the changes is not predetermined, but depends on the amount of variation of the cause. At an appropriate point we will have to accommodate this by introducing a constant representing the magnitude of the variations.

We will go a bit farther in characterizing the various types of noise sources we consider, and in Table 52-4 we list, for comparison purposes, the corresponding characteristics of the three types of noise we have considered or are considering.

Table 52-4 Comparisons between noise characteristics, including the expressions for low-noise behavior

| Type of noise | Constant detector noise | Shot noise | Scintillation noise |
| --- | --- | --- | --- |
| Relation to signal | Independent of signal | Square root | Proportional |
| Continuous | Yes | No | Yes |
| Variance locked to signal level? | No | Yes | No |
| Distribution | Not predetermined | Poisson | Not predetermined |
| Negativity | Negative values possible | Non-negativity constraint | Non-negativity constraint |
| Probability of zero value | Zero | Finite | Zero |
| Expression for transmittance noise | $\frac{\mathrm{SD}(\Delta E)}{E_r}\sqrt{1 + T^2}$ | $\sqrt{\frac{T + T^2}{E_r}}$ | $\sqrt{2}\,kT$ |
| Expression for relative absorbance noise, SD(ΔA)/A | $\frac{\mathrm{SD}(\Delta E)\sqrt{E_s^2 + E_r^2}}{E_s E_r \ln(E_s/E_r)}$ | $\frac{1}{\ln T}\sqrt{\frac{1}{E_s} + \frac{1}{E_r}}$ | $\frac{\sqrt{2}\,k}{\ln T}$ |

So let us begin our analysis. As we did for the analysis of shot (Poisson) noise [8], we start with equation 52-17, in which we derived the expression for the variance of the transmittance without introducing any special assumptions except that the noise was small compared to the signal. For the derivation of this equation, we refer the reader to [2]. So, for the case of noise proportional to the signal level, but small compared to the signal level, we have

$$\mathrm{Var}(\Delta T) = \left(\frac{1}{E_r}\right)^2 \mathrm{Var}(\Delta E_s) + \left(\frac{-T}{E_r}\right)^2 \mathrm{Var}(\Delta E_r) \tag{52-17}$$

In the derivation of the transmittance noise in the case of Poisson-distributed noise, at this point we noted that the variances of $E_r$ and $E_s$ were proportional to $E_r$ and $E_s$ respectively. In the current case, the corresponding relationship is that the standard deviation of the noise on $E_r$ and $E_s$ is proportional to $E_r$ and $E_s$, with a proportionality factor, k, that is related to the magnitude of the physical cause of the noise, i.e.:

$$\mathrm{SD}(\Delta E_r) = kE_r$$

$$\mathrm{SD}(\Delta E_s) = kE_s$$

The variances of $E_r$ and $E_s$, then, are $k^2 E_r^2$ and $k^2 E_s^2$ respectively, and substituting these values in equation 52-17 gives

$$\mathrm{Var}(\Delta T) = \left(\frac{1}{E_r}\right)^2 k^2 E_s^2 + \left(\frac{-T}{E_r}\right)^2 k^2 E_r^2 \tag{52-140}$$

$$\mathrm{Var}(\Delta T) = \frac{k^2 E_s^2}{E_r^2} + k^2 T^2 \tag{52-141}$$

$$\mathrm{Var}(\Delta T) = 2k^2 T^2 \tag{52-142}$$

$$\mathrm{SD}(\Delta T) = \sqrt{2}\,kT \tag{52-143}$$
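Equation 52-143 can be checked with a small simulation. The sketch below is an illustration only, not part of the original derivation: it assumes Normally distributed noise on each channel with standard deviation equal to k times the reading, and the function names and default values are hypothetical.

```python
import math
import random


def simulated_sd_of_T(T, Er, k, n_trials=100_000, seed=1):
    """SD of the computed transmittance when each reading has SD = k * reading."""
    rng = random.Random(seed)
    Es = T * Er
    ratios = []
    for _ in range(n_trials):
        es_noisy = rng.gauss(Es, k * Es)  # assumption: Normally distributed noise
        er_noisy = rng.gauss(Er, k * Er)
        ratios.append(es_noisy / er_noisy)
    mean = sum(ratios) / n_trials
    return math.sqrt(sum((r - mean) ** 2 for r in ratios) / (n_trials - 1))


def predicted_sd_of_T(T, k):
    """Low-noise prediction of equation 52-143."""
    return math.sqrt(2.0) * k * T
```

For small k the simulated and predicted values should agree closely, and rerunning the simulation with a different reference energy but the same T and k should leave the result essentially unchanged, since only the ratio of the readings enters.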

Given the simple nature of this relationship, we forbear to plot the function, although we note that there is again a family of functions, corresponding to the various values of k. We note that, in contrast to the previous two cases, the transmittance noise depends on the magnitude of the effect and on the transmittance of the sample, but does not depend on the energy of the reference beam; in other words, whereas in the previous two cases the signal-to-noise level of the reference beam was a key factor in determining the behavior of the transmittance noise, here it is not. This conforms to intuition, since when we state that the noise superimposed on the signal is proportional to the signal, the implicit consequence is that the signal-to-noise (or noise-to-signal) ratio is constant.

Let us now, as we normally do, continue on to derive the expressions for absorbance noise. Again referring to our previous chapter [8], we can start with equation 52-29:

$$\mathrm{Var}(\Delta A) = \left(\frac{-0.4343}{E_s}\right)^2 \mathrm{Var}(\Delta E_s) + \left(\frac{0.4343}{E_r}\right)^2 \mathrm{Var}(\Delta E_r) \tag{52-29}$$

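Again substituting $k^2 E_r^2$ and $k^2 E_s^2$ for the two variances in equation 52-29:

$$\mathrm{Var}(\Delta A) = \left(\frac{-0.4343}{E_s}\right)^2 k^2 E_s^2 + \left(\frac{0.4343}{E_r}\right)^2 k^2 E_r^2 \tag{52-144}$$

$$\mathrm{Var}(\Delta A) = 2 \times 0.4343^2\,k^2 \tag{52-145}$$

$$\mathrm{SD}(\Delta A) = \sqrt{2} \times 0.4343\,k \tag{52-146}$$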
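As a worked illustration of equations 52-143 and 52-146, suppose (purely as an assumed example, not a value taken from the text) that the noise amounts to 1% of the signal, so that k = 0.01, and that the sample has T = 0.5:

$$\mathrm{SD}(\Delta T) = \sqrt{2} \times 0.01 \times 0.5 \approx 0.0071, \qquad \mathrm{SD}(\Delta A) = \sqrt{2} \times 0.4343 \times 0.01 \approx 0.0061$$

A different value of T would change the first figure but not the second, since T does not appear in equation 52-146.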


Here again, in the low-noise case of scintillation noise, the absorbance noise is independent of the reference signal level, and is now independent of the sample characteristics as well; it depends only on the magnitude of the external noise source.

In conformance with our regular pattern, we now derive the behavior of the relative absorbance noise for the low-noise case. Here we start with equation 52-100, the derivation of which is found in [9]:

$$\mathrm{Var}\!\left(\frac{\Delta A}{A}\right) = \frac{1}{(E_s \ln T)^2}\,\mathrm{Var}(\Delta E_s) + \frac{1}{(E_r \ln T)^2}\,\mathrm{Var}(\Delta E_r) \tag{52-100}$$

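And once more substituting $k^2 E_r^2$ and $k^2 E_s^2$ for the two variance terms:

$$\mathrm{Var}\!\left(\frac{\Delta A}{A}\right) = \frac{1}{(E_s \ln T)^2}\,k^2 E_s^2 + \frac{1}{(E_r \ln T)^2}\,k^2 E_r^2 \tag{52-147}$$

$$\mathrm{Var}\!\left(\frac{\Delta A}{A}\right) = \frac{2k^2}{(\ln T)^2} \tag{52-148}$$

$$\mathrm{SD}\!\left(\frac{\Delta A}{A}\right) = \frac{\sqrt{2}\,k}{\ln T} \tag{52-149}$$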

Equation 52-149 presents a minor difficulty, one that is easily resolved, so let us do so. The difficulty actually arises in the step between equations 52-148 and 52-149, the taking of the square root of the variance to obtain the standard deviation; conventionally we take the positive square root. However, T takes values from zero to unity; that is, it is always less than unity. The logarithm of a number less than unity is negative, hence under these circumstances the denominator of equation 52-149 would be negative, which would lead to a negative value of the standard deviation. But a standard deviation must always be positive; clearly then, in this case we must use the negative square root of the variance to compute the standard deviation of the relative absorbance noise.

In Figure 52-30 we plot the function $-1/\ln(T)$ to complete this part of the analysis. We note that there is no minimum to the curve, and the noise from the source continually improves as the transmittance decreases; in this case the previous, conventional derivations agree with our results, although they do not indicate the $\sqrt{2}$ factor. Noting the transitions from equation 52-140 to 52-142 (and the corresponding portions of the derivation for absorbance noise and relative absorbance noise), we see that this factor arises from the equal noise contributions of the sample and reference channels; therefore we conclude that in this case also, the missing factor is due to the neglect of the reference channel noise contribution. The noise also rises ever more steeply as T increases (not surprising for a logarithmic function!), so that working at transmittance values less than, say, 0.7 or 0.8 is prudent. Of course, we must also remember that our derivations are idealizations, and as Ingle and Crouch point out ([13], p. 153), in a real measurement situation, at some point another noise source would become dominant and limit the actual noise observed.
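The behavior of equation 52-149 over the range of T is easy to tabulate. The sketch below is only an illustration; the function names and the particular T values are assumptions, and the absolute value of ln T is used so that the standard deviation comes out positive, which is equivalent to taking the negative square root as discussed above.

```python
import math


def relative_absorbance_noise(T, k):
    """SD(dA/A) from equation 52-149, reported as a positive number."""
    if not 0.0 < T < 1.0:
        raise ValueError("T must lie strictly between 0 and 1")
    # ln(T) is negative for T < 1, so its magnitude is used here.
    return math.sqrt(2.0) * k / abs(math.log(T))


def tabulate(k, t_values=(0.1, 0.3, 0.5, 0.7, 0.8, 0.9, 0.95)):
    """(T, SD(dA/A)) pairs showing how the noise grows as T approaches 1."""
    return [(t, relative_absorbance_noise(t, k)) for t in t_values]
```

Calling `tabulate(0.01)`, for example, gives values that increase steadily with T and climb sharply above T of about 0.8, in line with the recommendation to work at lower transmittance.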

Figure 52-30 (a) Plot of −1/ln(T) versus T. (b) Ordinate expansion of Figure 52-30a.

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 15(10), 24–25 (2000).
2. Mark, H. and Workman, J., Spectroscopy 15(11), 20–23 (2000).
3. Mark, H. and Workman, J., Spectroscopy 15(12), 14–17 (2000).
4. Mark, H. and Workman, J., Spectroscopy 16(2), 44–52 (2001).
5. Mark, H. and Workman, J., Spectroscopy 16(4), 34–37 (2001).
6. Mark, H. and Workman, J., Spectroscopy 16(5), 20–24 (2001).
7. Mark, H. and Workman, J., Spectroscopy 16(7), 36–40 (2001).
8. Mark, H. and Workman, J., Spectroscopy 16(11), 36–40 (2001).
9. Mark, H. and Workman, J., Spectroscopy 16(12), 23–26 (2001).
10. Mark, H. and Workman, J., Spectroscopy 17(1), 42–49 (2001).
11. Mark, H. and Workman, J., Spectroscopy 17(6), 24–25 (2002).
12. Mark, H. and Workman, J., Spectroscopy 17(12), 38–41, 56 (2002).
13. Ingle, J.D. and Crouch, S.R., Spectrochemical Analysis (Prentice-Hall, Upper Saddle River, NJ, 1988).

53 Analysis of Noise: Part 14

This chapter is a continuation of Chapters 40 to 52 [1–13] dealing with the rigorous derivation of the expressions relating instrument (and other) noise to its effects on the spectra we observe. As we do in each chapter in this section of the book, we again take this opportunity to remind our readers that we are dealing with a continuous series of chapters, and so we continue our equation numbering, figure numbering, use of symbols, and so on as though there were no break, except that when we repeat an equation or series of equations that were derived and presented previously, we retain the original number(s) for those equation(s).

In the previous chapter we analyzed the effect of scintillation noise, that is, noise that is proportional to the signal, for the case of low noise (i.e., noise small compared to the signal level). Now we wish to analyze, as we have done previously, the situation when the noise is no longer negligible compared to the signal level. Here again we enter territory where extra care is needed. In the previous cases, we were able to assume that all conditions of the measurement were constant except that the reference energy was reduced until it was of comparable magnitude to the noise level. In the case of scintillation noise, however, that is not an option. As we noted earlier, as the signal level is reduced (in either channel), the noise is reduced correspondingly, leaving the N/S ratio (i.e., the inverse of the S/N; in our notation, $\mathrm{SD}(\Delta E_r)/E_r$) constant. Therefore we cannot consider reducing the S/N by reducing the reference signal level. The only way we can change the signal-to-noise ratio is by changing the proportionality parameter k, which expresses the noise as a fraction of the signal. As we did for the low-noise case, we will introduce this parameter at the appropriate point in the derivation.
Following our usual sequence, our next step, as it was for the previous two cases we treated (constant detector noise and Poisson-distributed noise), is to ascertain the effect of this noise on the expected value of the computed transmittance. To do this we start with equation 53-5 (reference [2]):

$$T + \Delta T = \frac{E_s + \Delta E_s}{E_r + \Delta E_r} \tag{53-5}$$

In the low-noise case we were able to justify separating equation 53-5 into two terms and setting $T$ equal to $E_s/(E_r + \Delta E_r)$. Here we cannot do that, for several reasons:

1) In the large-noise case, $\Delta E_r$ in the denominator of equation 53-5 is non-negligible and therefore induces an asymmetry that will prevent it from vanishing upon integration.
2) An even larger asymmetry is introduced by a fact we have discussed previously: the physical causes of the error source under consideration preclude both the numerator
and the denominator from becoming negative. Thus when we evaluate equation 53-5 to ascertain the expected values, we cannot continue integration below zero; the integration must be truncated at that point. For a physical picture to describe the situation, we can imagine an optical beam, and some opaque component vibrating randomly into and out of the beam. A schematic picture of this is shown in Figure 53-31a. Clearly, the further into the beam the obstruction intrudes, the more the beam is blocked and the less energy reaches the detector. The instantaneous blockage depends on multiple factors: the average position of the obstruction and the magnitude of the vibration. For our purposes we will assume that the position of the obstruction varies around its central location in such a manner that the distribution of energy in the optical beam varies according to a Normal distribution. Anyone who wants to calculate the actual distribution for an optical beam of interest to them is certainly free to do so and follow through on the calculation of the distribution of noise that actual optical geometry will cause. We will certainly appreciate hearing about any efforts in that direction, and the results obtained. For our purposes, however, we simply wish to point out that as the center of the obstruction’s motions moves close to the beam, more and more of the beam is blocked. Also, if the vibrational amplitude of the obstruction’s motion is constant, then the blockage will represent larger and larger fractions of the beam’s energy. However, there is a limit to that: if the obstruction moves so that it completely blocks the beam, then the instantaneous energy transmitted will be zero. As the obstruction continues to move closer to the optical beam, then complete blockage can occur more and more often, or equivalently, for larger and larger fractions of the time, but at no time can the energy transmitted become less than zero. 
From Figure 53-31a we can also see a corollary: if the amplitude of the obstruction's vibration is large enough, it will move completely out of the optical beam, resulting in truncation of the Normal distribution due to the fact that there will also be a maximum possible value for the energy, corresponding to the situation when the beam is completely unblocked. For our current analysis, however, we will consider only the case where the average position of the obstruction is within the beam and so the beam is always at least partially blocked. Then the vibrations of the obstruction cause it to block varying amounts of energy, down to complete blockage, but not to complete passage.

Figure 53-31a An obstruction vignetting the beam randomly can affect the signal level, but cannot do more than block the entire beam, reducing the energy to zero. [Schematic: side view and end view of the obstruction and the optical beam.]

Figure 53-31b Once the optical beam is completely blocked, no less light can pass through the optical system. The average light that then can pass is the integral of the shaded area. [Plot: ordinate from 0 to 1; abscissa labeled "Energy".]

The effect of this on the beam energy is indicated in Figure 53-31b. If the distribution of energies is Normal, then the part below the lower limiting edge is truncated, since it is not possible to have less than zero energy. As we did in the analysis of Poisson-distributed noise, we compute the expected value of T as the weighted sum of the transmittance described by equation 49-5 (reference [10]):

$$T_W = \frac{\sum_i W_i\,F(X_i)}{\sum_i W_i} \tag{53-59}$$

The evaluation of equation 53-59 for the current case of scintillation noise carries with it its own set of difficulties and cautions, just as the previous cases did. Some of them, caused by the physical limitation of not allowing the energy to go below zero, were mentioned above. Others mirror the two cases we have discussed in the past several chapters; the case of scintillation noise seems to combine some of the more difficult aspects of the two previous cases.

Like the Poisson distribution, the value of the function, and therefore of the integration, does not go below zero, mirroring the physical fact that the actual optical energy cannot go below zero. Unlike the Poisson distribution (which is discrete), on the other hand, the values of the energy form a continuum, as does the Normal distribution we are assuming that the noise follows. Therefore we cannot simply add together the relatively small number of discrete values that the function can assume, but must perform a numerical integration over the range of values that will make appreciable contributions to the result.

Another consequence of not following the Poisson distribution is that the noise level is not locked to the energy level. Rather, the value of k, which determines the N/S ratio, is independent of the energy. This precludes any simplification of the equations such as we were able to apply to the Poisson-distributed noise case. On the other hand, neither can we apply some of the simplifications we used in the case of constant detector noise, particularly the fact that the integral of the Normal distribution
is unity. Since in the case we deal with now the distribution is truncated, we must perform numerical integrations of the distribution corresponding to the amount of truncation, in order to ascertain the behavior of this situation.

This creates another complication in the analysis. While at first glance this limitation of not allowing the signal to go below zero seems like a benefit, because it gives us a hard limit for the computation of the integrals, we also have to consider the effect of this limitation on the denominator of equation 53-59, as well as on the numerator. In the previous cases we have considered, the weighting function was a well-behaved mathematical probability function, either the Normal or the Poisson distribution. Both of these distributions evaluated to unity over the range of interest, and therefore we could replace the denominator with unity and ignore it thereafter. Now we wish to use the Normal distribution to describe the behavior of the error contribution we are evaluating, but cannot consider evaluating the integral from −∞ to +∞, since we have seen that there is a lower limit to the integral. Furthermore, the lower limiting value remains at zero regardless of the value corresponding to the maximum of the distribution.
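The role of the denominator in equation 53-59 can be seen in a short numerical sketch. The code below is an illustration under stated assumptions, not the computation used for this chapter: the weights are taken as a Normal density cut off at zero, the quantity averaged is simply the energy itself, and the function names, the midpoint rule and the example values (mean energy 1, k = 1) are all hypothetical choices.

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def truncated_weighted_average(f, mean, sd, lower=0.0, upper_sds=8.0, steps=20_000):
    """Equation 53-59 with Normal weights that are cut off below `lower`.

    Returns (weighted average of f, integral of the weights).
    """
    upper = mean + upper_sds * sd
    h = (upper - lower) / steps
    numerator = denominator = 0.0
    for i in range(steps):
        x = lower + (i + 0.5) * h  # midpoint of each integration interval
        w = normal_pdf(x, mean, sd)
        numerator += w * f(x)
        denominator += w
    return numerator / denominator, denominator * h


# Assumed example: mean energy 1 with noise SD equal to the mean (k = 1).
average_energy, weight_integral = truncated_weighted_average(lambda e: e, 1.0, 1.0)
```

With these example values the integral of the weights comes out near 0.84 rather than unity, and the weighted average energy is about 1.29 rather than 1, which is why the denominator can no longer be dropped.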

PRELIMINARY STEPS

The evaluation of equation 53-59, therefore, starts with the evaluation of the truncated Normal distribution. This is the value obtained by integrating the Normal distribution not between −∞ and +∞, but between the lower cutoff value, whatever that is, and +∞. The Normal distribution, whose integral is closely related to the error function, is well known not to be integrable analytically in closed form; therefore numerical approximations are needed to ascertain the value. Indeed, it has been computed to high accuracy and the values are available in tables; see, for example, ([14], p. 3). It is necessary, however, for us to be able to perform these computations ourselves, so that we can also use them in evaluating the weighted averages specified by equation 53-59. This is similar to the computations we performed previously for the case of constant (detector) noise, but differs in that the previous computation was done over the full significant range of the Normal distribution, instead of the truncated distribution.

As a test, then, the computation was written in MATLAB (Mathworks, Natick, Mass.). The result of the computation, for a continuum of values of the point of truncation, is shown in Figure 53-32. The accuracy of the integration was evaluated by comparing the values computed from the MATLAB program to the tables available ([14], p. 3) at several selected values of X (where X represents the number of standard deviations from which the truncated distribution was evaluated), as a function of the integration interval. The results are presented in Table 53-5.

We also inspect the nature of the function that we will be integrating. In the picture corresponding to the small-noise case, the underlying energy of the optical beam is effectively constant over the range of variation of energy; indeed, this is the definition of "small noise". For a Normally distributed vibration, the energy would thus also be Normally distributed.
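The accuracy check described for Table 53-5 can be sketched as follows. This is a minimal illustration rather than the MATLAB program itself: it assumes a simple rectangle rule (the text does not say which integration rule was used), integrates up to +4 standard deviations, and uses the error function from Python's standard library in place of the printed tables. The function names are hypothetical.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def rectangle_integral(x_cut, step, upper=4.0):
    """Sum the standard Normal curve from x_cut to `upper` in rectangles of width `step`."""
    n = round((upper - x_cut) / step)
    return step * sum(normal_pdf(x_cut + i * step) for i in range(n))


def reference_integral(x_cut, upper=4.0):
    """The same area from the error function, standing in for the published tables."""
    return 0.5 * (math.erf(upper / math.sqrt(2.0)) - math.erf(x_cut / math.sqrt(2.0)))


def integration_errors(x_cut, steps=(0.1, 0.01, 0.001)):
    """(integration interval, absolute error) pairs for a given truncation point."""
    ref = reference_integral(x_cut)
    return [(h, abs(rectangle_integral(x_cut, h) - ref)) for h in steps]
```

With a rectangle rule the error shrinks roughly in proportion to the integration interval. For truncation at the peak of the distribution (`integration_errors(0.0)`), the error is about 0.02 at an interval of 0.1 and falls about tenfold for each tenfold reduction of the interval.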
For the large-noise case, however, the energy varies appreciably over the range of vibration, and the variation increases with k, distorting the shape of the curve. This behavior, which is shown in Figure 53-33, corresponds to the plot in Figure 43-5 (Chapter 43) for the case of constant detector noise (reference [4]). Since

Analysis of Noise: Part 14

333

1.20 1.00 0.80 0.60 0.40 0.20 2.94

2.61

2.28

1.95

1.62

1.29

0.96

0.63

0.3

–0.03

–0.36

–0.69

–1.02

–1.35

–1.68

–2.01

–2.34

–3

–2.67

0.00

Figure 53-32 The integral of the truncated Normal distribution for values of the point of trun cation between −3 and +3 standard deviations. In all cases the integration was continued to +4 standard deviations.

Table 53-5 Accuracy of the integral of the truncated Normal curve for different values of the integration interval Integration interval

Error of integral

0.1 0.01 0.001 0.0001

0019 00019 000017 0000019

Figure 53-33 The relation between the Normal distribution (truncated at −1 SD), the energy variation and the product of the two curves. (Curves shown: Energy; Truncated normal distribution; Product.)

Figure 53-34 Family of curves of the Energy-Distribution product, corresponding to various truncation points. The numbers indicate the truncation point of the Normal distribution, as the number of standard deviations from the peak of the Normal distribution.

the nature of the curve will vary as the degree of truncation varies, this also represents a family of curves. Figure 53-34 presents this family of curves, for various values of the point of truncation with respect to the Normal distribution curve.

There is a point that we have implied in the foregoing discussion but have not made explicit, so let us correct that oversight now: in the previous discussions of the mathematics behind the analysis of scintillation noise, we pointed out that, since the noise decreases with the signal, changes in S/N cannot be accomplished by changing the reference signal energy (or, for that matter, the sample energy), since the noise will be reduced proportionately. Therefore, the noise level must be expressed as a multiplier, which we called k, times the signal level. This parameter, k, expresses the standard deviation of the noise as a fraction of the signal energy. Thus, the value at which the Normal distribution becomes truncated can be expressed as a function of k. For example, if k = 1 (the standard deviation of the energy due to movement of the obstruction equals the energy at the average position of the obstruction), then 95% of the time more energy will be present than the value E − 2k and the cutoff will be at −2k (strictly speaking, at −1.98k, but we will use the common approximation of 2 since it will be simpler to deal with other cases; anyone concerned about the discrepancy can adjust the probability levels to compensate). If k = 2, then the corresponding cutoff value will be at −1k. This relates the mathematical quantity k to the properties of the Normal distribution that we will be working with in the evaluations of the integrals.

Indeed, there is a “gotcha” to watch out for. In the picture of Figure 53-31a we show an obstruction obscuring part of the optical beam. 
If the physical amplitude of the vibrations of the obstruction is small, then k will increase as the average position of the obstruction moves closer and closer to the center of the beam, thus obscuring more of it, and reducing the energy by leaving a smaller and smaller crescent of the beam available. This movement of the obstruction corresponds to larger and larger values of k. Assuming that the distribution of positions of the obstruction is Normal, then the value of k varies inversely with the average size of the crescent left available. When the average position of the obstruction corresponds to just being at the edge of the


beam, then truncation occurs at 0 SD from the maximum of the Normal distribution. But there is nothing to prevent the obstruction from moving even further into obscuring the beam; in such a case light would be passing less than half the time, and the truncation point will have passed the center of the Normal distribution. This behavior is indicated in Figure 53-34, shown as the change in sign of the number of standard deviations corresponding to the point of truncation of the SD in that figure. The “gotcha” is that it would require k to assume an infinite value in order to express the situation where the average position of the obstruction coincided with the edge of the optical beam, a situation which is physically reasonable but mathematically intractable. Therefore our evaluations of the integrals will be based on specifying the truncation point in terms of the standard deviation of the position of the obstruction, rather than in terms of the obscuration of the optical beam.

EVALUATION OF THE FUNCTION

We are now ready to evaluate the expressions in equation 53-59 and substitute them into equation 53-5. We will use the same value of k for both sample and reference beams. By having k the same, the results will be independent of the transmittance of the sample, as discussed previously. It also eases our task, since we will not have to compute a family of curves, but only one curve representing the change in computed transmittance as k varies. Evaluating it this way also eliminates the need to perform a double integration; we can simply keep the sample transmittance constant at unity, and plot the variation in computed transmittance. As described above, we do not compute the integral as a function of k directly. Rather, we compute it as a function of the point of truncation of the Normal distribution, which we allow to vary from +3 SDs to −3 SDs as the parameter. Figure 53-35 shows

Figure 53-35 Transmittance multiplication factor as a function of the lower cutoff limit of the Normal distribution, for varying values of the center of the distribution. (Ordinate: transmittance multiplication factor; abscissa: lower cutoff limit; curves are labeled from Center = −3 to Center = 3.)

how the multiplication factor varies for the various positions of the center of the obstruction. When the noise is small the multiplication factor approaches unity, as we would expect. As we have seen for the previous two types of noise we considered, the nonlinearity in the computation of transmittance causes the expected value of the computed transmittance to increase as the energy approaches zero, and then decrease again. For the type of noise we are currently considering, however, the situation is complicated by the truncation of the distribution, as we have discussed, so that when only the tail of the distribution is available (i.e., when the distribution is cut off at +3 standard deviations), the character changes from that seen when most of the distribution is used.

Noise

To derive the transmittance noise for the case of large scintillation noise, we begin at a somewhat earlier point than we did for the low-noise case, with equation 41-14 [2]:

$$\mathrm{Var}(\Delta T) = \mathrm{Var}\left(\frac{\Delta E_s\,E_r}{E_r\,(E_r + \Delta E_r)}\right) + \mathrm{Var}\left(\frac{-E_s\,\Delta E_r}{E_r\,(E_r + \Delta E_r)}\right) \tag{53-14}$$

Attempting to solve this equation for the scintillation noise situation raises the same difficulties as the previous investigations of noise in the low-signal (high-noise) regime: the inseparability of the Er and Es terms in the denominator, and the generation of infinities in the integrals when attempting to evaluate it. In this case, however, we cannot make the infinities go away or ignore their existence. In the case of constant detector noise, we assumed the infinity away by making the assumption that no measurement would ever coincide with an exact zero value of the noise, since the probability of that would be infinitesimally small. In the case of Poisson noise we were also able to assume the infinity away by making the assumption that since the Poisson distribution represented a discrete distribution, then even though it could in fact take the value of exactly zero, if this occurred in the denominator of the transmittance computation the user would reject that reading, and it would not be included with the data. Therefore we were justified in rejecting readings with zero in the denominator from our calculations.

In the case of scintillation noise, however, we cannot do either of those things. By the physical picture we set up to describe the situation, it can in fact occur that the obstruction would completely block the optical beam and allow zero energy through, yet since it represents a continuum of values we do not see a justification to arbitrarily reject those readings. Therefore we cannot see a clear path to trying to determine the noise performance of such a system, since it will inevitably come out as infinite in all cases. This seems to be a good stopping point. 
The title of this chapter is “Chemometrics in Spectroscopy” and for the past several chapters we have departed somewhat from that general topic to discuss in some detail the very specialized question of noise in spectra. While not outside the range of interest covered by the chapter’s intent, it is somewhat near the edges of what might be considered the mainstream purview of the chapter, and it is time to return to a more mainstream discussion, or at least one closer to the center of the topic.


In creating chemometric calibrations, it is common to transform the spectrum, for any of various reasons, from the measured format, which is usually absorbance, into a different format. One common, widely used transformation is to compute a derivative of the spectrum. First (dA/dλ) and second (d²A/dλ²) derivatives are often used. Hence, in our next few chapters we will be discussing the properties and behavior of derivatives.

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 15(10), 24–25 (2000).
2. Mark, H. and Workman, J., Spectroscopy 15(11), 20–23 (2000).
3. Mark, H. and Workman, J., Spectroscopy 15(12), 14–17 (2000).
4. Mark, H. and Workman, J., Spectroscopy 16(2), 44–52 (2001).
5. Mark, H. and Workman, J., Spectroscopy 16(4), 34–37 (2001).
6. Mark, H. and Workman, J., Spectroscopy 16(5), 20–24 (2001).
7. Mark, H. and Workman, J., Spectroscopy 16(7), 36–40 (2001).
8. Mark, H. and Workman, J., Spectroscopy 16(11), 36–40 (2001).
9. Mark, H. and Workman, J., Spectroscopy 16(12), 23–26 (2001).
10. Mark, H. and Workman, J., Spectroscopy 17(1), 42–49 (2001).
11. Mark, H. and Workman, J., Spectroscopy 17(6), 24–25 (2002).
12. Mark, H. and Workman, J., Spectroscopy 17(12), 38–41, 56 (2002).
13. Mark, H. and Workman, J., Spectroscopy 17(12), 123–125 (2002).
14. Owen, D.B., Handbook of Statistical Tables (Addison-Wesley Publishing Co., Inc., Reading, MA, 1962).


54 Derivatives in Spectroscopy: Part 1 – The Behavior of the Derivative

THE BEHAVIOR OF THEORETICAL DERIVATIVES

Derivatives of spectra (dT/dλ or dA/dλ, and their wavenumber equivalents in FTIR) have been known and used in spectroscopy for a long time. Both first derivatives and second derivatives (d²T/dλ² or d²A/dλ²) are in common use in modern spectroscopy, particularly in NIR spectroscopy. They also enjoy widespread use in some nonoptical spectroscopic techniques, such as NMR and ESR spectroscopies. The mathematics and behavior of the derivative are independent of the particular spectroscopic technique to which it is applied. But since our own backgrounds are in optical spectroscopy, where pertinent we will discuss it in terms of the spectroscopy we are familiar with.

Studies of the application of derivatives to spectroscopy go back at least as far as 1953 [1–3]. A more recent paper contains a good bibliography of the work prior to its appearance [4]. Since NIR spectroscopy became a popular analytical technique, the routine use of derivative spectra has burgeoned along with the application of this method of spectroscopic analysis. Along with the increased applicability, interest has grown in the background and behavior of derivatives. Dave Hopkins especially has led the way in understanding the behavior of first and second derivatives, particularly their computation using Savitzky-Golay convolution functions [5, 6]. We do not plan to deal with that aspect too extensively at this time, however.

The application of derivatives is not without problems, especially when the concern is to accurately represent the derivative of a given data spectrum. Understanding the nature of the problems encountered, so that the proper decisions can be made regarding how the derivative should be calculated, is therefore crucial to obtaining optimum results. Figure 54-1 illustrates some of the problems of derivatives. 
This figure also illustrates some of the basic behaviors underlying the use of the derivatives for spectroscopic analysis. The top curve in Figure 54-1 represents a synthetic spectrum, with two Gaussian (Normal) bands, one of 20 nm bandwidth and one of 60 nm bandwidth. Spectroscopic band shapes are conventionally considered to be either Gaussian or Lorentzian; in this chapter we will concentrate on Gaussian band shapes, therefore all our figures are based on Gaussian-shaped bands. We will, however, treat Lorentzian bands at appropriate points. In Figure 54-1 we therefore present Normal bands, with a spacing between wavelength points of 1 nm, a number that will become important later on. The middle curve represents the first “derivative” and the bottom curve the second “derivative” of the absorbance band. We have been putting the term “derivative” in quotes because these curves are, in fact, not true derivatives. The definition of a derivative

Figure 54-1 Two Gaussian absorbance bands and their respective first and second “derivatives” (finite differences). The top spectrum represents a synthetic Gaussian absorbance spectrum, the middle a first “derivative” and the bottom a second “derivative”. Note that the ordinate of the first “derivative” has been expanded by a factor of 10 and the second “derivative” by another factor of 10. The wavelength spacing between data points is 1 nm. The narrow band has a bandwidth (FWHH) of 20 nm, the broad one is 60 nm. (Abscissa: wavelength; the expanded views of the two bands are marked with ΔY and ΔX.)

includes the step of taking a limit as differences approach zero. In the real world, with real data we can never calculate a true derivative, since we must compute the differences between finite data points, and these must be taken over finite intervals, so that computed derivatives are approximations to the actual derivative. The absorbance spectrum in Figure 54-1 is made from synthetic data, but mimics the behavior of real data in that both are represented by data points collected at discrete and (usually) uniform intervals. Therefore the calculation of a “derivative” from actual data is really the computation of finite differences, usually between adjacent data points. We will now remove the quotation marks from around the term, and simply call all the finite-difference approximations a derivative. As we shall see, however, often data points that are more widely spread are used. If the data points are sufficiently close together, then the approximation to the true derivative can be quite good. Nevertheless, a true derivative can never be measured when real data is involved. Figure 54-1, however, still shows a number of characteristics that reveal the behavior of derivatives. First of all, we note that the first derivative crosses the X-axis at the wavelength where the absorbance peak has a maximum, and has maximum values (both positive and negative) at the point of maximum slope of the absorbance bands. These characteristics, of course, reflect the definition of the derivative as a measure of the slope of the underlying curve. For Gaussian bands, the maxima of the first derivatives also correspond to the standard deviation of the underlying spectral curve.
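The behavior just described can be reproduced with a short sketch. It assumes a single constant-height Gaussian band with a 20 nm bandwidth, interpreted here as full width at half height, sampled every 1 nm; the wavelength range, the band position and the function names are illustrative choices, not taken from the figure.

```python
import math

# Conversion from full width at half height to the Gaussian sigma
# (assumption: the quoted bandwidth is a full width at half height).
FWHM_TO_SIGMA = 1.0 / (2.0 * math.sqrt(2.0 * math.log(2.0)))

def gaussian_band(wavelengths, center, fwhm):
    """Synthetic constant-height Gaussian absorbance band."""
    sigma = fwhm * FWHM_TO_SIGMA
    return [math.exp(-0.5 * ((w - center) / sigma) ** 2) for w in wavelengths]

def first_difference(y):
    """Differences between adjacent points; element i spans points i and i+1."""
    return [y[i + 1] - y[i] for i in range(len(y) - 1)]

def second_difference(y):
    """Three-point second differences; element j is centered on point j+1."""
    return [y[i + 1] - 2.0 * y[i] + y[i - 1] for i in range(1, len(y) - 1)]

wavelengths = list(range(1100, 1301))          # 1 nm spacing
band = gaussian_band(wavelengths, center=1200, fwhm=20)
d1 = first_difference(band)
d2 = second_difference(band)

i_max_d1 = max(range(len(d1)), key=lambda i: d1[i])
i_min_d2 = min(range(len(d2)), key=lambda i: d2[i])
print("largest first difference lies between",
      wavelengths[i_max_d1], "and", wavelengths[i_max_d1 + 1], "nm")
print("most negative second difference is at", wavelengths[i_min_d2 + 1], "nm")
```

The first difference changes sign at the band maximum and peaks roughly one standard deviation away from it, while the second difference has its most negative value at the band maximum.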


The second derivative, in contrast, has its maximum value at the same wavelength as the underlying peak, although in the negative-going direction. The second derivative crosses the X-axis at the points of maximum slope of the absorbance band (the extrema of the first derivative curve), and because of that presents a much sharper-appearing band than the underlying absorbance band does. The problem arises, however, that this “sharpening” effect is accompanied by the creation of two artifact peaks, the two positive-going peaks that flank the negative-going portion of the second derivative. In complicated spectra, therefore, it can sometimes be difficult to distinguish true spectral features from the artifacts created by the second derivative calculation.

Finally, we note that the magnitude of both the first and the second derivatives of the narrow absorbance band is considerably greater than the corresponding magnitudes for the wider absorbance band. This characteristic is a consequence of the fact that the slope of the narrower band really is greater than that of the broader band of the same height, as can be seen in the expanded views of the two absorbance bands in Figure 54-1. For the same ΔX, the narrow absorbance band has a much larger value of ΔY than the broad absorbance band does, therefore ΔY/ΔX (the derivative) is larger for that band. A similar situation is true for the second derivative as well. There is an additional consideration, however: the mathematical definition of a Normal curve includes a premultiplying factor of 1/(σ(2π)^{1/2}), which makes the area under the Normal curve equal to unity. Therefore, the wider the bandwidth, the smaller the maximum value of the curve will be, further reducing the slope as compared to a narrower band.

It is interesting and useful to consider this quantitatively. The expression for the Normal distribution is [7]:

$$Y = \frac{1}{\sigma (2\pi)^{1/2}}\, e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}} \tag{54-1a}$$

The corresponding expression for the Lorentzian distribution is [8] (see p. 211):

$$Y = \frac{2}{\pi\sigma} \times \frac{1}{1 + \left[\frac{2(\mu - X)}{\sigma}\right]^{2}} \tag{54-1b}$$

where σ is the measure of bandwidth (and equals the standard deviation for the Normal curve), and μ is the wavelength corresponding to the peak center. We note parenthetically here that equation 54-1a includes the premultiplying factor for constant area. The expression for a Normal curve of constant maximum height (of unity) will be simply:

$$Y = e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}} \tag{54-2}$$

The first derivative of the Normal distribution, from the expression in equation 54-1a, then, is

$$\frac{dY}{dX} = \frac{1}{\sigma (2\pi)^{1/2}}\, e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}}\, \frac{d}{dX}\left[-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}\right] \tag{54-3}$$

$$\frac{dY}{dX} = \frac{1}{\sigma (2\pi)^{1/2}}\, e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}} \left(-\frac{1}{2\sigma^{2}}\right) \frac{d}{dX}(X-\mu)^{2} \tag{54-4}$$

$$\frac{dY}{dX} = \frac{1}{\sigma (2\pi)^{1/2}}\, e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}} \left(-\frac{1}{2\sigma^{2}}\right) 2\,(X-\mu) \tag{54-5}$$

$$\frac{dY}{dX} = \frac{-(X-\mu)}{\sigma^{3} (2\pi)^{1/2}}\, e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}} \tag{54-6a}$$

Equation 54-6a is derived from the constant-area expression for the Normal curve; from the constant-height expression we obtain

$$\frac{dY}{dX} = \frac{-(X-\mu)}{\sigma^{2}}\, e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}} \tag{54-6b}$$

The origin of the features seen qualitatively in Figure 54-1 can be observed in either of equations 54-6a or 54-6b. When X = μ, the derivative is zero, and the sign of the derivative changes from positive when X < μ to negative when X > μ. The presence of the negative exponential term ensures that the derivative will asymptotically approach zero as X approaches infinity in both directions. Similarly, from equation 54-6a we can derive the expression for the second derivative of the Normal distribution:

$$\frac{d^{2}Y}{dX^{2}} = \frac{-(X-\mu)}{\sigma^{3} (2\pi)^{1/2}}\, e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}}\, \frac{d}{dX}\left[-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}\right] + e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}}\, \frac{d}{dX}\left[\frac{-(X-\mu)}{\sigma^{3} (2\pi)^{1/2}}\right] \tag{54-7}$$

$$\frac{d^{2}Y}{dX^{2}} = \frac{-(X-\mu)}{\sigma^{3} (2\pi)^{1/2}}\, e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}} \left[\frac{-1}{2\sigma^{2}}\, 2\,(X-\mu)\right] + e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}} \left[\frac{-1}{\sigma^{3} (2\pi)^{1/2}}\right] \tag{54-8}$$

$$\frac{d^{2}Y}{dX^{2}} = \left[\frac{(X-\mu)^{2}}{\sigma^{5} (2\pi)^{1/2}} - \frac{1}{\sigma^{3} (2\pi)^{1/2}}\right] e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}} \tag{54-9a}$$

And from equation 54-6b we similarly obtain

$$\frac{d^{2}Y}{dX^{2}} = \left[\frac{(X-\mu)^{2}}{\sigma^{4}} - \frac{1}{\sigma^{2}}\right] e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}} \tag{54-9b}$$

For the Lorentzian distribution, from equation 54-1b the first derivative is

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{-1}{\left\{1 + \left[\frac{2(\mu - X)}{\sigma}\right]^{2}\right\}^{2}}\, \frac{d}{dX}\left\{1 + \left[\frac{2(\mu - X)}{\sigma}\right]^{2}\right\} \tag{54-10}$$

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{8\sigma^{2}(\mu - X)}{\left[\sigma^{2} + 4(\mu - X)^{2}\right]^{2}} \tag{54-11}$$
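As a sanity check on the first-derivative and Normal second-derivative expressions, the following sketch compares equations 54-6a, 54-9a and 54-11 with central finite differences taken over a very small interval. The band position, the bandwidth value, the step size and the helper names are illustrative assumptions.

```python
import math

SQRT_2PI = math.sqrt(2.0 * math.pi)

def normal(x, mu, sigma):
    """Constant-area Normal band, equation 54-1a."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * SQRT_2PI)

def normal_d1(x, mu, sigma):
    """First derivative of the Normal band, equation 54-6a."""
    g = math.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return -(x - mu) / (sigma ** 3 * SQRT_2PI) * g

def normal_d2(x, mu, sigma):
    """Second derivative of the Normal band, equation 54-9a."""
    g = math.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return ((x - mu) ** 2 / (sigma ** 5 * SQRT_2PI)
            - 1.0 / (sigma ** 3 * SQRT_2PI)) * g

def lorentzian(x, mu, sigma):
    """Lorentzian band, equation 54-1b."""
    return 2.0 / (math.pi * sigma) / (1.0 + (2.0 * (mu - x) / sigma) ** 2)

def lorentzian_d1(x, mu, sigma):
    """First derivative of the Lorentzian band, equation 54-11."""
    return (2.0 / (math.pi * sigma) * 8.0 * sigma ** 2 * (mu - x)
            / (sigma ** 2 + 4.0 * (mu - x) ** 2) ** 2)

def central_difference(f, x, h=1e-3):
    """Symmetric finite-difference estimate of df/dx at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

MU, SIGMA = 1200.0, 8.5   # illustrative band center and bandwidth, in nm
for x in (1185.0, 1195.0, 1200.0, 1210.0):
    numeric = central_difference(lambda t: normal(t, MU, SIGMA), x)
    print(f"X = {x}: analytic {normal_d1(x, MU, SIGMA):+.6e}, "
          f"finite difference {numeric:+.6e}")
```

Agreement between the analytic and finite-difference values at several wavelengths gives some confidence that the signs and powers of σ in the expressions are right.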


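And then the second derivative of the Lorentzian distribution is

$$\frac{d^{2}Y}{dX^{2}} = \frac{2}{\pi\sigma} \times \left\{ \frac{\left[\sigma^{2} + 4(\mu - X)^{2}\right]^{2} \frac{d}{dX}\left[8\sigma^{2}(\mu - X)\right]}{\left[\sigma^{2} + 4(\mu - X)^{2}\right]^{4}} - \frac{8\sigma^{2}(\mu - X)\, \frac{d}{dX}\left[\sigma^{2} + 4(\mu - X)^{2}\right]^{2}}{\left[\sigma^{2} + 4(\mu - X)^{2}\right]^{4}} \right\} \tag{54-12}$$

$$\frac{d^{2}Y}{dX^{2}} = \frac{2}{\pi\sigma} \times \left\{ \frac{-8\sigma^{2}\left[\sigma^{2} + 4(\mu - X)^{2}\right]^{2}}{\left[\sigma^{2} + 4(\mu - X)^{2}\right]^{4}} - \frac{8\sigma^{2}(\mu - X) \times 2\left[\sigma^{2} + 4(\mu - X)^{2}\right] \frac{d}{dX}\left[\sigma^{2} + 4(\mu - X)^{2}\right]}{\left[\sigma^{2} + 4(\mu - X)^{2}\right]^{4}} \right\} \tag{54-13}$$

$$\frac{d^{2}Y}{dX^{2}} = \frac{2}{\pi\sigma} \times \left\{ \frac{-8\sigma^{2}\left[\sigma^{2} + 4(\mu - X)^{2}\right]^{2}}{\left[\sigma^{2} + 4(\mu - X)^{2}\right]^{4}} - \frac{16\sigma^{2}(\mu - X)\left[\sigma^{2} + 4(\mu - X)^{2}\right]\left\{\frac{d}{dX}\sigma^{2} + \frac{d}{dX}\left[4(\mu - X)^{2}\right]\right\}}{\left[\sigma^{2} + 4(\mu - X)^{2}\right]^{4}} \right\} \tag{54-14}$$

$$\frac{d^{2}Y}{dX^{2}} = \frac{2}{\pi\sigma} \times \left\{ \frac{-8\sigma^{2}\left[\sigma^{2} + 4(\mu - X)^{2}\right]^{2}}{\left[\sigma^{2} + 4(\mu - X)^{2}\right]^{4}} - \frac{16\sigma^{2}(\mu - X)\left[\sigma^{2} + 4(\mu - X)^{2}\right]\left[-8(\mu - X)\right]}{\left[\sigma^{2} + 4(\mu - X)^{2}\right]^{4}} \right\} \tag{54-15}$$

$$\frac{d^{2}Y}{dX^{2}} = \frac{16}{\pi} \times \left\{ \frac{12\sigma(\mu - X)^{2} - \sigma^{3}}{\left[\sigma^{2} + 4(\mu - X)^{2}\right]^{3}} \right\} \tag{54-16}$$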
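The Lorentzian second derivative involves enough algebra that a numerical cross-check is worthwhile. The sketch below compares equation 54-16 with a central finite difference of equation 54-11; the band position, the bandwidth, the step size and the function names are illustrative assumptions.

```python
import math

def lorentzian_d1(x, mu, sigma):
    """First derivative of the Lorentzian band, equation 54-11."""
    return (2.0 / (math.pi * sigma) * 8.0 * sigma ** 2 * (mu - x)
            / (sigma ** 2 + 4.0 * (mu - x) ** 2) ** 2)

def lorentzian_d2(x, mu, sigma):
    """Second derivative of the Lorentzian band, equation 54-16."""
    return (16.0 / math.pi * (12.0 * sigma * (mu - x) ** 2 - sigma ** 3)
            / (sigma ** 2 + 4.0 * (mu - x) ** 2) ** 3)

def central_difference(f, x, h=1e-3):
    """Symmetric finite-difference estimate of df/dx at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

MU, SIGMA = 1200.0, 20.0   # illustrative band center and bandwidth, in nm
for x in (1180.0, 1195.0, 1200.0, 1215.0):
    numeric = central_difference(lambda t: lorentzian_d1(t, MU, SIGMA), x)
    print(f"X = {x}: equation 54-16 {lorentzian_d2(x, MU, SIGMA):+.6e}, "
          f"finite difference {numeric:+.6e}")
```

Differencing the first derivative rather than the band itself keeps the check to a single finite-difference step, which limits round-off error.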

Going back to equations 54-6 and 54-11, how do the magnitudes of the derivatives change with σ? Since the maximum first derivative occurs when X − μ = σ, let us substitute σ for X − μ in equation 54-6a; for the Normal distribution we get:

$$\left(\frac{dY}{dX}\right)_{MAX} = \frac{-\sigma}{\sigma^{3} (2\pi)^{1/2}}\, e^{-\frac{1}{2}\left(\frac{\sigma}{\sigma}\right)^{2}} = \frac{-e^{-1/2}}{\sigma^{2} (2\pi)^{1/2}} \tag{54-17}$$

and in equation 54-11 for the Lorentzian distribution:

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{8\sigma^{2}(\sigma)}{\left[\sigma^{2} + 4(\sigma)^{2}\right]^{2}} = \frac{2}{\pi\sigma} \times \frac{8\sigma^{3}}{\left(5\sigma^{2}\right)^{2}} = \frac{16}{25\pi\sigma^{2}} \tag{54-18}$$

For the Normal distribution, the exponential term has become a constant, and we see that the maximum magnitude of the derivative is inversely proportional to σ² (for the constant-area expression) or inversely proportional to σ (for the constant-height expression). This confirms our observation from Figure 54-1. For the Lorentzian distribution, we see that the derivative decreases with the second power of the bandwidth. Similarly, the maximum second derivative occurs when X = μ, so inserting this equality into equation 54-9a for the Normal distribution gives us:

$$\left(\frac{d^{2}Y}{dX^{2}}\right)_{MAX} = \left[\frac{(\mu - \mu)^{2}}{\sigma^{5} (2\pi)^{1/2}} - \frac{1}{\sigma^{3} (2\pi)^{1/2}}\right] e^{-\frac{1}{2}\left(\frac{\mu-\mu}{\sigma}\right)^{2}} = \left[0 - \frac{1}{\sigma^{3} (2\pi)^{1/2}}\right] e^{0} = \frac{-1}{\sigma^{3} (2\pi)^{1/2}} \tag{54-19}$$

And substituting X − μ = 0 into equation 54-16 gives us the corresponding value for the Lorentzian distribution:

$$\left(\frac{d^{2}Y}{dX^{2}}\right)_{MAX} = \frac{16}{\pi} \times \frac{12\sigma(0)^{2} - \sigma^{3}}{\left[\sigma^{2} + 4(0)^{2}\right]^{3}} = \frac{16}{\pi} \times \frac{-\sigma^{3}}{\sigma^{6}} = \frac{-16}{\pi\sigma^{3}} \tag{54-20}$$

The negative sign in equations 54-19 and 54-20 reflects the fact that the maximum second derivative is a negative value, which also agrees with Figure 54-1. These equations also tell us that the magnitude of the second derivative decreases inversely as the cube of σ, for both the Normal and the Lorentzian band shapes; that is, it decreases as the bandwidth of the absorbance band increases. This explains why the derivatives of the broad absorbance band decrease with respect to the narrow absorbance band as we see in Figure 54-1, and more so as the derivative order increases.
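To put numbers on the comparison between the narrow and broad bands, the sketch below evaluates the constant-height expressions (equations 54-6b and 54-9b) on a fine grid for bandwidths of 20 and 60 nm and takes the ratio of the largest magnitudes. Treating the quoted bandwidths as full widths at half height, centering the band at zero, and the grid range and step are all assumptions made for illustration.

```python
import math

# Conversion from full width at half height to the Gaussian sigma
# (assumption: the quoted bandwidths are full widths at half height).
FWHM_TO_SIGMA = 1.0 / (2.0 * math.sqrt(2.0 * math.log(2.0)))

def d1_constant_height(x, sigma):
    """Equation 54-6b, with the band centered at zero."""
    return -x / sigma ** 2 * math.exp(-0.5 * (x / sigma) ** 2)

def d2_constant_height(x, sigma):
    """Equation 54-9b, with the band centered at zero."""
    return (x ** 2 / sigma ** 4 - 1.0 / sigma ** 2) * math.exp(-0.5 * (x / sigma) ** 2)

def max_magnitude(derivative, sigma, half_range=200.0, step=0.05):
    """Largest absolute value of the derivative on a fine grid."""
    n = int(round(2.0 * half_range / step))
    return max(abs(derivative(-half_range + i * step, sigma)) for i in range(n + 1))

narrow = 20.0 * FWHM_TO_SIGMA
broad = 60.0 * FWHM_TO_SIGMA

ratio_d1 = max_magnitude(d1_constant_height, narrow) / max_magnitude(d1_constant_height, broad)
ratio_d2 = max_magnitude(d2_constant_height, narrow) / max_magnitude(d2_constant_height, broad)
print("first-derivative ratio, narrow/broad: ", round(ratio_d1, 3))
print("second-derivative ratio, narrow/broad:", round(ratio_d2, 3))
```

For constant-height bands the ratios follow the first and second powers of the bandwidth ratio; the constant-area expressions (54-6a and 54-9a) carry one additional factor of 1/σ from the premultiplying term.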

THE BEHAVIOR OF COMPUTED DERIVATIVES

Now, equations 54-6 and 54-9 are mathematically exact. But we observed when discussing Figure 54-1 that a representation of a derivative based on finite differences is only an approximation. How good is this approximation, and how quickly does it get bad? That depends somewhat on how the derivative is calculated. We made a point of noting that the derivative in Figure 54-1 was calculated from synthetic data, with abscissa (wavelength) spacing of 1 nm. This value of spacing was chosen so that the two methods of calculation would default to the same result. We noted above that the definition of a derivative includes the operation of division by ΔX (or by dX, in the mathematically exact case). Some computer programs that purport to calculate


derivatives do not include the step of performing that division, while others do. The results will vary considerably in the two cases. We will begin our discussion by considering the simpler case, where we do not divide by ΔX. This provides the numerator term for the derivative definition, and also for the approximation; this allows us to examine the behavior of that term in isolation. In some cases, this is all that is used or needed: it provides a qualitative observation of the overall shape of the spectrum that is of interest, for example. Sometimes it is done this way when the data is used for quantitative or qualitative analysis, and the spectral data from the “unknown” samples (the samples which are to be analyzed on a routine basis) are treated the same way as the calibration data. Indeed, since the numerator term differs from the correct derivative approximation only by a scaling factor, it can be difficult to tell just from looking at the derivative curve whether it is a correctly calculated derivative or not, especially if the scale is not present. On the other hand, computing only the numerator term is not recommended when results are to be compared between different instruments or laboratories. It is also not recommended when theoretical studies are of interest, or when the results of experiments are to be compared to theoretical expectations, since it does not, in general, reflect the actual value of the true derivative. Given the minor computational burden, the proper computation, including the division, should always be done. Here we start with the examination of the numerator term alone for its pedagogical value.

The question arises: since the definition of the derivative specifies taking a limit as differences approach zero, would not the best results be obtained from using the smallest possible differences? The answer is “yes, but . . .”. 
The “but” reflects the fact that while synthetic data is noise-free, real data contains noise. In this chapter we consider only the noise-free synthetic data we create, but it is clear that with real data, containing real and irreducible noise, computing smaller and smaller differences will eventually bring us to the point where the differences equal and then become less than the noise level. Derivative calculations are indeed known to be fraught with noise problems. In the interest of examining the behavior of the derivative, however, we are going to ignore the effect of the noise in this chapter, although we will eventually return to that question.

One way to minimize noise effects is to exaggerate the differences, by computing finite differences at larger and larger wavelength intervals, and this is often done in practice. Figure 54-2 illustrates an example of this. In Figure 54-2 we present the results of computing finite difference approximations to a derivative (for the Normal case), using different spacings (i.e., the wavelength difference between the data points between which we compute the finite difference; we will sometimes call this ΔX and freely intermix the two terms). For the derivatives in Figure 54-2, the underlying absorbance curve is the narrower one from Figure 54-1, having a 20 nm bandwidth. We see from Figure 54-2 that, in contrast to the mathematically ideal behavior of a true derivative, the behavior of a finite difference depends on how it is calculated. As Figure 54-2a shows, at small spacings, the shape of the computed difference curve closely mimics the true derivative, and has a magnitude that is proportional to the spacing. Figure 54-2b shows that as the spacing increases, several changes occur.

1) The relationship between the difference spacing and the magnitude of the derivative departs from the degree of proportionality we observe at smaller spacings. As the spacing increases, the maximum value of the computed difference asymptotically approaches the value of unity.
2) There is a shift in the wavelength corresponding to the maximum value of the derivative.
3) Close examination of Figure 54-2b will reveal a decrease in the slope of the difference curve at the point it crosses the X-axis, even though we are not using the denominator term of the derivative calculation.

Figure 54-2c shows that at sufficiently large spacing values, the concept of this being a derivative breaks down entirely. The derivative curve has separated into two features, each of them appearing to be a Normal curve, although one of them is negative. As the spacing continues to increase, the two features move further and further apart.
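The dependence of the numerator term on the spacing can be explored with a few lines of code. This sketch assumes a constant-height Gaussian band of 20 nm bandwidth (taken here as full width at half height) on a 1 nm grid, with the difference at each point taken between that point and the point one spacing further along; the grid length, the band position and that indexing convention are assumptions, so the numbers need not match the figure exactly.

```python
import math

# 20 nm bandwidth, interpreted as full width at half height (an assumption)
SIGMA = 20.0 / (2.0 * math.sqrt(2.0 * math.log(2.0)))
CENTER = 200
band = [math.exp(-0.5 * ((x - CENTER) / SIGMA) ** 2) for x in range(401)]  # 1 nm grid

def first_difference(y, spacing):
    """Numerator term only: y[i + spacing] - y[i], with no division by the spacing."""
    return [y[i + spacing] - y[i] for i in range(len(y) - spacing)]

def extremes(y, spacing):
    """Indices of the most positive and most negative first difference."""
    d = first_difference(y, spacing)
    i_max = max(range(len(d)), key=lambda i: d[i])
    i_min = min(range(len(d)), key=lambda i: d[i])
    return i_max, i_min, d[i_max]

for spacing in (1, 2, 5, 40, 90):
    i_max, i_min, peak = extremes(band, spacing)
    print(f"spacing = {spacing:3d} nm: maximum difference = {peak:.4f}, "
          f"positive and negative features {i_min - i_max} nm apart")
```

At small spacings the maximum grows almost in proportion to the spacing; at large spacings it saturates near unity and the positive and negative features sit one spacing apart.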

Figure 54-2 First differences calculated using different spacings between the data points used to calculate the finite difference for the numerator term only, as an approximation to the derivative. The underlying curve is the 20 nm bandwidth absorbance band in Figure 54-1, with data points every nm. Figure 54-2a: Difference spacings = 1–5 nm; Figure 54-2b: Spacings = 5–40 nm; Figure 54-2c: Spacings = 40–90 nm. (Ordinate: first difference; abscissa: wavelength.) (see Colour Plate 19)

Figure 54-3 Showing a “derivative” computed over a very large spacing explains how the difference approximation to the derivative breaks down. With a large spacing, one point used for the difference is on the baseline, while the other traces over the shape of the curve. (Ordinate: absorbance; abscissa: wavelength; the differences shown are labeled d1 to d4.)

Figure 54-3 shows how this occurs. When the spacing is very wide, that is, wider than the breadth of the absorbance band near the baseline, one of the points used to compute the difference is always on the baseline, while the other point “rides” over the peak and traces its shape. As the point of the “derivative” slides along the X-axis, eventually the two points exchange roles, and the other feature is traced out, but with the opposite sign.

Now we look at the second derivative similarly. Some of this has been presented previously in the literature [9, 10], although in less detail than we do here. Figures 54-4a to 54-4c present second derivatives calculated using the same spacing as for the


differences in Figure 54-2. In Figure 54-4 we see that the second derivative is subject to some of the same effects as the first derivative:

• Linear (proportional) change in amplitude at small spacings
• Nonlinear change in amplitude at large spacings

On the other hand, there is no shift in the wavelength of the central maximum, although Figures 54-4b and 54-4c show that the artifact peaks do change their wavelength. Replacing the shift in wavelength, however, is a broadening of the central peak.

Figure 54-4 Second differences calculated using different spacings between the data points used to calculate the finite difference for the numerator term only, as an approximation to the derivative. The underlying curve is the 20 nm bandwidth absorbance band in Figure 54-1, with data points every nm. Figure 54-4a: Difference spacings = 1–5 nm; Figure 54-4b: Spacings = 5–40 nm; Figure 54-4c: Spacings = 40–90 nm. (Ordinate: second difference; abscissa: wavelength.) (see Colour Plate 20)

We noted above that one characteristic of the second derivative is the narrowing of this peak compared to the underlying absorbance band. As the spacing over which the derivative is computed increases, however, this resolution enhancement effect decreases and eventually disappears. The reason is similar to that for the first derivative, as shown in Figure 54-3; at very large spacings the points used to compute the derivative eventually wind up simply tracing over the underlying absorbance band, with the result that, since second derivatives are essentially computed from three points, three copies of the underlying absorbance band are produced, albeit with different signs.

In Figure 54-5 we show the variation of the computed derivatives as determined by the spacing used in the computation. Another feature that can be seen in Figure 54-5, which is also observable in Figure 54-4 albeit with some difficulty, is that at small spacing the maximum second derivative value is not simply proportional to the spacing but changes faster than proportionately to the spacing; the overall curve of calculated derivative value versus spacing is sigmoidal.

Figure 54-5 Maximum computed derivative magnitude determined by the spacing of the points used in the computation. Note that the sign of the second derivative has been reversed to simplify comparison with the first derivative behavior. (Ordinate: derivative value; abscissa: spacing; curves: first derivative and second derivative.)

We continue in our next chapter by examining the behavior of the derivative calculation when the ΔY term is divided by the ΔX term, to form an approximation to the true derivative.

REFERENCES
1. Singleton, F. and Collier, G.L., Britain 760, 729 (1953).
2. Singleton, F. and Collier, G.F., (London), 1519 (1955).
3. Giese, A.T. and French, C.S., 9, 78 (1955).
4. Low, M.J.D. and Mark, H., 241, 129–130 (1970).
5. Hopkins, D., NIR News 12(3), 3–5 (2001).
6. Hopkins, D., Near Infrared Analysis 2(1–13), (2001).
7. Mark, H. and Workman, J., Spectroscopy 2(9), 37–43 (1987).
8. Ingle, J.D. and Crouch, S.R., Spectrochemical Analysis (Prentice-Hall, Upper Saddle River, NJ, 1988).
9. Ritchie, G.E. and Mark, H., NIR News 13(1), 4–6 (2002).
10. Ritchie, G.E. and Mark, H., NIR News 13(2), 3–5 (2002).

55

Derivatives in Spectroscopy: Part 2 – The “True” Derivative

We continue where we left off in Chapter 54 [1], and we start with some discussion regarding the observations we made concerning the change in the magnitude of the computed values of the derivatives (first and second) as the wavelength spacing over which they are computed is changed. As we normally do when continuing a subseries, we continue the equation numbering and figure numbering from where we left off in the previous chapter.

To recap, we noted that at small spacings, the numerator of the computed approximation to the derivative was a close approximation to the shape of the true derivative, and the magnitude increased as the spacing increased, linearly for the first derivative and faster than linearly for the second derivative. In fairly short order, however, in both cases the rate of increase of the derivative magnitude started falling off as the spacing continued to increase. The falloff in the rate of increase was accompanied by some secondary effects: wavelength shifts of the peak derivative value, and various kinds of distortion of the shape of the derivative. At very large spacings (larger than the bandwidth of the peak) the “derivative” was replaced by what was essentially a tracing of the shape of the underlying peak, a double tracing for the first derivative and triple for the second derivative.

At small spacing values, however, it now becomes clear why increasing the spacing is desirable. Since in real data the noise of the measured spectrum is constant (because the underlying spectrum from which the various derivative approximations are calculated is the same spectrum each time), increasing the spacing of the derivative computation increases the “signal” part of the signal-to-noise (S/N) ratio, thereby improving the S/N ratio. As we saw, however, too-large spacings were deleterious, both for distorting the shape of the peak and for producing inaccurate approximations to the derivative numerator.

So now the question arises, what is the nature of the way the magnitude increases with spacing? In Figure 55-6a we show an expanded view of the first derivative of the Normal curve, in the region around the maximum of the underlying (Normal) absorbance band. Similarly, Figure 55-6b shows the corresponding view for the second derivative. The first derivative is well approximated by a straight line in this region. The second derivative is seen to be approximated by a parabola, a not unexpected result when considering that this represents the result obtained from a truncated Taylor series approximation of the curve. Therefore, for a first derivative, as the ΔX spacing increases, the magnitude of the calculated “derivative” increases proportionately. In the case of the second derivative, increasing the ΔX spacing causes the magnitude of the calculated derivative to increase as the square of the spacing; this is the source of the initial upward curvature we noted in Figure 54-5 (reference [1]) for the second derivative.
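The proportional and squared dependence on spacing can be made explicit with Taylor expansions of the band Y(X). The expansions below are a sketch under two assumed conventions that the text does not spell out: the first difference is taken between two points centered on X and separated by ΔX, and the second difference uses three points with adjacent points separated by ΔX.

```latex
\begin{align*}
Y\!\left(X + \tfrac{\Delta X}{2}\right) - Y\!\left(X - \tfrac{\Delta X}{2}\right)
  &= Y'(X)\,\Delta X + \frac{Y'''(X)}{24}\,\Delta X^{3} + \cdots \\
Y(X + \Delta X) - 2\,Y(X) + Y(X - \Delta X)
  &= Y''(X)\,\Delta X^{2} + \frac{Y''''(X)}{12}\,\Delta X^{4} + \cdots
\end{align*}
```

For small ΔX the leading terms dominate, so the numerator of the first difference grows as ΔX and that of the second difference as ΔX²; the higher-order terms account for the falloff from this behavior as the spacing grows.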

Figure 55-6 Expansions of the first and second derivative curves. Figure 55-6a (first difference versus wavelength): The region around the zero-crossing of the first derivative can be approximated with a straight line. Figure 55-6b (response versus wavelength): The region around the peak of the second derivative can be approximated with a parabola.

BETTER DERIVATIVE APPROXIMATIONS

Now we will examine the behavior of the derivative approximation when both the numerator and the denominator terms are used. In Figure 55-7, we present the curves of this computation of the derivative corresponding to the numerator-only computation presented in Figure 54-2 of Chapter 54 [1]. Here we note several differences between Figure 55-7 and Figure 54-2. In Figure 55-7a we see that there is virtually no difference between any of the five curves; they all produce essentially the same values, in contrast to Figure 54-2, in which the differences increased with spacing. The reason is that for the range of spacings used, all the derivative approximations calculated are reasonably good approximations to the true derivative. Therefore, since they all estimate the same true value, they are all essentially equal to each other.

Figure 55-7 First derivatives (first difference versus wavelength) calculated using different spacings for the finite difference approximation to the true derivative. The underlying curve is the 20 nm bandwidth absorbance band in Figure 54-1, with data points every nm. Figure 55-7a: Difference spacings = 1–5 nm; Figure 55-7b: Spacings = 5–40 nm; Figure 55-7c: Spacings = 40–90 nm. (see Color Plate 21)

In Figure 55-7b, we notice even more differences from the corresponding part of Figure 54-2. The first thing we notice is one characteristic that is the same: the maximum values of the derivative curves shift as the spacing increases. But another difference, one that is at least as prominent, is that the maximum value decreases as the spacing increases; this is exactly opposite to the behavior we noticed in the numerator term, where the maximum increased with the spacing. A third difference we notice from the corresponding part of Figure 54-2 is that at the point where the first derivative crosses the X-axis, the slope of the derivative also decreases with increasing spacing, while in Figure 54-2b the slope increased with the spacing except for the largest values of spacing included in that plot. Similarly, in Figure 55-7c both the maximum value of the derivative and the slope at the zero-crossing decrease, whereas in Figure 54-2c the maximum of the calculated derivative remained constant, although the slope at the zero-crossing decreased.

In the three parts of Figure 55-8, we see that the second derivative behaves similarly, except that it starts out smaller than the first derivative does, by almost an order of magnitude. Figure 55-9 confirms this: the second derivative is smaller than the first (remember, all this is for the Normal distribution; other distributions may behave differently). Figure 55-9 also shows how the correct computation of the derivative differs from the computation of the numerator only, which we saw in Chapter 54 [1]. The “derivative” computed from the numerator term only increased and then leveled off as the spacing increased, whereas Figure 55-9 shows that the correct computation starts out with an (almost) constant value of the derivative, which then decreases, with an asymptotic approach to zero.

Can we explain all these effects? Of course we can, and in fact the explanation is almost obvious. When spacings are small, the computed derivative is a good approximation to the true derivative. As long as this is the case, the exact value of ΔX used to compute the derivative is unimportant, because as we saw in Figure 54-5, the first difference ΔY increases almost linearly with ΔX; therefore all values of ΔX give the same result for the computation, because ΔY/ΔX is constant regardless of spacing.
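The contrast between the numerator-only difference and the quotient ΔY/ΔX can be checked numerically. The following is a minimal sketch, not the authors' computation: it assumes a unit-height Normal band with width parameter 20 nm centered at 100 nm on a 1 nm grid, and the function names are hypothetical.

```python
import numpy as np

SIGMA = 20.0   # assumed width parameter of the Normal band, in nm
MU = 100.0     # assumed band center, in nm


def band(x):
    """Unit-height Normal (Gaussian) absorbance band (assumed model)."""
    return np.exp(-0.5 * ((x - MU) / SIGMA) ** 2)


def first_difference(x, spacing):
    """Numerator only: difference of two ordinates `spacing` apart, centered on x."""
    return band(x + spacing / 2) - band(x - spacing / 2)


def max_numerator_and_quotient(spacing):
    """Largest |dY| and largest |dY/dX| found on a 1 nm wavelength grid."""
    x = np.arange(0.0, 201.0, 1.0)
    d_y = np.abs(first_difference(x, spacing))
    return d_y.max(), d_y.max() / spacing


for spacing in (1, 2, 5, 40, 90):
    numerator, quotient = max_numerator_and_quotient(spacing)
    print(f"spacing={spacing:3d} nm  max|dY|={numerator:.4f}  max|dY/dX|={quotient:.5f}")
```

Under these assumptions the numerator grows with spacing while the quotient stays nearly constant for the small spacings and then falls for the large ones, which is the pattern described for Figures 55-7 and 55-9.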

As we observed from Figure 54-5, however, as ΔX continues to increase, ΔY no longer increases proportionately. Strictly speaking, this happens immediately when ΔX becomes finite, and the question of whether the amount is noticeable is a matter of degree: how much difference it makes in a particular application. Nevertheless, whatever point that is, the initial increase in ΔX carries a corresponding increase in ΔY, and beyond that point the increase is no longer proportional. At that point, the computed value of the estimate of the true derivative starts to decrease.

Figure 55-8 Second derivatives (second difference versus wavelength) calculated using different spacings for the finite difference approximation to the true derivative. The underlying curve is the 20 nm bandwidth absorbance band in Figure 54-1, with data points every nm. Figure 55-8a: Difference spacings = 1–5 nm; Figure 55-8b: Spacings = 5–40 nm; Figure 55-8c: Spacings = 40–90 nm. (see Color Plate 22)

Figure 55-9 Maximum magnitudes of first and second derivative approximations (derivative magnitude versus spacing) as the spacing is varied.

Furthermore, as we also noted last time, at sufficiently large spacings (ΔX) the numerator term ceased to increase. As we noted before, at this point the various points used for the computation are each individually tracing out the shape of the underlying curve. However, as ΔX in the denominator continues to increase, we can expect that the quotient ΔY/ΔX will decrease, and this is the behavior we observe.

One final point to note: we see from Figure 55-9 that, as we noted before, the true value of the second derivative of a Normal curve (at its maximum) is roughly an order of magnitude smaller than the first derivative (or at least, the largest value of the first derivative). In the presence of noise, therefore, the S/N ratio will be degraded by this factor, from this cause alone.

We also have noted before that adding or subtracting noisy data causes the variance to increase as the number of data points added together [2]. The noise of the first derivative, therefore, will be larger than that of the underlying absorbance band by a factor of the square root of two. We also showed previously that if a random variable (i.e., a measurement contaminated with noise) is multiplied by a constant (c, say), then the variance of the product is increased by a factor of c² [3]. A second derivative calculation is equivalent to using the coefficients 1, −2, 1 to multiply three data points spaced at the desired ΔX. The variance of the spectrum, then, is multiplied by 1² + 2² + 1² = 6. Therefore the standard deviation of the noise contribution to a first derivative is √2 times the noise of the spectrum, while the noise contribution to the second derivative is √6 times the noise of the spectrum. Therefore the noise of the second derivative is √3, or roughly 1.7, times that of the first derivative.

So from both aspects, the S/N ratio of the second derivative is worse than that of the first. The increase of the noise is clearly the lesser of the contributions, compared to the full order of magnitude reduction of the “signal” part of the S/N ratio. Second derivatives have become de rigueur as a data treatment of choice for spectral data, and there are reasons for that, which we have discussed. But they also carry with them the burden of a severely reduced S/N ratio compared to first derivatives. When selecting a data treatment, therefore, one should know the disadvantages as well as the benefits of each one. While derivative treatments have been in long use for analysis of spectroscopic data, the quantitative study of the derivative transform has not previously been widely disseminated, but is worth having. There may be times when a second derivative transform is not giving adequate results, and in some of those cases, using a first derivative transform may be preferable.
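The noise factors √2 and √6 follow from summing the squared coefficients, and a quick simulation gives the same numbers. This is a sketch that assumes independent, equal-variance noise at every wavelength; the helper name `noise_gain`, the random seed, and the spacing of 5 points are arbitrary illustrative choices. Only the numerator of the difference is formed here, so no division by ΔX is applied.

```python
import numpy as np


def noise_gain(coefficients):
    """Standard-deviation multiplier for independent, equal-variance noise."""
    c = np.asarray(coefficients, dtype=float)
    return np.sqrt(np.sum(c ** 2))


first_gain = noise_gain([-1, 1])        # first difference
second_gain = noise_gain([1, -2, 1])    # second difference

# Monte Carlo check on pure unit-variance noise (assumed Gaussian).
rng = np.random.default_rng(0)
noise = rng.normal(0.0, 1.0, size=200_000)
s = 5  # spacing in data points; the gain should not depend on it
d1 = noise[s:] - noise[:-s]
d2 = noise[2 * s:] - 2.0 * noise[s:-s] + noise[:-2 * s]

print(f"first difference : predicted {first_gain:.3f}, simulated {d1.std():.3f}")
print(f"second difference: predicted {second_gain:.3f}, simulated {d2.std():.3f}")
print(f"ratio second/first: {second_gain / first_gain:.3f}")
```

The ratio printed on the last line is √3, the factor by which the second difference is noisier than the first.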

REFERENCES
1. Mark, H. and Workman, J., Spectroscopy 18(4), 32–37 (2003).
2. Workman, J. and Mark, H., Spectroscopy 3(3), 40–42 (1988).
3. Mark, H. and Workman, J., Spectroscopy 3(8), 13–15 (1988).

56

Derivatives in Spectroscopy: Part 3 – Computing the Derivative

In Chapters 54 and 55 [1, 2], we discussed the theoretical aspects of using derivatives in the analysis of spectroscopic data. Here we consider some of the practical aspects. The first one we will consider is, in the presence of some arbitrary but more-or-less (presumably) constant amount of noise, what is the optimum spacing of data at which to compute a difference to give the highest signal-to-noise ratio (S/N)? In the face of constant noise, this obviously reduces to the question: what is the spacing (for a Normal distribution) that gives the largest value for the numerator term? Note that the criterion for “best” has changed from our previous discussions, where “best” was considered to be the closest approximation to the true derivative.

We have noted that the largest value of the true first derivative occurs when X − μ = ±σ. Therefore the largest differences between two points will occur when they are varied from μ + σ (or μ − σ) by some amount δ, the spacing, which we need to determine. Therefore we need to determine the largest difference between the ordinates of the Normal band at X = μ + σ − δ/2 and X = μ + σ + δ/2, which, apart from a constant factor, is

\[
e^{-\frac{(\sigma - \delta/2)^{2}}{2\sigma^{2}}} - e^{-\frac{(\sigma + \delta/2)^{2}}{2\sigma^{2}}} \qquad \text{(56-21)}
\]

The first question we need to ask is whether there is, in fact, a maximum value. That there is can be seen from noting that the Normal absorbance band approaches zero as X approaches infinity in both directions. Therefore if δ → ∞ the difference will approach zero. At small values of δ the difference will be finite, while as δ → 0 the difference will again approach zero; therefore there must be a maximum somewhere between 0 and ∞. To get some idea of where that maximum is, in Figure 56-10 we show a plot of the difference as a function of δ, for the Normal absorbance band of 20 nm bandwidth we have shown in Figure 54-1.

Figure 56-10 The difference between the ordinates of two points equally spaced around μ + σ as a function of the spacing (difference versus spacing). In this figure the underlying absorbance curve has a bandwidth of 20 nm.

For a more precise result we must solve equation 56-21, but since it is transcendental, we must solve it by successive approximations. The result of doing so is δmax = 34.28 nm. Since the bandwidth of the underlying absorbance band is 20 nm, the spacing needed for maximizing the first derivative S/N for any Normal absorbance band is therefore 34.28/20 = 1.714 times the bandwidth. However, this analysis is based on considering a single peak in isolation; as we will see for the second derivative, at some point it becomes necessary to take into account the presence and nature of whatever other materials exist in the sample.

The second derivative is both simpler and more complicated to deal with. As we saw, the second derivative is maximum at the wavelength of the peak of the underlying absorbance curve, and we noted previously that the numerator term at that point increases monotonically with the spacing (see Figures 54-4 and 54-5 in [1]). Therefore we expect the S/N of the second derivative to improve continually as the spacing becomes larger and larger.

While the “signal” part of the second derivative increases with the spacing used, the noise of the computed second derivative is independent of the spacing. It is, however, larger than the noise of the underlying spectrum. As we have shown [3], from elementary statistical considerations multiplying a random variable X by a constant A causes the variance of the product AX to be multiplied by A² compared to the variance of X itself. Now, regardless of the spacing of the terms used to compute the second derivative, the operative multipliers for the data at the three wavelengths used are 1, −2, 1. Therefore the multiplier for the variance of the derivative is 1² + 2² + 1² = 6, and the standard deviation of the derivative is therefore √6 times the standard deviation of the spectrum, but nevertheless independent of the derivative spacing. The signal-to-noise ratio of the second derivative is therefore determined solely by the magnitude of the computed numerator value, which, as we have seen, increases with spacing. In real samples, however, the wider the spacing the more likely it becomes that one of the points used for the derivative computation will be affected by the presence of other constituents in the sample, and the question of the optimum spacing for the derivative computation becomes dependent on the nature of the sample.
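Locating the spacing that maximizes the first difference is a one-dimensional search, and it can be sketched numerically. The code below is an illustration under stated assumptions, not the authors' procedure: it models the band as exp(−½((X − μ)/σ)²), treats σ as the width parameter, takes the two points symmetrically about μ + σ, and uses a golden-section search in place of the unspecified "successive approximations". The value it returns depends on that band model and on how the 20 nm "bandwidth" is mapped to σ, so it need not coincide with the figure quoted above.

```python
import numpy as np


def band(x, mu=0.0, sigma=1.0):
    """Unit-height Normal band (assumed model)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)


def difference(delta, mu=0.0, sigma=1.0):
    """Ordinate difference of two points spaced delta apart, centered on mu + sigma."""
    center = mu + sigma
    return band(center - delta / 2, mu, sigma) - band(center + delta / 2, mu, sigma)


def golden_section_max(f, lo, hi, tol=1e-8):
    """Maximize a single-peaked function on [lo, hi] by repeatedly shrinking the interval."""
    inv_phi = (np.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while (b - a) > tol:
        if f(c) > f(d):
            b = d          # the maximum lies in [a, d]
        else:
            a = c          # the maximum lies in [c, b]
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2.0


sigma = 20.0  # assumption: the width parameter is taken as 20 nm
delta_opt = golden_section_max(lambda d: difference(d, sigma=sigma), 0.0, 10.0 * sigma)
print(f"optimum spacing = {delta_opt:.2f} nm = {delta_opt / sigma:.4f} sigma")
print(f"difference at optimum = {difference(delta_opt, sigma=sigma):.4f}")
```

Because the optimum scales with σ, expressing it as a multiple of σ makes the result independent of the particular bandwidth chosen.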

METHODS OF COMPUTING THE DERIVATIVE

The method we have used until now for estimating the derivative, simply calculating the difference between absorbance values of two data points spaced some distance apart (and dividing by that ΔX, of course), is probably the simplest method available. As we discussed in our previous chapter [4], however, there is a disadvantage associated with this method: it causes a decrease in the S/N as compared with the underlying absorbance band, and this decrease has two sources. The lesser source is the increase in the noise level due to the addition of variances that occurs when numbers are added or subtracted. The far larger effect is due to the fact that the derivatives are much smaller than the absorbance, and the second derivative is much smaller (by an order of magnitude) than the first. The net result is that the closer the theoretical approximation to the true derivative is, the noisier the actual computed derivative becomes.

Several methods have been devised to circumvent this characteristic of the process of taking derivatives. One of the very common methods is to reduce the initial noise of the spectrum by computing averages: averaging the spectral data over some number of wavelengths before estimating the derivative by calculating the difference between the resulting averages. This process is sometimes called “smoothing” since it smooths out the noise of the spectrum. However, since we are not discussing smoothing, we will not consider this any further here.

The next common method of computing derivatives is the use of Savitzky–Golay convolution functions. The application to spectroscopy is based on what is one of the most often-cited papers in the literature [5]. This classic paper presents the concept underlying this method for computing derivatives (including the zero-order derivative, which reduces to what is basically a smoothing operation); Figure 56-11a shows this diagrammatically. The assumption is that the mathematical nature of the underlying spectral curve is unknown, but can be represented over some finite region by a polynomial; “polynomial” in this sense is general and includes straight lines. If the equation for the polynomial is known, then the derivative of the spectrum can be calculated from the properties of the fitted polynomial. The key to all this is the fact that the nature of the polynomial can be calculated from the spectral data, by doing a least-squares fit of the polynomial to the data in the region of interest, as shown in Figure 56-11b. Figure 56-11a shows that various polynomials may be used to approximate the derivative curve at the point of interest, and Figure 56-11b shows that when the derivative curve is based on data that has error, the polynomials can be computed using a least-squares fit to the data. At the point for which the derivative is computed, all three lines in Figure 56-11 are tangent to each other.

The Savitzky–Golay approach provides for the use of varying numbers of data points in the computation of the fitting polynomial. We will discuss the effect of changing the number of data points shortly. So the steps that Savitzky and Golay took to create their classic paper were as follows:
1) Fit a polynomial curve of the desired type (degree) to the data, using least-squares curve fitting.
2) Compute the desired order of derivative of that polynomial.
3) Evaluate the expression for the derivative of that polynomial at the point for which the derivative is to be computed. In the Savitzky–Golay paper, this is the central point of the set used to fit the data. As we shall see, in general this need not be the case, although doing so simplifies the formulas and computations.
4) Convert those formulas into a set of coefficients that can be used to multiply the data spectrum by, to produce the value of the derivative according to the specified polynomial fit, at the point of the center of the set of data.
As we shall see, however, their paper ignores some key points.

Figure 56-11 The Savitzky–Golay method of computing derivatives is based on a least-squares fit of a polynomial to the data of interest. In both parts of this figure the underlying second derivative curve is shown as the black line, while the linear (first degree) and quadratic (second degree) polynomials are shown as mauve and blue lines respectively. Figure 56-11a: here we show linear and quadratic fits to a Normal spectral curve. Figure 56-11b: an expansion of Figure 56-11a shows how the polynomials are determined using a least-squares fit to the actual data in the region where the derivative is computed, when the data is contaminated with noise. Red dots represent the actual data. (see Color Plate 23)

And finally, while this work was all of very important theoretical interest, Savitzky and Golay took one more step that turned the theory into a form that could be easily put to practical use.

5) For a good number of sets of derivative orders, fitting polynomials and numbers of data points, they calculated and printed in their paper tables of the coefficients needed for the cases considered. Thus the practicing chemist needed to be neither a heavy-duty theoretician nor more than a minimal computer programmer in order to make use of the results produced.

Unfortunately there are also several caveats that have to go along with the use of the Savitzky–Golay results. The most important and also the best-known caveat is that there are errors in the tables in their paper. This was pointed out by Steinier [6] in a paper that is invariably cited along with the original Savitzky–Golay paper, and which should be considered a “must read” along with the original paper by anyone taking an interest in the Savitzky–Golay approach to computation of derivatives.

The Savitzky–Golay coefficients provide a simplified form of computation for the derivative of the desired order at a single point. To produce a derivative spectrum the coefficients must be applied successively to sets of spectral data, each set offset from the previous one by a single wavelength increment. This is known as the convolution of the two functions. Having done that, the result of all the theoretical development and computation is that the derivative spectrum so produced is simultaneously based on a smoothed version of the spectrum. The amount of smoothing depends on the number of data points used to compute the least-squares fit of the polynomial to the data; use of more data points is equivalent to performing more smoothing. Using higher-degree polynomials as the fitting function, on the other hand, is equivalent to using less smoothing, since high-order polynomials can twist and turn more to follow the details of the data.
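The fit, differentiate, evaluate, and tabulate steps described above can be sketched in a few lines. This is one way to obtain such coefficients by a direct least-squares solution, not a reproduction of the published tables; the function names are hypothetical, the window is assumed to be centered with unit spacing between points, and NumPy's pseudo-inverse stands in for the algebra worked out in the original paper.

```python
import math

import numpy as np


def sg_coefficients(n_points, degree, deriv):
    """Least-squares polynomial-fit coefficients for a centered window (unit spacing)."""
    if n_points % 2 == 0:
        raise ValueError("n_points must be odd so the window has a central point")
    if not 0 <= deriv <= degree < n_points:
        raise ValueError("need 0 <= deriv <= degree < n_points")
    half = n_points // 2
    offsets = np.arange(-half, half + 1, dtype=float)
    # Columns are 1, x, x^2, ... so the fitted polynomial is sum_k a_k * x**k.
    design = np.vander(offsets, degree + 1, increasing=True)
    # Row k of the pseudo-inverse expresses a_k as a weighted sum of the data.
    fit_rows = np.linalg.pinv(design)
    # The deriv-th derivative of the polynomial at x = 0 is deriv! * a_deriv.
    return math.factorial(deriv) * fit_rows[deriv]


def apply_coefficients(spectrum, coeffs):
    """Slide the coefficients along the spectrum, one wavelength increment at a time."""
    return np.correlate(np.asarray(spectrum, dtype=float), coeffs, mode="valid")


five_point_first = sg_coefficients(n_points=5, degree=2, deriv=1)
print(np.round(five_point_first, 4))

# A straight line with slope 3 per data point should give 3 everywhere.
line = 3.0 * np.arange(20) + 1.0
print(apply_coefficients(line, five_point_first))
```

Increasing `n_points` at fixed `degree` spreads the fit over more data, which is the extra smoothing described above; raising `degree` at fixed `n_points` lets the polynomial follow the data more closely.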

LIMITATIONS OF THE SAVITZKY–GOLAY METHOD

The publication of the Savitzky–Golay paper (augmented by the Steinier paper) was a major breakthrough in data analysis of chemical and spectroscopic data. Nevertheless, it does have some limitations, and some more caveats that need to be considered when using this approach.

One limitation is that the method as originally described is applicable only to computations using odd numbers of data points. This was implied earlier when we discussed the fact that a derivative (of any order) is computed at the central point (wavelength) of the set used.

Another limitation is that, also because of the computation being applicable to the central data point, there is an “end effect” to using the Savitzky–Golay approach: it does not provide for the computation of derivatives that are “too close” to the end of the spectrum. The reason is that at the end of the spectrum there is no spectral data to match up to the coefficients on one side or the other of the central point of the set of coefficients; therefore the computation at or near the ends of the spectrum cannot be performed.

Of course, an inherent limitation is the fact that only those combinations of parameters (derivative order, polynomial degree and number of data points) that are listed in the Savitzky–Golay/Steinier tables are available for use. While those cover what are likely to be the most common needs, anyone wanting to use a set of parameters beyond those supplied is out of luck.
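The end effect is easy to see in code. The sketch below assumes a centered window with an odd number of coefficients and marks the positions that cannot be computed with NaN; the function name and the choice of NaN as a placeholder are illustrative only.

```python
import numpy as np


def centered_filter(spectrum, coeffs):
    """Apply centered coefficients; positions too close to either end are left as NaN."""
    spectrum = np.asarray(spectrum, dtype=float)
    coeffs = np.asarray(coeffs, dtype=float)
    if len(coeffs) % 2 == 0:
        raise ValueError("an odd number of coefficients is required for a central point")
    half = len(coeffs) // 2
    out = np.full(spectrum.shape, np.nan)
    # Only positions with `half` neighbors on both sides can be computed.
    out[half:len(spectrum) - half] = np.correlate(spectrum, coeffs, mode="valid")
    return out


coeffs = np.array([-0.2, -0.1, 0.0, 0.1, 0.2])  # 5-point first-derivative coefficients
spectrum = np.arange(12, dtype=float) ** 2      # arbitrary 12-point test curve
result = centered_filter(spectrum, coeffs)
print(result)
print("points lost at each end:", len(coeffs) // 2)
```

With a window of m points, (m − 1)/2 positions are lost at each end of the spectrum, so wider windows shorten the usable derivative spectrum.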

A caveat to the use of the Savitzky–Golay tables is that, even after Steinier’s corrections, they apply only to a special case of data, and do not, in general, produce the correct value of the true derivative. The reason for this is similar to the problem we pointed out in our first chapter dealing with computation of derivatives [1]: applying the Savitzky–Golay coefficients to a set of spectral data is equivalent to assuming that the data is separated by unit X distance, and therefore is equivalent to computing only the numerator term of a finite difference computation, without taking into account the ΔX (spacing) to which the computed ΔY corresponds. Therefore, in order to compute the Savitzky–Golay estimate of a true derivative, the value computed using the Savitzky–Golay coefficients must be divided by (ΔX)^n, where n is the order of the derivative.

Another limitation is perhaps not so much a limitation as a strange characteristic, albeit one that can catch the unwary. To demonstrate, we consider the simplest S–G derivative function, that for the first derivative using a 5-point quadratic fitting function. The convolution coefficients (after including the normalization factor) are

−0.2  −0.1  0  0.1  0.2

Suppose we compute a second derivative by applying this first derivative function twice? The effect is easily shown to be equivalent to applying the convolution coefficients:

0.04  0.04  0.01  −0.04  −0.1  −0.04  0.01  0.04  0.04

a collection of nine coefficients that produces a second derivative, based on the S–G first derivative coefficients. However, this collection of convolution coefficients appears nowhere in the S–G tables. The nine-point S–G second derivative with a Quadratic or Cubic polynomial fit has the coefficients:

0.0606  0.0152  −0.0173  −0.0368  −0.0433  −0.0368  −0.0173  0.0152  0.0606

And the nine-point S–G second derivative with a Quartic or Quintic polynomial fit has the coefficients:

−0.8811  2.5944  1.0559  −1.4755  −2.5874  −1.4755  1.0559  2.5944  −0.8811

The original S–G paper [5] describes how to compute other S–G convolution coefficients from given ones; these other coefficients are also functions that follow the basic concepts of the S–G procedure: the derivative of a least-squares, best-fitting polynomial function. Since they do not produce the convolution coefficients we generated by applying the S–G first derivative coefficients twice, however, we are forced to the conclusion that even though the coefficients for the first derivative follow the S–G concepts, applying them twice (or multiple times) in succession does not produce a set of convolution coefficients that is part of the S–G collection of convolution functions. This seems to be generally true for the S–G convolution coefficients as a whole.
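Both points made above, the division by (ΔX)^n and the result of applying the first-derivative coefficients twice, can be checked directly. In this sketch the nine-point quadratic/cubic coefficients are written as integers over 462, which reproduce the four-decimal values quoted in the text; the test curve y = X² and the sampling interval of 0.5 are arbitrary choices made for illustration.

```python
import numpy as np

# 5-point quadratic first-derivative coefficients, as quoted in the text.
first_deriv = np.array([-0.2, -0.1, 0.0, 0.1, 0.2])

# 9-point quadratic/cubic second-derivative coefficients; these fractions
# round to the decimals 0.0606, 0.0152, -0.0173, -0.0368, -0.0433.
sg9_second = np.array([28, 7, -8, -17, -20, -17, -8, 7, 28]) / 462.0

# Applying the first-derivative coefficients twice is equivalent to applying
# their convolution with themselves once.
twice = np.convolve(first_deriv, first_deriv)
print("applied twice:", np.round(twice, 4))
print("S-G 9-point  :", np.round(sg9_second, 4))

# Test curve y = X^2 sampled every dx = 0.5, whose true second derivative is 2.
dx = 0.5
x = np.arange(0.0, 10.0, dx)
y = x ** 2

raw_twice = np.correlate(y, twice, mode="valid")      # assumes unit spacing
raw_sg9 = np.correlate(y, sg9_second, mode="valid")   # assumes unit spacing
print("raw values      :", raw_twice[0], raw_sg9[0])
print("divided by dx^2 :", raw_twice[0] / dx ** 2, raw_sg9[0] / dx ** 2)
```

On an exact parabola both sets of nine coefficients return the same second derivative once the division by (ΔX)² is applied, even though the coefficients themselves differ; the two sets only behave differently on data that a quadratic does not fit exactly, such as noisy spectra.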

EXTENSIONS TO THE SAVITZKY–GOLAY METHOD

Several extensions have been developed to the original concept. First we will consider those that do not change the fundamental structure of the Savitzky–Golay approach, but simply make it easier to use. The main development along this line is the elimination of the tables. On the one hand, tables of coefficients are easy to deal with conceptually, because they can be applied mechanically: just copy down the entries and use them to multiply the data. In fact, our initial foray into the world of Savitzky–Golay involved writing just such a program. The task was tedious, but having done it once and verified the numbers, it should never be necessary to do it again. However, as noted above, this approach has the inherent limitation of including only those conditions that are listed in the Savitzky–Golay tables; extensions to the derivative order, polynomial degree, or number of data points used are excluded.

Therefore an extension of this idea was presented in a paper by Hannibal Madden [7]. Instead of presenting the already-worked-out numbers, Madden derived formulas from which the coefficients could be computed, and presented a table of those formulas in that paper. This is definitely a step up, since it confers several advantages:

1) Through the use of these formulas, Savitzky–Golay convolution coefficients can be computed for a convolution function using any odd number of data points.
2) Since the coefficients are computed by the computer, there is no chance of typographical errors occurring in the coefficients.

Madden’s paper, however, also has limitations:

1) The paper contains formulas for only those derivative orders and degrees of polynomials that are contained in the original Savitzky–Golay paper; therefore we are still limited to those derivative orders and polynomial degrees.
2) The coefficients produced still contain the implicit assumption that ΔX = 1. Therefore to produce correct derivatives, it will still be necessary to divide the results from the formulas by (ΔX)^n, where n is the derivative order, as above.
3) The formulas are at least as complicated, difficult and tedious to enter as the tables they replace, and as fraught with the possibility of typos during their entry. This is exacerbated by the fact that, in a formula in a computer program, everything must be just so: all the parentheses and so on must be in the right places, which, for formulas as complicated as these, is not easy to do. Nevertheless, as with the tables, once it is done correctly it need not be done again (but make sure you back up your work!). However, for the real kick in the pants, see the next item on this list.
4) There is an error in one of the formulas! While writing the program to implement the formulas in Madden’s paper, despite the tedium, most of the formulas (ten of the eleven given) were working correctly in fairly short order; “correctly” in this case means that the coefficients agree with those of Savitzky–Golay or of Steinier, as appropriate. There was a problem with one of the formulas, however: the one for the third derivative using a quintic (fifth degree) polynomial


fitting function. The coefficients produced were completely unreasonable, as well as being wrong. The coding of the formula was checked a couple of ways. First, the formula was rewritten starting from scratch, using a different scheme to convert the printed formula to computer code; the same wrong answers were obtained both times. Then a buddy (Dave Hopkins), who was working with me on a project, was asked to check the coding; he reported not being able to find any discrepancies between the printed formula and what was coded. This left two possibilities: either the printed formula was wrong or the corresponding Steinier table was wrong.

We first tried to contact Hannibal Madden, since the paper gave his affiliation as Sandia National Laboratory, but he was no longer there and the Human Resources department had no information as to his current whereabouts. Finally the problem was posted to an on-line discussion group (the discussion group for the International Chemometrics Society), asking if anybody had information relating to this problem. Fortunately, Premek Lubal, one of the members of the group, had run into this problem previously while checking the derivations in Madden’s paper, and knew the solution (P. Lubal, 2002, private communication).

To save grief on the part of anybody who might want to code these formulas for themselves, here is the solution: in the formula for the case involved, the quintic fitting function for the third derivative, the term (50 ∗ m) has the wrong sign. The sign in the printed formula is negative (−), and it should be positive (+). After changing the sign of that term, the program produced the correct coefficients.

So now the question presents itself: is there a more general method of computing coefficients for any arbitrary combination of derivative order, polynomial degree and number of data points to fit?
That is, is there an automated method for computing Madden’s formulas, or at least the Savitzky–Golay convolution coefficients? The answer turns out to be Yes. From the same on-line discussion that produced the solution to the problem in the Madden paper, Chris Brown pointed out some pertinent literature citations [8, 9], and summarized them into the general solution that we discuss below (C. Brown, 2002, private communication).

Is the solution as “simple” as the tables in Savitzky–Golay/Steinier or the formulas in Madden? This is a matter of perception. If this general solution had been presented to the chemical/spectroscopic community in 1964 (at the time of the original Savitzky–Golay paper), it would have been considered far beyond what a “mere” chemist would be expected to know, and would never have gained the popularity it currently enjoys. With the advent of modern software tools such as MATLAB, and even the older language APL, matrix operations can be coded directly from the matrix-math expressions. It then becomes near-trivial to create and solve the matrix equations on-the-fly, so to speak, and to calculate the coefficients for any derivative, using any desired polynomial, computed over any odd number of data points.

Wentzell et al. [9] present this scheme in a very clear way, the same way that Chris Brown gave it to me. We start by creating a matrix. This matrix is based on the indices of the coefficients that are ultimately to be produced. Savitzky and Golay labeled the coefficients in relation to the central data point of the convolution; therefore a three-term set of coefficients is labeled −1, 0, 1. A five-term set is labeled −2, −1, 0, 1, 2 and so forth.


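The matrix (M) is set up like this table (this, of course, is only one example, for expository purposes):

$$
\mathbf{M} =
\begin{bmatrix}
1 & -3 & 9 & -27 \\
1 & -2 & 4 & -8 \\
1 & -1 & 1 & -1 \\
1 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 \\
1 & 2 & 4 & 8 \\
1 & 3 & 9 & 27
\end{bmatrix}
\tag{56-22}
$$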

What are the key characteristics of this matrix that we need to know? The first one is that each column of the matrix contains the set of index numbers raised to the n − 1 power, where n is the column number in the table. Thus the first column contains the zeroth power, which is all 1s, the second column contains the first power, which is the set of index numbers themselves, and the rest of the columns are the second and third powers of the index numbers.

What determines the number of rows and columns? The number of rows is determined by the number of coefficients that are to be calculated. In this example, therefore, we will compute a set (sets, actually, as we will see) of seven coefficients. The number of columns is determined by the degree of the polynomial that will be used as the fitting function: it is one more than the degree, since the powers run from zero up to the degree. The number of columns also determines the maximum order of derivative that can be computed. In our example we will use a third-power fitting function, giving four columns, and we can produce up to a third derivative. As we shall see, coefficients for lower-order derivatives are also computed simultaneously. The matrix M is then used as the argument for the following matrix equation:

$$
\text{Coefficients} = \left(\mathbf{M}^{T}\mathbf{M}\right)^{-1}\mathbf{M}^{T}
\tag{56-23}
$$

where, by convention, the boldface M refers to the matrix we produced, the superscript T refers to the transpose of the matrix and the superscript −1 means the matrix inverse of the argument. Let us evaluate this expression. The matrix M is given above, as equation 56-22. The transpose, then, is

$$
\mathbf{M}^{T} =
\begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 \\
-3 & -2 & -1 & 0 & 1 & 2 & 3 \\
9 & 4 & 1 & 0 & 1 & 4 & 9 \\
-27 & -8 & -1 & 0 & 1 & 8 & 27
\end{bmatrix}
\tag{56-24}
$$

We then need to multiply these two matrices together to form MᵀM (rules for matrix multiplication are given in many books, including [10]):

$$
\mathbf{M}^{T}\mathbf{M} =
\begin{bmatrix}
7 & 0 & 28 & 0 \\
0 & 28 & 0 & 196 \\
28 & 0 & 196 & 0 \\
0 & 196 & 0 & 1588
\end{bmatrix}
\tag{56-25}
$$
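The construction of equations 56-22 through 56-25 can be sketched in a few lines of code. The following is a minimal illustration in plain Python rather than MATLAB; the function names `design_matrix` and `gram_matrix` are our own hypothetical labels, not anything taken from the references.

```python
def design_matrix(half_width, degree):
    """Build M: one row per index -m..m, column n holds index**n (n = 0..degree)."""
    indices = range(-half_width, half_width + 1)
    return [[i ** n for n in range(degree + 1)] for i in indices]


def gram_matrix(M):
    """Return the product M^T M as a list of lists."""
    columns = list(zip(*M))  # the columns of M are the rows of M^T
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns]
            for ci in columns]


M = design_matrix(3, 3)   # seven points, third-power fitting function
MtM = gram_matrix(M)
for row in MtM:
    print(row)
```

Because the index values are integers, every entry of MᵀM is an exact integer: the sum of the index values raised to the power (row + column − 2).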


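Then we compute the matrix inverse of equation 56-25 (in MATLAB, this is just inv(m), where m holds the matrix of equation 56-25):

$$
\left(\mathbf{M}^{T}\mathbf{M}\right)^{-1} =
\begin{bmatrix}
0.333333 & 0 & -0.0476190 & 0 \\
0 & 0.262566 & 0 & -0.0324074 \\
-0.0476190 & 0 & 0.01190476 & 0 \\
0 & -0.0324074 & 0 & 0.00462962
\end{bmatrix}
\tag{56-26}
$$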

Finally, multiplying equation 56-26 by equation 56-24 gives

$$
\left(\mathbf{M}^{T}\mathbf{M}\right)^{-1}\mathbf{M}^{T} =
\begin{bmatrix}
-0.095238 & 0.142857 & 0.285714 & 0.333333 & 0.285714 & 0.142857 & -0.095238 \\
0.087301 & -0.265873 & -0.230158 & 0 & 0.230158 & 0.265873 & -0.087301 \\
0.0595238 & 0 & -0.0357142 & -0.0476190 & -0.0357142 & 0 & 0.0595238 \\
-0.0277777 & 0.0277777 & 0.0277777 & 0 & -0.0277777 & -0.0277777 & 0.0277777
\end{bmatrix}
\tag{56-27}
$$

Equation 56-27 contains the (as yet unscaled) coefficients for the zeroth through third derivative convolution functions, using a third degree polynomial fitting function. The first row of equation 56-27 contains the coefficients for smoothing, the second row contains the coefficients for the first derivative, and so forth. Equation 56-27 gives the coefficients, but there is a scaling factor missing. Therefore there is one more computation that needs to be performed to create the correct coefficients: each row must be multiplied by the scaling factor. The scaling factor is (p − 1)!, where p is the row number. Therefore the scaling factors for the first two rows are unity, since 0! and 1! are both unity; the scaling factor for the third row is two and for the fourth row is six. The final set of coefficients therefore is

$$
\left(\mathbf{M}^{T}\mathbf{M}\right)^{-1}\mathbf{M}^{T}\ \text{(corrected for scaling)} =
\begin{bmatrix}
-0.095238 & 0.142857 & 0.285714 & 0.333333 & 0.285714 & 0.142857 & -0.095238 \\
0.087301 & -0.265873 & -0.230158 & 0 & 0.230158 & 0.265873 & -0.087301 \\
0.119047 & 0 & -0.0714285 & -0.095238 & -0.0714285 & 0 & 0.119047 \\
-0.166666 & 0.166666 & 0.166666 & 0 & -0.166666 & -0.166666 & 0.166666
\end{bmatrix}
\tag{56-28}
$$

Finally, for those who are facile with the matrix math, Bialkowski [8] also shows how the “end effect” can be obviated, as well as allowing the use of even numbers of data points, but the advanced considerations involved are beyond the scope of our chapter.

When this material was first published as an article, our respondents pointed out that the magnitudes of the various derivatives, and especially the relative magnitudes of derivatives of different orders, depend on the units used, particularly the units used to describe the X-axis. Now, while in fact we did not specify any units in our discussion (see, e.g., Figure 54-1 in [1], where the X-axis contains only the label “Wavelength”), given our backgrounds it is true enough that we implicitly had nanometers in mind for our X-units. In the case of real spectra, however, if spectra were measured using, say, microns as the units for the X-axis, the same spectrum would have a calculated value for the first derivative that was 1000 times what would be calculated for an “nm-based” derivative. In that case, the first derivative (for a 10 nm wide band, which would be a


0.01 micron wide band) would be 100 times greater than the maximum spectral value, rather than being 1/10 of it, as the value computed using nanometers for the X-scale came out to. The second derivative would then be 10^6 times what we calculated and therefore 10,000 times greater than the maximum spectral value, instead of being 1/100 of it, the value we showed.

In principle this is all correct. In practice, however, if we ignore FTIR and specialty technologies such as AOTF, then the vast majority of instruments in use today for modern NIR spectroscopy (still primarily diffraction grating based instruments) use nanometers as their wavelength unit, and usually collect data at some small integer number of nanometers. Furthermore, the vast majority of those have a 10-nm bandpass, so that 10 nm is the minimum bandwidth that would be measured. Also, even for instruments with higher resolution, the natural bandwidths of many, or even most, absorbance bands of materials that are commonly measured are greater than 10 nm in the NIR. Given all this, the use of a 10 nm figure to represent a “typical” NIR absorbance band is not unrealistic, and gives the reader a realistic assessment of what a “typical” user can expect from the NIR spectra he measures, and their derivatives. The choice of units, of course, does not affect the instrumental characteristic of signal-to-noise, which is what is important, and which we discuss in Chapter 57 [11].

If we consider FTIR instrumentation then the situation is trickier, since the equivalent resolution in nm varies across the spectrum. But even keeping the spectrum in its “natural” wavenumber units, we again find that, except for rotational fine structure of gases, the natural bandwidth of many (most) absorbance bands is greater than 10 wavenumbers. So again, using that figure shows the “typical” user how he can expect his own measured spectra to behave.
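The whole procedure of equations 56-22 through 56-28, together with the division by (ΔX)^n that accounts for the X-axis units, can be sketched as follows. This is only an illustrative sketch, not code from any of the cited papers: the function name `savgol_coefficients` and its parameters are our own assumptions, and exact rational arithmetic (Python's `fractions` module) is used in place of a floating-point matrix inverse so that the results can be compared directly with the integer tables.

```python
from fractions import Fraction
from math import factorial


def savgol_coefficients(half_width, degree, delta_x=1):
    """Return one row of convolution coefficients per derivative order.

    Row p (row 0 is smoothing) is p! * [(M^T M)^-1 M^T]_p / delta_x**p,
    for 2*half_width + 1 points and a polynomial of the given degree.
    """
    idx = range(-half_width, half_width + 1)
    if len(idx) <= degree:
        raise ValueError("need more data points than the polynomial degree")
    M = [[Fraction(i) ** n for n in range(degree + 1)] for i in idx]
    Mt = [list(col) for col in zip(*M)]
    k = degree + 1
    # Augment M^T M with M^T; Gauss-Jordan elimination then leaves
    # (M^T M)^-1 M^T in the right-hand block.
    aug = []
    for r in range(k):
        gram_row = [sum(a * b for a, b in zip(Mt[r], Mt[c])) for c in range(k)]
        aug.append(gram_row + Mt[r])
    for col in range(k):
        pivot = next(r for r in range(col, k) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    dx = Fraction(delta_x)
    return [[factorial(p) * v / dx ** p for v in aug[p][k:]] for p in range(k)]


for order, row in enumerate(savgol_coefficients(3, 3)):
    print(order, [round(float(v), 6) for v in row])
```

Passing, for example, `delta_x=Fraction(1, 1000)` (1 nm spacing expressed in microns) multiplies the first-derivative row by 1000 and the second-derivative row by 10^6, which is the unit dependence described above.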
We thank Todd Sauke, Peter Watson, and (again) Colin Christy for pointing out the errors and for general comments and discussion.

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 18(4), 32–37 (2003).
2. Mark, H. and Workman, J., Spectroscopy 18(9), 25–28 (2003).
3. Mark, H. and Workman, J., Spectroscopy 3(8), 13–15 (1988).
4. Mark, H. and Workman, J., Spectroscopy 18(12), 106–111 (2003).
5. Savitzky, A. and Golay, M.J.E., Analytical Chemistry 36(8), 1627–1639 (1964).
6. Steinier, J., Termonia, Y. and Deltour, J., Analytical Chemistry 44(11), 1906–1909 (1972).
7. Madden, H.H., Analytical Chemistry 50(9), 1383–1386 (1978).
8. Bialkowski, S.E., Analytical Chemistry 61(11), 1308–1310 (1989).
9. Wentzell, P.D. and Brown, C.D., “Signal Processing in Analytical Chemistry”; in Encyclopedia of Analytical Chemistry, Meyers, R. A. (Ed.) (John Wiley & Sons, Chichester, 2000), pp. 9764–9800.
10. Mark, H. and Workman, J., Statistics in Spectroscopy, 1st ed. (Academic Press, New York, 1991).
11. Mark, H. and Workman, J., Spectroscopy 19(1), 44–51 (2004).


57

Derivatives in Spectroscopy: Part 4 – Calibrating with Derivatives

Chapters 54–56 [1–3] contained discussion of the theoretical aspects of using derivatives in the analysis of spectroscopic data, followed by a discussion of the development of the Savitzky–Golay method of using convolution functions to compute derivatives, concluding with the presentation of a general method to create the set of convolution coefficients for any desired order of derivative, using any degree of polynomial fitting function and number of data points.

When performing quantitative calibrations using a derivative transform, several possible problems can arise. We have already noted that one of these is the possibility that the data used to compute the derivative will be affected by interfering materials. There is little we can do in a chapter such as this to deal with such arbitrary and sample-dependent issues. Therefore we will concentrate on those issues which are amenable to mathematical analysis; this consists mostly of the behavior of the computed derivative when there is noise on the data.

Most of our discussion so far has centered on the use of the two-point-difference method of computing an approximation to the true derivative, but since we have already brought up the Savitzky–Golay method, it is appropriate here to consider both ways of computing derivatives when considering how they behave when used for quantitative calibration purposes. In fact, the two-point method can be considered a special case of the more general S–G concept, since it can be considered the application of the set of convolution coefficients −1, 0, 1 to the data. Of course, these convolution coefficients were created ad hoc, and not according to the general scheme that produces the S–G set. Nevertheless, it is convenient to group them together for the purpose of further examination. We are also indebted to David Hopkins for invaluable discussions concerning the properties of the S–G convolution coefficients (D. Hopkins, 2002, personal communication).

In our previous chapter we derived the expressions for the first and second derivatives of both the Normal and Lorentzian band shapes [1]. For the following discussion, however, we will address only the Normal case; as we will see, the Lorentzian case will parallel it closely.

In that previous chapter, we used the standard generic formula for the Normal distribution, ignoring the aspect of using it to describe the situation for quantitative analysis. The quantity of concern now is the S/N of the data that we will use to perform the calibration calculations. In order to deal with this systematically, the S/N must now be divided into two parts: the magnitude of the signal, and the magnitude of the noise. Then different situations can be compared by independently computing the signal and noise contributions to the final S/N that is operative on the calibration.

We start with the simpler case, the signal. By investigating the behavior of the theoretical, ideal derivative, we avoid issues having to do with the different ways an


approximation to the derivative can be obtained. The various approximations that can be obtained through the use of constructs such as the Savitzky–Golay convolutions allow us to make tradeoffs between maximizing the signal, faithfully reconstructing the true derivative, and creating artifacts, but these issues are all obviated by considering the behavior of the theoretically ideal case. When we come to consider the noise, then as we shall see, the nature of the approximating method becomes very important, but for now we will ignore that.

If the concentration of a material can vary, however, then according to Beer’s law, the absorbance at any given wavelength will also be proportional to C, the concentration. Therefore to take the concentration into account we must modify equations 54-1a, 54-6a and 54-9a (found in Chapter 54), including changing the generic Y variable to A, to indicate absorbance, to

$$
A = C\,\frac{1}{\sigma(2\pi)^{1/2}}\,e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}}
\tag{57-29}
$$

Whereupon the first derivative becomes

$$
\frac{dA}{dX} = C\,\frac{-(X-\mu)}{\sigma^{3}(2\pi)^{1/2}}\,e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}}
\tag{57-30}
$$

And the second derivative is

$$
\frac{d^{2}A}{dX^{2}} = C\left[\frac{(X-\mu)^{2}}{\sigma^{5}(2\pi)^{1/2}} - \frac{1}{\sigma^{3}(2\pi)^{1/2}}\right]e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}}
\tag{57-31}
$$

The “signal” part of the S/N ratio that concerns us is the way these expressions vary with the concentration of the analyte. Therefore, from equation 57-29 we obtain, for the absorbance signal:

$$
\frac{dA}{dC} = \frac{d}{dC}\left[C\,\frac{1}{\sigma(2\pi)^{1/2}}\,e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}}\right]
= \frac{1}{\sigma(2\pi)^{1/2}}\,e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}}
\tag{57-32}
$$

For the first derivative we obtain

$$
\frac{d}{dC}\left(\frac{dA}{dX}\right) = \frac{d}{dC}\left[C\,\frac{-(X-\mu)}{\sigma^{3}(2\pi)^{1/2}}\,e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}}\right]
= \frac{-(X-\mu)}{\sigma^{3}(2\pi)^{1/2}}\,e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}}
\tag{57-33}
$$

And for the second derivative we obtain

$$
\frac{d}{dC}\left(\frac{d^{2}A}{dX^{2}}\right) = \frac{d}{dC}\left\{C\left[\frac{(X-\mu)^{2}}{\sigma^{5}(2\pi)^{1/2}} - \frac{1}{\sigma^{3}(2\pi)^{1/2}}\right]e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}}\right\}
= \left[\frac{(X-\mu)^{2}}{\sigma^{5}(2\pi)^{1/2}} - \frac{1}{\sigma^{3}(2\pi)^{1/2}}\right]e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}}
\tag{57-34}
$$

As we see from these equations, we have recovered the original expressions for the absorbance and the derivatives with respect to wavelength. The expression we used


for the Normal curve was the constant-area expression, but the continuation of the derivation for the change of the signal with respect to concentration will follow for the constant-height case, and for the Lorentzian curve, also. As we saw in the previous chapter [1], when compared to the rate of change of the absorbance, the maximum value of the first derivative decreases as σ² (i.e., σ³ for the derivative divided by σ for the absorbance) and the second derivative similarly decreases as σ⁴, and therefore their derivatives with respect to concentration (which is the sensitivity to concentration changes) also decrease that way.

Therefore we now turn to the “noise” part of the S/N ratio. As we saw just above, the two-point derivative approximation can be put into the framework of the S–G convolution functions, and we will therefore not treat them as separate methods. We have derived previously [4, 5] that the following expression relates the noise on data to the noise of a constant multiple of that data:

$$
\mathrm{Var}(aX) = a^{2}\,\mathrm{Var}(X)
\tag{57-35}
$$

and, of course, we know that variances add. Therefore, if we have several variables, each of them contaminated with some noise (whose variance is Var(X)), and they are multiplied by some constants, then the variance of the result is

$$
\mathrm{Var}(X_{\mathrm{net}}) = a_{1}^{2}\,\mathrm{Var}(X) + a_{2}^{2}\,\mathrm{Var}(X) + a_{3}^{2}\,\mathrm{Var}(X) + \cdots
\tag{57-36}
$$

Therefore, if X represents the spectrum, the various a_i represent convolution coefficients and Var(X) represents a noise source that gives a constant noise level to the spectral values, then equation 57-36 gives the noise variance expected to be found on the computed resultant value, whether that is a smoothed spectral value, or any order derivative computed from a Savitzky–Golay convolution. For a more realistic computation, an interested (and energetic) reader may wish to compute and use the actual noise that will occur on a spectrum, from the information determined in the previous chapters [6–7], instead of using a constant-noise model. But for our current purposes we will retain the constant-noise model; then equation 57-36 can be simplified slightly:

$$
\mathrm{SD}(X_{\mathrm{net}}) = \mathrm{SD}(X)\,\sqrt{a_{1}^{2} + a_{2}^{2} + a_{3}^{2} + \cdots}
\tag{57-37}
$$

The expression under the radical gives the multiplying factor for the noise standard deviation for the computed derivative (or smoothed spectrum, but that is not our topic here; we will address only the question of the effect on derivatives), and can be computed solely from the convolution coefficients themselves, independently of the effect of the convolution on the “signal” part of the S/N ratio.

The nature of the convolution function matters, however, and so do the details of the way it is computed. To see this, let us begin by considering the two-point derivative we have been dealing with in most of this sub-series of chapters. For our first examination of the effect, let us consider that we are computing the derivative from adjacent data points spaced 1 nm apart (such as in our initial discussion of derivatives [1]). As we mentioned, the two-point first derivative is equivalent to using the convolution function {−1, 1}. We also treated this in our previous chapter, but it is worth repeating here. Therefore the multiplying factor of the spectral noise variance is (−1)² + 1² = 2,


and the multiplying factor for the noise standard deviation is √2. Similarly, the second derivative based on adjacent data points is equivalent to a convolution function of {1, −2, 1}, making the multiplying factor for the standard deviation of the derivative calculated this way equal to √6. Since we have noted above that the magnitude of the “signal” part of the S/N ratio, d(dⁿA/dXⁿ)/dC, decreases with increasing derivative order, at this point it would appear that since the signal decreases and the noise increases when you take a derivative, you wind up losing from both parts of the S/N ratio.

But things are not so simple. In this examination we have so far looked only at a derivative calculated from adjacent data points. What happens when we calculate a two-point derivative based on non-adjacent data points? In fact we have already considered this question qualitatively in our previous chapter [3], when we noted that using the optimum spacing will result in an improved S/N ratio for the derivative. Of course, “improved” in this case is in comparison to the derivative computed using adjacent data points; it must be determined on a case-by-case basis whether the improvement is sufficient to exceed that of the actual direct absorbance signal.

The improvement can also be expressed semi-quantitatively in a graph, as we do in Figure 57-12. Here we show the true spectrum as the straight line representing the true derivative, and the measured absorbance data as the large Xs. Since the measured data are contaminated with random noise, they do not fall on the line representing the true spectrum. The diagram is set up in such a way, however, that the “noise” on the data from the two wavelengths representing spacing = 1 and spacing = 2 is the same. It is clear from this diagram that the computed approximation to the true derivative is better for the case of spacing = 2, even though the noise is the same. There are several ways to express this in words.
One way is to note that the error is “spread” over a larger X distance, and therefore has less effect at any one point. Another way is to note that for a derivative computation, the effective “signal” is the value of

[Figure 57-12 labels: True derivative; Deriv error 1; Deriv error 2; ΔX = 1; ΔX = 2]

Figure 57-12 This diagram shows how, as the spacing at which the derivative is computed increases, the error in the approximation to the true derivative decreases, even for the same error in the data.


ΔY, and when ΔX = 2, ΔY is double the value of ΔY when ΔX = 1. Since the noise is the same, the S/N therefore improves with an increase in the spacing. We learned in our prior chapter [3], however, that the improvement is linear with spacing only at very small values of ΔX; at large values it decreases, levels off, and then eventually starts to get worse again.

From a mathematical point of view, we can let δX be the increment between adjacent measurement wavelengths. Then ΔX = n × δX, where n is the number of wavelength increments over which the derivative is calculated. Then,

$$
\text{computed derivative} = \frac{\Delta Y}{\Delta X} = \frac{\Delta Y}{n\,\delta X}
\tag{57-38}
$$

And applying equation 57-35 to find the variance of the computed derivative we obtain

$$
\mathrm{Var}(\text{derivative}) = 2\,\mathrm{Var}\!\left(\frac{Y}{n\,\delta X}\right)
\tag{57-39}
$$

where the multiplier of two comes from the fact that a derivative is calculated from two data points, as we just showed in the above discussion. Since δX is a constant (with an assumed value of unity), and therefore its variance is zero, equation 57-39 becomes:

$$
\mathrm{Var}(\text{derivative}) = \frac{2}{n^{2}}\,\mathrm{Var}(Y)
\tag{57-40}
$$

Converting to standard deviations:

$$
\mathrm{SD}(\text{derivative}) = \frac{\sqrt{2}}{n}\,\mathrm{SD}(Y)
\tag{57-41}
$$
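Equation 57-41 can be checked numerically with a small simulation. The sketch below is only an illustration under stated assumptions: the noise on each data point is taken to be independent and Normally distributed, the wavelength increment is taken as unity, and the function names are our own.

```python
import math
import random


def two_point_noise_sd(sd_y, n):
    """Noise SD of a two-point derivative spanning n increments (eq. 57-41)."""
    return math.sqrt(2.0) * sd_y / n


def simulated_noise_sd(sd_y, n, trials=50_000, seed=0):
    """Estimate the same SD by differencing pairs of simulated noisy points."""
    rng = random.Random(seed)
    values = [(rng.gauss(0.0, sd_y) - rng.gauss(0.0, sd_y)) / n
              for _ in range(trials)]
    mean = sum(values) / trials
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (trials - 1))


for n in (1, 2, 4):
    print(n, two_point_noise_sd(1.0, n), simulated_noise_sd(1.0, n))
```

The simulated values should agree with the analytic ones to within sampling error, and both fall in inverse proportion to the spacing n.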

A similar expression can be developed for the second derivative, but we leave that as an exercise for the reader.

We turn now to the effect of using the Savitzky–Golay convolution functions. Table 57-1 presents a small subset of the convolutions from the tables. Since the tables were fairly extensive, the entries were scaled so that all of the coefficients could be presented as integers; we have previously seen this. The nature of the values involved makes the entries difficult to compare directly; therefore we recomputed them, dividing out the normalization factors to obtain the actual direct coefficients, which are more easily comparable. We present these in Table 57-2. For Table 57-2 we also computed the sums of the squares of the coefficients and present them in the last row.

One trend is obvious: the more data included in the computation, the smaller the variance multiplying factor. This is expected for the case of smoothing; we know that the more data included in even an ordinary running smooth (i.e., a running arithmetic average), the smaller the variance of the smoothed (averaged) result (the standard deviation reducing as the square root of the number of data points included in the average). Therefore it is not surprising to find it also happening with a weighted average, such as we find with a Savitzky–Golay smooth. We see a similar effect from the first derivative; this can also be considered to be extended from the case of the two-point derivative, where we showed above that the


Table 57-1 Some of the Savitzky–Golay convolution coefficients using a quadratic fitting function

Index           5-point   7-point   9-point   5-point     7-point     9-point
                smooth    smooth    smooth    1st deriv   1st deriv   1st deriv
−4                                  −21                               −4
−3                        −2        14                    −3          −3
−2              −3        3         39        −2          −2          −2
−1              12        6         54        −1          −1          −1
0               17        7         59        0           0           0
1               12        6         54        1           1           1
2               −3        3         39        2           2           2
3                         −2        14                    3           3
4                                   −21                               4
Normal. factor  35        21        231       10          28          60

Table 57-2 The Savitzky–Golay convolution coefficients multiplied out. All coefficients are for a quadratic fitting function. See text for meaning of SSK

Index   5-point   7-point   9-point   5-point     7-point     9-point
        smooth    smooth    smooth    1st deriv   1st deriv   1st deriv
−4                          −0.0909                           −0.06666
−3                −0.0952   0.06060               −0.10714    −0.05
−2      −0.0857   0.14285   0.16883   −0.2        −0.07142    −0.03333
−1      0.34285   0.28571   0.23376   −0.1        −0.03571    −0.01667
0       0.48571   0.33333   0.25541   0           0           0
1       0.34285   0.28571   0.23376   0.1         0.03571     0.01667
2       −0.0857   0.14285   0.16883   0.2         0.07142     0.03333
3                 −0.0952   0.06060               0.10714     0.05
4                           −0.0909                           0.06666
SSK     0.48571   0.333333  0.25541   0.10        0.03571     0.016667

farther apart the points used are, the smaller the variance of the resulting derivative value. In the case of the Savitzky–Golay convolution functions, however, the mechanism leading to the reduced variance is slightly different from that of the two-point derivative. In the S–G case, the reduced variance is caused by both the use of a wider wavelength range for the derivative computation and the implicit smoothing effect of computing the function over multiple data points, just as it is in the case of explicit smoothing.

There are several directions in which the convolutions can be varied. One is to increase the amount of data used, by using longer convolution functions as we demonstrated above. Another is to increase the degree of the fitting polynomial, and the third is to compute higher-order derivatives. In Table 57-3, we present a very small selection of the effect of potential variations.
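The sums of squared coefficients in the last row of Table 57-2 can be recomputed directly from the integer coefficients and normalization factors of Table 57-1. The following sketch does this with exact fractions; the dictionary and function names are our own, and the 9-point smooth is entered in its symmetric form (54 at index −1), which is the value consistent with the 0.23376 shown in Table 57-2.

```python
from fractions import Fraction

# Integer coefficients and normalization factors (quadratic fitting function).
TABLE_57_1 = {
    "5-point smooth": ([-3, 12, 17, 12, -3], 35),
    "7-point smooth": ([-2, 3, 6, 7, 6, 3, -2], 21),
    "9-point smooth": ([-21, 14, 39, 54, 59, 54, 39, 14, -21], 231),
    "5-point first derivative": ([-2, -1, 0, 1, 2], 10),
    "7-point first derivative": ([-3, -2, -1, 0, 1, 2, 3], 28),
    "9-point first derivative": ([-4, -3, -2, -1, 0, 1, 2, 3, 4], 60),
}


def ssk(coefficients, norm):
    """Sum of squared coefficients after dividing by the normalization factor."""
    return sum(Fraction(c, norm) ** 2 for c in coefficients)


for name, (coefficients, norm) in TABLE_57_1.items():
    value = ssk(coefficients, norm)
    print(f"{name}: SSK = {value} = {float(value):.6f}")
```

For the 5-point first derivative, for example, the sum is (4 + 1 + 0 + 1 + 4)/100 = 0.10. For each of the smoothing functions the SSK works out equal to the central coefficient, which is why the same numbers appear twice in those columns of Table 57-2.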


Table 57-3 More Savitzky–Golay convolution coefficients. See text for meaning of SSK

Index   7-point smoothing with     5-point first derivative     5-point second derivative,
        quartic fitting function   with cubic fitting function  quadratic fitting function
−3      0.02164
−2      −0.12987                   0.083333                     0.2857
−1      0.32467                    −0.66667                     −0.14285
0       0.56709                    0                            −0.2857
1       0.32467                    0.666667                     −0.14285
2       −0.12987                   −0.083333                    0.2857
3       0.02164
SSK     0.5670                     0.9027                       0.2857

What can we learn from Table 57-3? We can compare those sums of squared coefficients with the corresponding ones in Table 57-2 using the same number of data points, and either:

1) The same order derivative with a lower-degree fitting polynomial, or
2) The same degree polynomial, for a lower-order derivative.

For comparison 1, we find two cases: 7-point smooth with quadratic versus quartic fitting function, and 5-point first derivative with quadratic versus cubic fitting function. From these two comparisons we find that the noise multiplier of the derivative (of the same order and number of data points) increases as the degree of the fitting function increases.

For comparison 2, we find one case: five-point first derivative versus five-point second derivative, both using a quadratic fitting function. Here again, the noise multiplier increased with increasing derivative order.

In fact, we see that the five-point first derivative using a cubic fitting function will have almost as high a noise level as the original data. Couple this with the fact we saw above, that the sensitivity to concentration of the first derivative is reduced compared to the sensitivity of the absorbance data itself, and we see that in this particular case, depending on the value of σ for the absorbance band, use of this form of computing the derivative may be worse than using the absorbance data, while using a different computation, such as a quadratic fitting function, may be better than the absorbance data. Therefore, whether using derivatives is beneficial or detrimental depends very much on the particular computation, and must be determined on a case-by-case basis.

For this reason, the reader will find it another very interesting exercise to compute the sums of the squares of the coefficients for several of the sets of coefficients, extending these results to both higher order derivatives and higher degree polynomials, to ascertain their effect on the variance of the computed derivative for extended versions of these tables. Hopkins [8] has performed some of these computations, and has also coined the term “RSSK/Norm” for the √Σ(coeff/Normalization factor)² in the S–G tables. Since here we pre-divide the coefficients by the normalization factors, and we are not taking the square roots, we use the simpler term SSK (sum squared coefficients) for our equivalent quantity. Hopkins in the same paper has also demonstrated how the two-point


computation of derivatives can also have an equivalent value of the RSSK/Norm, with results essentially equivalent to the ones we present above. Table 57-3 in [8], particularly, shows how differences in the application of the derivative computation can cause the noise level of the computed derivative to be either greater or less than the noise of the absorbance spectrum from which they are computed.
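As a starting point for the exercise suggested above, the SSK values of Table 57-3 and the corresponding noise standard deviation multipliers (the square root of SSK, per equation 57-37) can be computed as in this minimal sketch. The coefficient values are copied from Table 57-3 as printed, so the results are only as precise as those rounded entries; the names used are our own.

```python
import math

# Coefficients as printed in Table 57-3 (already divided by their normalization factors).
TABLE_57_3 = {
    "7-point smooth, quartic": [0.02164, -0.12987, 0.32467, 0.56709,
                                0.32467, -0.12987, 0.02164],
    "5-point first derivative, cubic": [0.083333, -0.66667, 0.0,
                                        0.666667, -0.083333],
    "5-point second derivative, quadratic": [0.2857, -0.14285, -0.2857,
                                             -0.14285, 0.2857],
}


def ssk(coefficients):
    """Sum of squared coefficients: the noise variance multiplier."""
    return sum(c * c for c in coefficients)


def noise_multiplier(coefficients):
    """Noise standard deviation multiplier, the radical in equation 57-37."""
    return math.sqrt(ssk(coefficients))


for name, coefficients in TABLE_57_3.items():
    print(f"{name}: SSK = {ssk(coefficients):.4f}, "
          f"SD multiplier = {noise_multiplier(coefficients):.4f}")
```

A multiplier below unity means the computed value carries less noise than the absorbance data it was computed from; whether the S/N improves still depends on how much the signal is reduced, as discussed above.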

ACKNOWLEDGEMENT The authors thank David Hopkins for valuable discussions regarding several aspects of the behavior of Savitzky–Golay derivatives, and also for making sure we spelled “Savitzky” and “Steinier” correctly!

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 18(4), 32–37 (2003).
2. Mark, H. and Workman, J., Spectroscopy 18(9), 25–28 (2003).
3. Mark, H. and Workman, J., Spectroscopy 18(12), 106–111 (2003).
4. Mark, H. and Workman, J., Spectroscopy 3(8), 13–15 (1988).
5. Mark, H. and Workman, J., Statistics in Spectroscopy, 1st ed. (Academic Press, New York, 1991).
6. Mark, H. and Workman, J., Spectroscopy 15(10), 24–25 (2000).
7. Mark, H. and Workman, J., Spectroscopy 18(1), 38–43 (2003).
8. Hopkins, D., Near Infrared Analysis 2(1–13) (2001).

58

Comparison of Goodness of Fit Statistics for Linear Regression: Part 1 – Introduction

The scope of this chapter-formatted mini-series is to provide statistical tools for comparing two columns of data, X and Y. With respect to analytical applications, such data may be represented for simple linear regression as the concentration of a sample (X) versus an instrument response when measuring the sample (Y). X and Y may also denote a comparison of the reference analytical results (X) versus predicted results (Y) from a calibrated instrument. At other times one may use X and Y to represent the instrument response (X) to a reference value (Y). Whatever data pairs one is comparing as X and Y, there are several statistical tools that are useful to assess the meaning of a change in Y as a function of a change in X. These include, but are not limited to: the correlation coefficient (r), the coefficient of determination (R²), the slope (K1), the intercept (K0), the z-statistic, and of course the respective confidence limits for these statistical parameters. The use of graphical representation is also a powerful tool for discerning the relationships between X and Y paired data sets. The specific software used for this pedagogical exercise is MathCad 2001i (© MathSoft Engineering & Education, Inc., 101 Main Street, Cambridge, MA 02142-1521), which we find particularly useful for describing the precise mathematics employed behind each set of examples. The mathematical tools used here may be employed whenever the assumptions of linear correlation are suspected or assumed for a set of X and Y data. The data set used for this example is from Miller and Miller ([1], p. 106) as shown in Table 58-1. This data set is used so that the reader may compare the statistics calculated and displayed using the formulas and figures described in this reference with those shown in this series of chapters. The correlation coefficient and other goodness of fit parameters can be properly evaluated using standard statistical tests.
The Worksheets provided in this chapter series can be customized for specific applications providing the optimum information for particular method comparisons and validation studies. When performing X and Y linear regression computations there are several general assumptions. One is assuming that if the correlation between X and Y is significantly large then some cause-and-effect relationship could possibly exist between changes in X, and changes in Y . However, it is important to remember that probability alone tells us only if X and Y “appear” to be related. If no cause-effect relationship exists between X and Y , the regression model will have no true predictive importance. Thus knowledge of cause-and-effect creates a basis for decision making when using regression models. Limitations of inferences derived from probability and statistics arise from limited knowledge of the characteristics and stability of: the nature and origins of the set of samples used for X and Y comparison; the characteristics of the measuring instrument(s) used for collecting both X and Y data; the set of operators performing the measurements; and the precise set of measurement or experimental conditions.


Table 58-1 Data used for this study of regression and correlation

Index    X       Y
0        0       2.1
1        2       5
2        4       9
3        6       12.6
4        8       17.3
5        10      21
6        12      24.7

Source: Miller & Miller data (p. 106).

One must note that probability alone can only detect "alikeness" in special cases; thus cause-effect cannot be directly determined, only estimated. If linear regression is to be used for comparison of X and Y, one must assess whether the five assumptions for use of regression apply. As a refresher, recall that the assumptions required for the application of linear regression for comparisons of X and Y include the following: (1) the errors (variations) are independent of the magnitudes of X or Y, (2) the error distributions for both X and Y are known to be normally distributed (Gaussian), (3) the mean and variance of Y depend solely upon the absolute value of X, (4) the mean of each Y distribution is a straight-line function of X, and (5) the variance of X is zero, while the variance of Y is exactly the same for all values of X.
The requirement for a priori knowledge useful for providing a scientific basis for comparison of X and Y data poses several questions for the statistician or analyst when using regression as a comparative tool:
1) Is X a true predictor of Y; does cause-effect exist?
2) If X is a true predictor of Y, what is the optimum mathematical relationship to describe a measurement device response with respect to the reference data? (Such information defines the optimum mathematical tools to use for comparison.)
3) What are the effects of operator and measurement or experimental conditions on the change in X relative to Y?
4) What are the effects on X and Y of making measurements on multiple instruments with multiple operators?
5) What is the theoretical response for the X with respect to the Y?
6) What is the Limit of Detection (LOD) relative to changes in X and Y? Is this limit acceptable for the intended application?
In routine comparisons of X and Y data for spectroscopic analysis, when X and Y denote a comparison of the reference analytical results (X) versus instrument response (Y ), at least three main categories of modeling problems are found:


1) The technique is not optimal: the instrument response (Y) is a predictor of analyte values (X). The limitation for modeling is in the representation of calibration set chemistry, sample presentation, and unknown variations of instrument and operator during measurement.
2) There is no clear, specific analyte signal: the instrument response (Y) does not change adequately with a variation in the analyte value (X). This phenomenon indicates that small changes in analyte concentration are not detected by the measurement instrument. Different or additional instrument response information is required to describe the analyte (the problem is underdetermined).
3) The instrument response (Y) changes dramatically with little or no change in analyte value (X). In this example additional clarification is required to define the relationship between the analyte value and the spectroscopic/chemical data for the sample, as interfering factors other than analyte concentration are affecting the instrument response.
Factors affecting the integrity of spectroscopic data include the variations in sample chemistry, the variations in the physical condition of samples, and the variation in measurement conditions. Calibration data sets must represent several sample "spaces" to include compositional space, instrument space, and measurement or experimental condition space (e.g., sample handling and presentation spaces). Interpretive spectroscopy where spectra-structure correlations are understood is a key intellectual process in approaching spectroscopic measurements if one is to achieve an understanding in the X and Y relationships of these measurements.
The main concept addressed in this new multi-part series is the idea of correlation. Correlation may be referred to as the apparent degree of relationship between variables. The term apparent is used because there is no true inference of cause-and-effect when two variables are highly correlated.
One may assume that cause-and-effect exists, but this assumption cannot be validated using correlation alone as the test criterion. Correlation has often been referred to as a statistical parameter seeking to define how well a linear or other fitting function describes the relationship between variables; however, two variables may be highly correlated under a specific set of test conditions, and not correlated under a different set of experimental conditions. In this case the correlation is conditional and so also is the cause-and-effect phenomenon. If two variables are always perfectly correlated under a variety of conditions, one may have a basis for cause-and-effect, and such a basic relationship permits a well-defined mathematical description. For example, the volume of a cube is perfectly correlated to the length of each side as V = s³. Likewise the volume of a sphere is perfectly correlated to its radius as V = (4/3)πr³. However, the mass of such objects will be highly correlated to s or r only when the densities (d) of the materials used to form the shapes are identical, since d = mass/volume. There is no correlation of mass to s or r when vastly different densities of material are used for comparison. Thus a first-order approximation for s and r vs. mass for widely different materials would lead one to believe that there is not a relationship between volume and mass. Conversely, when working with the same material one would find that volume and mass are perfectly correlated and that there is a direct relationship between volume and mass irrespective of shape. This simple example points to the requirements for a deeper understanding of the underlying phenomena in order to draw conclusions regarding cause and effect based on correlation.
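The cube example can be illustrated numerically. The following is a minimal sketch with made-up numbers: the side lengths, the single density of 2.7, and the list of mixed densities are all hypothetical values chosen only to show the effect, and `pearson_r` is a hand-written helper rather than any particular package's function.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    s_aa = sum((p - mean_a) ** 2 for p in a)
    s_bb = sum((q - mean_b) ** 2 for q in b)
    return s_ab / math.sqrt(s_aa * s_bb)


# Hypothetical cubes: side lengths and the resulting volumes V = s^3.
sides = [1, 2, 3, 4, 5, 6]
volumes = [s ** 3 for s in sides]

# Case 1: every cube is made of the same material (one density).
mass_same = [2.7 * v for v in volumes]

# Case 2: every cube is made of a different material (hypothetical densities).
densities = [19.3, 0.5, 7.9, 0.9, 2.7, 0.1]
mass_mixed = [d * v for d, v in zip(densities, volumes)]

r_same = pearson_r(volumes, mass_same)
r_mixed = pearson_r(volumes, mass_mixed)
print(f"same material:   r = {r_same:.4f}")
print(f"mixed materials: r = {r_mixed:.4f}")
```

For a single material the correlation between volume and mass is 1 to within rounding; for the mixed densities it falls to a low value, even though mass = d × volume holds exactly for every object in both cases.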


In spectroscopic problems one may observe a high correlation with several data sets, whereas there is poor correlation with other data sets. The underlying cause can often be rich in information content and will lead to a deeper understanding of the problem and underlying phenomena involved. Simply using correlation will not produce this learning if one looks no deeper. However, there are statistical tests which may be applied when using correlation that will help one assess the significance and meaning of correlation for specific test cases. It should be pointed out that when only two variables are compared for correlation, this is referred to as simple correlation. However, when more than two variables are compared for correlation this is termed multiple correlation. In spectroscopy correlation is used in two main ways: (1) for calibration of the instrument response (Y) at one or more channels as absorbance or reflectance of the sample at some wavelength or series of wavelengths to the known analyte property (X) for that sample; and (2) following calibration the predicted analyte concentration (Y) is compared (using correlation) to the known analyte concentration (X). Although correlation contains information regarding the relationship between two or more variables, a powerful visual tool indicating the relationship between variables is given in the use of scatter diagrams. Scatter diagrams indicate correlation, bias, nonlinearity, outliers, and subclasses. With practice one may train the eye to identify these potential effects quite easily. For example, observe the four figures (58-1a through 58-1d) below. The scatter plot illustration in Figure 58-1 demonstrates the power of visual aid to qualitatively assess the potential relationship between two or more variables. Figure 58-1a illustrates a positive, high correlation between X and Y. Figure 58-1b indicates no real correlation between the variables. Figure 58-1c demonstrates a high, negative correlation between the variables. Figure 58-1d shows several phenomena in the relationship between X and Y. An initial observation indicates that there are three potential outlier samples, one above the line in the upper left hand corner, and two beneath the line in the lower right hand corner. These three data points possibly represent two types of samples that are unlike the majority of the samples near the line. If the reference data are accurate these three samples may be outliers and represent some unexplained phenomena. The majority of the samples are plotted near the regression line and potentially represent a nonlinear relationship between X and Y. Thus a scatter plot of X versus Y with a linear regression line overlay is useful as a powerful data analysis tool.

Figure 58-1 (panels a through d) An illustration of the use of scatter plots for gleaning visual information with respect to the correlation between variables X (abscissa) and Y (ordinate).

The quantitative description of the relationship between two or more variables is often addressed using a least squares regression line referred to as linear regression. Linear regression, as an example of Y on X linear regression, between two data sets involves the relationship

$$Y = K_1 X + K_0 \tag{58-1}$$

where Y is the dependent variable as the estimated or predicted value, X is the independent variable or often the measured value, K1 is the slope or linear regression coefficient, and K0 is the intercept for the regression line. The statistical tools used here are provided as a MathCad 2001i Professional Worksheet, which can be further customized for specific applications. The Worksheet includes graphical comparisons of the correlation coefficient (r), the coefficient of determination (R²), the standard deviation of the calibration samples (Sr), and the standard error of estimate (SEE). Also included are a method for computing the confidence limits for the correlation coefficient; a method for comparing correlation coefficients for different size populations; and a method for computing the confidence limits for the slope and intercept of a data set. All these statistical parameters are computed for user-selected confidence levels. The program provides the required tools for goodness of fit confidence testing when developing validated methods for X and Y comparisons. The use of linear regression as a statistical tool is a standard technique for comparison of two sets of data X and Y where a linear relationship between a change in X (ΔX) and a change in Y (ΔY) is suspected. Calibration problems associated with instrumental methods often use this technique over a linear dynamic range. This set of chapters and the accompanying MathCad program (shown later) provide the required tools for goodness of fit confidence testing when working with regression for multiple purposes, including developing validation of analytical methods. The use of statistics to calculate the coefficient of determination (R-squared, R²), the correlation coefficient (r), slope, and intercept is routine and uncomplicated, yet for some reason equally elementary statistics such as significance testing for these statistical parameters are not often demonstrated in analytical papers or reports.
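As a concrete illustration of Equation 58-1, the following minimal Python sketch fits the Y on X regression line to the Table 58-1 data by ordinary least squares. It is not the MathCad Worksheet; the function name `fit_line` is hypothetical, and the data are typed in from Table 58-1.

```python
# Table 58-1 data: X is the analyte value, Y the instrument response.
X = [0, 2, 4, 6, 8, 10, 12]
Y = [2.1, 5.0, 9.0, 12.6, 17.3, 21.0, 24.7]


def fit_line(x, y):
    """Ordinary least-squares fit of y = k1 * x + k0; returns (k1, k0)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    k1 = s_xy / s_xx          # slope
    k0 = mean_y - k1 * mean_x  # intercept
    return k1, k0


k1, k0 = fit_line(X, Y)
print(f"Y = {k1:.4f} * X + {k0:.4f}")
```

For these data the slope comes out near 1.93 and the intercept near 1.52.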
Varying parameters such as the level of confidence, the number of samples (n) in the calibration set, the standard error of estimate (SEE), and the standard deviation of the range of data (Sr) will have dramatic effects on the meaning or interpretation of "goodness of fit" statistics such as the coefficient of determination and correlation. This series of chapters provides several sets of tools useful for evaluating all of the aforementioned statistics at user-selected confidence levels. The general statistical tools to be described are
1) A graphical comparison of the correlation coefficient (r), the coefficient of determination (R²), with the standard deviation of the calibration sample analyte values (Sr) as compared to the standard error of estimate (SEE). Note: Sr is a MathCad program symbol.
2) A graphical comparison of the correlation coefficient (r) and the standard error of estimate (SEE) for a calibration model.
3) A Worksheet for computing the confidence limits for the correlation coefficient at user-selected confidence levels.
4) A method and Worksheet for comparing correlation coefficients for different size populations at user-selected confidence levels.
5) A method and Worksheet for computing the confidence limits for the slope and intercept of a data set at user-selected confidence levels.

REFERENCE 1. Miller, J.C. and Miller, J.N., Statistics for Analytical Chemistry, 2nd ed. (Ellis Horwood, New York, 1992).

59

Comparison of Goodness of Fit Statistics for Linear Regression: Part 2 – The Correlation Coefficient

This chapter is a continuation of Chapter 58 describing the use of goodness of fit statistical parameters [1]. When developing a calibration for quantitative analysis one must select the analyte range over which the calibration is performed. For a given standard error of analysis the size of the range will have a direct effect on the magnitude of the correlation coefficient. The standard deviation of Y also has a direct effect. This is obviously the case, as demonstrated by noting the computation for correlation between X and Y, in matrix notation, denoted as

$$r = \frac{\operatorname{covar}(X, Y)}{\operatorname{stdev}(X) \cdot \operatorname{stdev}(Y)} \tag{59-2}$$

Note for this example that covar(X, Y) represents the covariance of (X, Y), stdev(X) is the standard deviation of the X data, and stdev(Y) is the standard deviation of the Y data. For the MathCad program (© 1986-2001 MathSoft Engineering & Education, Inc., 101 Main Street, Cambridge, MA 02142-1521), the stdev(X) is represented by the variable symbol Sr, which can be thought of as the set of many possible standard deviations for a set of data X. Thus a comparison of the correlation coefficient between two or more sets of X, Y data pairs cannot be adequately performed unless the standard deviations of the two data sets are nearly identical or unless the correlation coefficient confidence limits for the data sets are compared. In summary, if one Set A of X, Y paired data has a correlation of 0.95, this does not necessarily indicate that it is more highly correlated than a second Set B of X, Y paired data with a correlation of, say, 0.90. The meaning of this will be described in greater detail later. Let us look at seven slightly different equations (r1 through r7, or Equations 59-7 through 59-13) for calculating correlation between X (known concentration or analyte data for a set of standards) and Y (instrument measured data for those standards) using MathCad function or summation notation nomenclature. First we must define the calculation of the standard error of performance, also termed the standard error of prediction (SEP), and the calculations for the slope (K1) and the intercept (K0) for the linear regression line between X and Y. The regression line for estimating the concentration, denoted by PredX or $\hat{X}$, is given as:

$$\text{PredX} = \hat{X} = K_1 Y + K_0 \tag{59-3}$$


The standard error of performance, also termed the "standard error of prediction" (SEP), which represents an estimate of the prediction error (1 sigma) for a regression line, is given as:

$$\text{SEP} = \sqrt{\frac{\sum \left(\hat{X} - X\right)^2}{n}} \tag{59-4}$$

The slope of the line (K1) for this regression line is given as:

$$K_1 = \frac{n \sum (Y \cdot X) - \sum Y \cdot \sum X}{n \sum Y^2 - \left(\sum Y\right)^2} \tag{59-5}$$

and the intercept (K0) as:

$$K_0 = \frac{\sum Y^2 \cdot \sum X - \sum Y \cdot \sum (Y \cdot X)}{n \sum Y^2 - \left(\sum Y\right)^2} \tag{59-6}$$
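The summation forms of Equations 59-4 through 59-6 translate almost line for line into code. The sketch below applies them to the Miller and Miller example data (Table 58-1), regressing X on Y as in Equation 59-3; it is an illustration in Python, not the MathCad Worksheet, and the variable names are our own.

```python
import math

# Example data: X = known analyte values, Y = instrument readings.
X = [0, 2, 4, 6, 8, 10, 12]
Y = [2.1, 5.0, 9.0, 12.6, 17.3, 21.0, 24.7]
n = len(X)

sum_x = sum(X)
sum_y = sum(Y)
sum_yx = sum(y * x for y, x in zip(Y, X))
sum_yy = sum(y * y for y in Y)

# Common denominator of Equations 59-5 and 59-6.
denom = n * sum_yy - sum_y ** 2

K1 = (n * sum_yx - sum_y * sum_x) / denom        # slope, Equation 59-5
K0 = (sum_yy * sum_x - sum_y * sum_yx) / denom   # intercept, Equation 59-6

# Predicted X for each reading (Equation 59-3) and the SEP (Equation 59-4).
pred_x = [K1 * y + K0 for y in Y]
SEP = math.sqrt(sum((p - x) ** 2 for p, x in zip(pred_x, X)) / n)

print(f"K1 = {K1:.5f}, K0 = {K0:.5f}, SEP = {SEP:.5f}")
```

Note that the SEP here divides by n, as written in Equation 59-4, rather than by n − 2; other texts and software may use a different denominator.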

The seven ways (r1 through r7) for calculating correlation as the square root of the ratio of the explained variation over the total variation between X (concentration of analyte data) and Y (measured data) are described using many notational forms. For example, many software packages provide built-in functions capable of calculating the coefficient of correlation directly from a pair of X and Y vectors as given by r1 (Equation 59-7).

$$r_1 = \operatorname{corr}(X, Y) \tag{59-7}$$

[This is the built-in MathCad correlation function.] Several software packages contain simple command lines for performing matrix computations directly and thus are conveniently capable of computing the correlation coefficient, for example as in r2 (Equation 59-8).

$$r_2 = \frac{\operatorname{covar}(X, Y)}{\operatorname{stdev}(X) \cdot \operatorname{stdev}(Y)} \tag{59-8}$$

[Equation 59-8 denotes the ratio of the covariance of X on Y to the standard deviation of X times the standard deviation of Y.] If the software is capable of using summation notation, such as in the standard capabilities of MathCad, then one may use this algebraic form for calculating the correlation as in r3 and r4 (Equations 59-9 and 59-10, respectively).

$$r_3 = \sqrt{\frac{\sum \left(\hat{X} - \bar{X}\right)^2}{\sum \left(X - \bar{X}\right)^2}} \tag{59-9}$$


[Equation 59-9 is the square root of the ratio comprised of the sum of the squared differences between each predicted X and the mean of all X, to the sum of the squared differences between all individual X values and the mean of all X.]

$$r_4 = \sqrt{1 - \frac{\sum \left(\hat{X} - X\right)^2}{\sum \left(X - \bar{X}\right)^2}} \tag{59-10}$$

[Equation 59-10 denotes the square root of one minus the ratio comprised of the sum of the squared differences between each predicted X and its corresponding X, to the sum of the squared differences between all individual X values and the mean of all X.] And if the software allows you to assign variable names as needed for specific computations, such as SEP or standard deviations, then you may proceed to use computational descriptions such as r5 and r6 (Equations 59-11 and 59-12, respectively) to compute the correlation.

$$r_5 = \sqrt{1 - \frac{\text{SEP}^2}{\operatorname{stdev}(X)^2}} \tag{59-11}$$

[Equation 59-11 indicates that the correlation coefficient is represented by the square root of one minus the ratio comprised of the square of the standard error of performance, to the square of the standard deviation of all X.]

$$r_6 = \sqrt{1 - \left(\frac{\text{SEP}}{\operatorname{stdev}(X)}\right)^2} \tag{59-12}$$

[Equation 59-12, of course, is simply the algebraic equivalent of the equation found above.] Another computational method for correlation is given in Miller and Miller (reference [2], p. 105) as r7, shown in Equation 59-13.

$$r_7 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\left\{\left[\sum_i (x_i - \bar{x})^2\right]\left[\sum_i (y_i - \bar{y})^2\right]\right\}^{1/2}} \tag{59-13}$$

You may be surprised that for our example data from Miller and Miller ([2], p. 106), the correlation coefficient calculated using any of these methods of computation for the r-value is 0.99887956534852. When we evaluate the correlation computation we see that, given a relatively equivalent prediction error, represented as $(\hat{X} - X)$, $\sum (\hat{X} - X)$, or SEP, the standard deviation of the data set (X) determines the magnitude of the correlation coefficient. This is illustrated using Graphics 59-1a and 59-1b. These graphics allow the correlation coefficient to be displayed for any specified standard error of prediction, also occasionally denoted as the standard error of estimate (SEE). It should be obvious that for any statistical study one must compare the actual computational recipes used to make a calculation, rather than rely on the more or less non-standard terminology and assume that the computations are what one expected.
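The agreement among the different formulas can be checked with a short script. The sketch below evaluates r2 through r7 (Equations 59-8 through 59-13) in Python for the example data; r1 is omitted because it is simply a package's built-in function. We assume here that `covar` and `stdev` are the population forms (dividing by n), which is what makes Equations 59-11 and 59-12 consistent with the SEP of Equation 59-4; the helper names are our own.

```python
import math

X = [0, 2, 4, 6, 8, 10, 12]                  # known analyte values
Y = [2.1, 5.0, 9.0, 12.6, 17.3, 21.0, 24.7]  # instrument readings
n = len(X)


def mean(v):
    return sum(v) / len(v)


def covar(a, b):
    """Population covariance (divides by n); an assumption, see lead-in."""
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)


def stdev(a):
    """Population standard deviation (divides by n)."""
    return math.sqrt(covar(a, a))


# Regression of X on Y (Equation 59-3): PredX = K1 * Y + K0.
k1 = covar(X, Y) / covar(Y, Y)
k0 = mean(X) - k1 * mean(Y)
pred_x = [k1 * y + k0 for y in Y]

x_bar, y_bar = mean(X), mean(Y)
ss_resid = sum((p - x) ** 2 for p, x in zip(pred_x, X))
ss_model = sum((p - x_bar) ** 2 for p in pred_x)
ss_total = sum((x - x_bar) ** 2 for x in X)
sep = math.sqrt(ss_resid / n)                # Equation 59-4

results = {
    "r2": covar(X, Y) / (stdev(X) * stdev(Y)),
    "r3": math.sqrt(ss_model / ss_total),
    "r4": math.sqrt(1 - ss_resid / ss_total),
    "r5": math.sqrt(1 - sep ** 2 / stdev(X) ** 2),
    "r6": math.sqrt(1 - (sep / stdev(X)) ** 2),
    "r7": sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
    / math.sqrt(ss_total * sum((y - y_bar) ** 2 for y in Y)),
}

for name, value in results.items():
    print(f"{name} = {value:.12f}")
```

All six values agree with the quoted r-value to well within floating-point rounding. Note that the square-root forms (r3 through r6) return only the magnitude of r, so they would not reproduce the sign of a negative correlation.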

Graphic 59-1a r versus Sr of data range. (Ordinate: correlation coefficient r(Sr), 0 to 1. Abscissa: Sr, standard deviation of range, 0 to 4.)

For a graphical comparison of the correlation [r(Sr)] and the standard deviation of the samples used for calibration (Sr), a value is entered for the SEP (or SEE) for a specified analyte range as indicated through the standard deviation of that range (Sr). The resultant graphic displays the Sr (as the abscissa) versus the r (as the ordinate). From this graphic it can be seen how the correlation coefficient increases with a constant SEP as the standard deviation of the data increases. Thus when comparing correlation results for analytical methods, one must consider carefully the standard deviation of the analyte values for the samples used in order to make a fair comparison. For the example shown, the SEE is set to 0.10, while the correlation is scaled from 0.0 to 1.0 for Sr values from 0.10 to 4.0.

Graphic 59-1b r versus Sr of data range. (Ordinate: correlation coefficient r(Sr), 0.99 to 1. Abscissa: Sr, standard deviation of range, 0 to 4.)

This figure shows the region of Graphic 59-1a where the correlation is above 0.99. Note that the correlation begins to flatten when the Sr is more than an order of magnitude larger than the SEE.

Graphic 59-1c r versus Sr of data range. (Ordinate: correlation coefficient r(Sr), 0.86 to 1. Abscissa: Sr, standard deviation of range, 0.2 to 0.6.)

Note from this figure (Graphic 59-1c) that at a certain value for the standard deviation of X (denoted as Sr), a small change in the Sr results in a large apparent change in the correlation. For example, in this case where the SEE is set to 0.10, the correlation changes from 0.86 to 0.95 when the Sr is changed only from 0.20 to 0.32. As is the general case, using correlation to compare analytical methods requires identical sample analyte standard deviations, or comparison of the confidence limits for the correlation coefficients, in order to interpret the significance of the different correlation values.
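The curves in Graphics 59-1a through 59-1c follow directly from Equation 59-12 with the SEE in place of the SEP. The following is a minimal Python sketch of that relationship, assuming this is indeed the function plotted; the function name `r_from_sr` is hypothetical.

```python
import math


def r_from_sr(sr, see):
    """Correlation implied by Equation 59-12: r = sqrt(1 - (SEE/Sr)^2).

    sr  : standard deviation of the analyte values (Sr)
    see : standard error of estimate (SEE), in the same units
    """
    if sr < see:
        raise ValueError("Sr must be at least as large as SEE")
    return math.sqrt(1.0 - (see / sr) ** 2)


SEE = 0.10
for sr in (0.20, 0.32, 0.60, 1.00, 4.00):
    print(f"Sr = {sr:4.2f}  Sr/SEE = {sr / SEE:5.1f}  r = {r_from_sr(sr, SEE):.4f}")
```

With the SEE fixed at 0.10, the function returns about 0.866 at Sr = 0.20 and about 0.950 at Sr = 0.32, reproducing the change quoted above, and about 0.995 once Sr is ten times the SEE.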

Graphic 59-2 R² versus Sr of data range. (Ordinate: coefficient of determination R²(Sr), 0 to 1. Abscissa: Sr, standard deviation of range, 0 to 4.)

For a graphical comparison of the coefficient of determination (R²) and the standard deviation of the calibration samples (Sr), a value is entered for the SEE for a specified range of Sr. The resultant graphic displays the Sr (abscissa) versus R² (ordinate). From this graph it can be seen how the coefficient of determination increases as the standard deviation of the data increases. The SEE is set at 0.10 as in the examples shown in Graphics 59-1a and 59-1b. Note that the same recommendation holds whether using r or R²: relative comparisons for this statistic should not be used unless the standard deviations of the comparative data sets are identical.

Graphic 59-3 r versus Sr/SEE. (Ordinate: correlation coefficient r(Sr), 0.96 to 1. Abscissa: R(Sr), ratio of Sr/SEE, 0 to 40.)

Graphic 59-3 shows the ratio of the range (Sr) to the SEE (abscissa) as compared to the correlation coefficient r (ordinate). From this graph it can be seen that the correlation coefficient continues to increase with the ratio Sr/SEE, even when the ratio exceeds 60. Note, however, that when the ratio is greater than 10 there is not much further improvement in the correlation.

Graphic 59-4 r versus SEE. (Ordinate: correlation coefficient r(SEE), 0 to 1. Abscissa: SEE, standard error of estimate, 0 to 4.)

A graphical comparison of the correlation coefficient (r) versus the standard error of estimate (SEE) is shown in Graphic 59-4. This graphic clearly shows that when the Sr is held constant (Sr = 4) the correlation decreases as the SEE increases.

Graphic 59-5 r versus SEE/Sr. (Ordinate: correlation coefficient r(SEE), 0 to 1. Abscissa: R(SEE), ratio of SEE/Sr, 0 to 1.)

This graphic shows the relationship between correlation and the ratio of SEE/Sr: as the SEE increases relative to the Sr, the correlation decreases rapidly.

REFERENCES
1. Workman, J. and Mark, H., "Chemometrics in Spectroscopy: Comparison of Goodness of Fit Statistics for Linear Regression – Part 1, Introduction", Spectroscopy 19(4), 32–35 (2004).
2. Miller, J.C. and Miller, J.N., Statistics for Analytical Chemistry, 2nd ed. (Ellis Horwood, New York, 1992).


60

Comparison of Goodness of Fit Statistics for Linear Regression: Part 3 – Computing Confidence Limits for the Correlation Coefficient

In this chapter, as a continuation of Chapters 58 and 59 [1, 2], the confidence limits for the correlation coefficient are calculated for a user-selected confidence level. The user selects the test correlation coefficient, the number of samples in the calibration set, and the confidence level. A MathCad Worksheet (© MathSoft Engineering & Education, Inc., 101 Main Street, Cambridge, MA 02142-1521) is used to calculate the z-statistic for the lower and upper limits and computes the appropriate correlation for the z-statistic. The upper and lower confidence limits are displayed. The Worksheet also contains the tabular calculations for any set of correlation coefficients (given as ρ). A graphic showing the general case entered for the table is also displayed. For n pairs of values (X, Y) the set of pairs may be interpreted as a subset of the entire population of X and Y values throughout some larger population of samples. For example, X and Y may constitute all possible combinations of an instrument response (Y) and an analyte concentration (X) in a specific solvent matrix. The population correlation coefficient may be referred to by the Greek letter rho (ρ), which may be estimated using the correlation coefficient computed for a specific subset of values, designated as (r). It is known that tests of significance can be performed on a measured r to determine if it is significantly different from another r calculated from a different subset of X, Y values. Any specific r calculated from a subset of X, Y values may also be compared to the estimated population correlation for all such possible samples, ρ. When a hypothesis test is used to calculate whether ρ is statistically equal to zero, the distribution is approximated using the Student's t distribution. When ρ is tested to be not equal to zero the use of the Fisher transformation produces a statistic which is normally distributed.
This transformation is referred to as Fisher's Z transformation (i.e., the Z-statistic). The z-statistic for testing a non-zero population correlation is given by equation 60-14 as Z1, where e = 2.71828. A good discussion of this is found in reference [3].

$$Z_1 = 0.5 \cdot \log_e\left(\frac{1+r}{1-r}\right) \tag{60-14}$$

A more standard form (equation 60-15) used for computational purposes is

$$Z_1 = 1.1513 \cdot \log_{10}\left(\frac{1+r}{1-r}\right) \tag{60-15}$$


The confidence limits for a correlation coefficient for a given number of X, Y pairs (n) at a specified confidence level are calculated as Z2 (Equation 60-16).

$$Z_2 = 1.1513 \cdot \log_{10}\left(\frac{1+r}{1-r}\right) \pm z \cdot \left(\frac{1}{\sqrt{n-3}}\right) \tag{60-16}$$

Note that the z-statistic is computed as z or is available from standard statistical tables as the Student's t distribution such that confidence levels of 0.90, 0.95, 0.98, and 0.99 correspond to t0.050, t0.025, t0.010, and t0.005, respectively. At infinite n of X, Y pairs the corresponding z-values are 1.645, 1.960, 2.326, and 2.576. For a specific example problem, we may calculate the confidence limits for r = 0.8 and n = 21 at a 95% confidence level [3]. Then Z2 for this problem is given by equation 60-17.

$$Z_2 = 1.1513 \cdot \log_{10}\left(\frac{1+0.80}{1-0.80}\right) \pm 1.96 \cdot \left(\frac{1}{\sqrt{21-3}}\right) = 0.6366 \text{ to } 1.5606 \tag{60-17}$$

Then it follows that, solving for ρ using 0.6366 and 1.5606 substituted individually into the ZLL and ZUL equations below (i.e., equations 60-18 and 60-19), we calculate 0.563 and 0.916 as the lower and the upper confidence limits, respectively, for the correlation coefficient of 0.80 and n = 21, as shown in the equations (i.e., ZLL and ZUL) and Graphics 60-6a and 60-6b.

Lower Limit:

$$Z_{LL} = 0.6366 = 1.1513 \cdot \log_{10}\left(\frac{1+\rho_{LL}}{1-\rho_{LL}}\right) \Rightarrow \rho_{LL} = 0.5626 \tag{60-18}$$

Upper Limit:

$$Z_{UL} = 1.5606 = 1.1513 \cdot \log_{10}\left(\frac{1+\rho_{UL}}{1-\rho_{UL}}\right) \Rightarrow \rho_{UL} = 0.9155 \tag{60-19}$$

A graphic or tabular data display can be generated for any z-statistic value given a population correlation coefficient, ρ. This is accomplished by using the Fisher's Z transformation (i.e., the Z-statistic) computation as (equation 60-20)

$$Z = 1.1513 \cdot \log_{10}\left(\frac{1+\rho}{1-\rho}\right) \tag{60-20}$$

In summary, for any stated value of the population correlation (ρ), the z statistic is denoted as Z, and the corresponding correlation confidence limits can be determined. For our example, the Z statistic of 0.6366 corresponding to the lower correlation coefficient confidence limit is shown in the graphic below (Graphic 60-6a) as having a value of 0.562575; this represents the lower confidence limit for the correlation coefficient for this example.

Graphic 60-6a The z statistic is denoted as Z, and the corresponding correlation confidence (ρ) lower limit can be graphically displayed for our example. (Ordinate: z-statistic Z(ρ), 0.63656 to 0.63663. Abscissa: ρ, correlation coefficient, 0.56255 to 0.5626.)

Likewise for this example, the Z statistic of 1.5606, corresponding to the upper correlation coefficient confidence limit, is shown in Graphic 60-6b to correspond to ρ = 0.91551; this represents the upper confidence limit for the 0.80 correlation example problem. Finally then, for the example problem the correlation confidence limits are from 0.562575 to 0.91551 (i.e., 0.56 to 0.92).

[Graphic 60-6b: plot of the z-statistic, Z(ρ), against the correlation coefficient, ρ, in the neighborhood of Z = 1.5606 and ρ = 0.91551.]

Graphic 60-6b The z statistic is denoted as Z(ρ), and the corresponding upper correlation confidence limit can be graphically displayed for our example.
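
The worked example above can also be reproduced numerically. The following Python sketch is not part of the MathCad Worksheet; the function names are illustrative assumptions, and the critical value 1.96 is simply the tabulated z-value for 95% confidence quoted in the text. It evaluates equation 60-17 and then inverts equation 60-20 to recover the limits for ρ.

```python
import math

def fisher_z(r):
    """Fisher Z transformation in the form of equation 60-20."""
    return 1.1513 * math.log10((1.0 + r) / (1.0 - r))

def inverse_fisher_z(z):
    """Solve equation 60-20 for the correlation coefficient."""
    ratio = 10.0 ** (z / 1.1513)  # equals (1 + rho) / (1 - rho)
    return (ratio - 1.0) / (ratio + 1.0)

def correlation_confidence_limits(r, n, z_crit=1.96):
    """Return (Z_LL, Z_UL, rho_LL, rho_UL) for r computed from n pairs."""
    half_width = z_crit / math.sqrt(n - 3)
    z_center = fisher_z(r)
    z_ll = z_center - half_width
    z_ul = z_center + half_width
    return z_ll, z_ul, inverse_fisher_z(z_ll), inverse_fisher_z(z_ul)

# Example from the text: r = 0.80, n = 21, 95% confidence
z_ll, z_ul, rho_ll, rho_ul = correlation_confidence_limits(0.80, 21)
print(f"Z limits:   {z_ll:.4f} to {z_ul:.4f}")
print(f"rho limits: {rho_ll:.4f} to {rho_ul:.4f}")
```

Under these assumptions the printed limits should agree with equations 60-17 through 60-19 to the precision shown there.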


TESTING CORRELATION FOR DIFFERENT SIZE POPULATIONS

The following description and corresponding MathCad Worksheet allow the user to test whether two correlation coefficients are significantly different, based on the number of sample pairs (N) used to compute each correlation. For the Worksheet, the user enters the confidence level for the test (e.g., 0.95), the two correlation coefficients to be compared, r1 and r2, and the respective numbers of paired (X, Y) samples, N1 and N2. The corresponding z statistic is then computed and the hypothesis test is performed. A Test result of 0 indicates a significant difference between the correlation coefficients; a Test result of 1 indicates no significant difference in the correlation coefficients at the selected confidence level. Again we will use a standard example [3] where r1 is 0.5, with n1 as 28; r2 is 0.3 with n2 of 35. The typical confidence level is 0.95 and the z-value statistic for this level is 1.96. Note here again that the z-statistic can be computed, or taken from standard statistical tables of the Student's t distribution, where confidence levels of 0.90, 0.95, 0.98, and 0.99 correspond to t_{0.050}, t_{0.025}, t_{0.010}, and t_{0.005}, respectively. At infinite n (in practice, greater than 120) of X, Y pairs the corresponding z-values are 1.645, 1.960, 2.326, and 2.576. The test statistic for this problem is given as equation 60-21.

$$Z_n = \frac{1.1513 \cdot \log_{10}\left(\frac{1 + r_1}{1 - r_1}\right) - 1.1513 \cdot \log_{10}\left(\frac{1 + r_2}{1 - r_2}\right)}{\sqrt{\frac{1}{n_1 - 3} + \frac{1}{n_2 - 3}}} \tag{60-21}$$

The null hypothesis for this problem is that the two correlation coefficients r1 and r2 are statistically the same (i.e., r1 = r2). The alternative hypothesis is then r1 ≠ r2. If the absolute value of the test statistic Zn is greater than the absolute value of the z-statistic, then the null hypothesis is rejected and the alternative hypothesis accepted: there is a significant difference between r1 and r2. If the absolute value of Zn is less than the z-statistic, then the null hypothesis is accepted and the alternative hypothesis is rejected; thus there is not a significant difference between r1 and r2. Let us look at a standard example again (equation 60-22).

$$Z_n = \frac{1.1513 \cdot \log_{10}\left(\frac{1 + 0.5}{1 - 0.5}\right) - 1.1513 \cdot \log_{10}\left(\frac{1 + 0.3}{1 - 0.3}\right)}{\sqrt{\frac{1}{28 - 3} + \frac{1}{35 - 3}}} \tag{60-22}$$

Here Zn = 0.89833; therefore Zn, the test statistic, is less than 1.96, the z-statistic, and the null hypothesis is accepted: there is not a significant difference between the correlation coefficients. In a second example, which may be more typical, let us see what happens when r1 is 0.87 and r2 is 0.96, with n1 as 20 and n2 as 25. At a confidence level of 0.95, we use the above equation for Zn and find that there is not a significant difference (i.e., |Zn| = 1.8978, which is less than 1.96). The use of this statistical test emphasizes the point that comparison of correlation coefficients for small numbers of sample pairs is definitely “risky” business when confidence limits and statistical hypothesis testing are not used. In our experience we have seen analytical techniques and methods accepted or rejected by large research organizations using the “correlation eye-balling” test, where the method is accepted or rejected solely on a relative comparison of correlation coefficients, without the benefit of computing the confidence limits! This is a somewhat common, but easily preventable, mistake.
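
As an illustration of the test described in this section, the following Python sketch evaluates equation 60-21 for both examples. It is a minimal sketch, not the MathCad Worksheet itself; the function names are assumptions made for illustration, the critical value 1.96 is the tabulated z-value for a 0.95 confidence level, and the Test convention (1 = no significant difference, 0 = significant difference) follows the Worksheet description above.

```python
import math

def fisher_z(r):
    """Fisher Z transformation, as used in equations 60-20 and 60-21."""
    return 1.1513 * math.log10((1.0 + r) / (1.0 - r))

def compare_correlations(r1, n1, r2, n2, z_crit=1.96):
    """Return (Zn, Test) for two correlations from n1 and n2 sample pairs.

    Test = 1: no significant difference; Test = 0: significant difference.
    """
    std_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    zn = (fisher_z(r1) - fisher_z(r2)) / std_error
    test = 1 if abs(zn) < z_crit else 0
    return zn, test

# First example from the text: r1 = 0.5 (n1 = 28), r2 = 0.3 (n2 = 35)
zn_a, test_a = compare_correlations(0.5, 28, 0.3, 35)
# Second example: r1 = 0.87 (n1 = 20), r2 = 0.96 (n2 = 25)
zn_b, test_b = compare_correlations(0.87, 20, 0.96, 25)

print(f"Example 1: Zn = {zn_a:.5f}, Test = {test_a}")
print(f"Example 2: Zn = {zn_b:.5f}, Test = {test_b}")
```

In both cases |Zn| is below 1.96, so the sketch reports no significant difference, in agreement with the text.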

REFERENCES

1. Workman, J. and Mark, H., “Chemometrics in Spectroscopy: Comparison of Goodness of Fit Statistics for Linear Regression – Part 1, Introduction,” Spectroscopy 19(4), 32–35 (2004).
2. Workman, J. and Mark, H., “Chemometrics in Spectroscopy: Comparison of Goodness of Fit Statistics for Linear Regression – Part 2, The Correlation Coefficient,” Spectroscopy 19(6), 29–33 (2004).
3. Spiegel, M.R., Statistics (McGraw-Hill Book Company, New York, 1961).


61 Comparison of Goodness of Fit Statistics for Linear Regression: Part 4 – Confidence Limits for Slope and Intercept

For this chapter we continue to describe the use of confidence limits for comparison of X, Y data pairs. This subject has been addressed in Chapters 58–60, first published as a set of articles in Spectroscopy [1–3]. A MathCad Worksheet (© 1986–2001 MathSoft Engineering & Education, Inc., 101 Main Street, Cambridge, MA 02142-1521) provides the computations for interested readers. It is covered in a subsequent chapter, or can be obtained in MathCad format by contacting the authors with your e-mail address. The Worksheet allows the direct calculation of the t-statistic by entering the desired confidence level. In addition, the confidence limits for the calculated slope and intercept are computed from the original data table. The confidence limits for the slope are displayed using two different sets of equations (which give identical results). The intercept confidence limits are also calculated and displayed.

For calculations of slope and intercept two sets of equations will be shown: one in summation notation, useful for application in MathCad software, and a second as given in reference [4], pp. 100–111. This is to demonstrate that the two sets of computational formulas yield the same answer. For these formulas, X represents the concentration and Y represents the instrument response. To begin, the following summation notation may be used to calculate the slope (k1) of a linear regression line given a set of X, Y paired data (equation 61-23).

$$k_1 = \frac{n \cdot \sum (X \cdot Y) - \sum X \cdot \sum Y}{n \cdot \sum X^2 - \left(\sum X\right)^2} \tag{61-23}$$

The summation notation formula for calculating the intercept (k0) of a linear regression line given a set of X, Y paired data is given as equation 61-24.

$$k_0 = \frac{\sum X^2 \cdot \sum Y - \sum X \cdot \sum (X \cdot Y)}{n \cdot \sum X^2 - \left(\sum X\right)^2} \tag{61-24}$$

In reference [4], p. 109, Miller and Miller use the following for the slope (b) calculation (equation 61-25).

$$b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} \tag{61-25}$$


The intercept (a) is given by the same authors [4] as equation 61-26.

$$a = \bar{y} - b\bar{x} \tag{61-26}$$

The reader may be surprised to learn that for the selected data the slope using either method computes to a value of 1.93035714285714, while the intercept computes to 1.51785714285715 by the summation notation method versus 1.51785714285714 by the Miller and Miller method (this difference is the probable result of computational round-off error). The confidence limits for the slope and intercept may be calculated using the Student's t statistic. The slope (k1) confidence limits are computed as shown in equation 61-27.

$$\text{Limits} = k_1 \pm \left(\frac{t}{\sqrt{n - 2}} \cdot \sqrt{\frac{\sum (Y - \hat{Y})^2}{\sum (X - \bar{X})^2}}\right) \tag{61-27}$$

Miller and Miller, pp. 110 and 111 in reference [4], cite the following equations (61-28 through 61-30) for calculation of the slope (b) confidence limits.

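$$s_{y/x} = \left\{\frac{\sum_i (y_i - \hat{y}_i)^2}{n - 2}\right\}^{1/2} \tag{61-28}$$

$$s_b = \frac{s_{y/x}}{\left\{\sum_i (x_i - \bar{x})^2\right\}^{1/2}} \tag{61-29}$$

$$\text{Limits} = b \pm t \cdot s_b \tag{61-30}$$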

As the reader may suspect by now, these methods of computation yield precisely the same answer: LL = 1.82521966597124 and UL = 2.03549461974305. The intercept (k0) confidence limits are computed as equation 61-31.

$$\text{Limits} = k_0 \pm t \cdot \sqrt{\frac{\sum (Y - \hat{Y})^2 \cdot \sum X^2}{(n - 2) \cdot n \cdot \sum (X - \bar{X})^2}} \tag{61-31}$$

Miller and Miller, pp. 111 and 112 in reference [4], cite the following equations (61-32 through 61-34) for calculation of the intercept (a) confidence limits.

$$s_{y/x} = \left\{\frac{\sum_i (y_i - \hat{y}_i)^2}{n - 2}\right\}^{1/2} \tag{61-32}$$

$$s_a = s_{y/x} \left\{\frac{\sum_i x_i^2}{n \sum_i (x_i - \bar{x})^2}\right\}^{1/2} \tag{61-33}$$

$$\text{Limits} = a \pm t \cdot s_a \tag{61-34}$$

Again the methods of computation shown yield precisely the same values: LL = 0.759700015087087 and UL = 2.27601427062721. We will be discussing a more detailed interpretation of the slope and intercept confidence limits in later chapters. However, the reader will note that the regression line for any X, Y paired data pivots about the point defined by the mean of X and the mean of Y. Thus the farther a point along the line lies from the mean of X and Y, the less the overall confidence in the relative position of the line. A more detailed description of the confidence limits surrounding any regression line using the F-distribution will be discussed later.
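
The calculations of this chapter can be sketched in a few lines of Python. This is an illustrative sketch rather than the MathCad Worksheet: the function and dictionary key names are assumptions, the data are the seven X, Y pairs used in the Worksheet supplement, and the t-value of 2.5706 (95% confidence, n − 2 = 5 degrees of freedom) is taken from the Worksheet rather than computed.

```python
import math

# Data from the Worksheet supplement (X = concentration, Y = response)
X = [0, 2, 4, 6, 8, 10, 12]
Y = [2.1, 5, 9, 12.6, 17.3, 21, 24.7]
T_CRIT = 2.5706  # t-value quoted in the Worksheet for 95% and 5 d.f.

def slope_intercept_limits(x, y, t_crit):
    """Slope, intercept and their confidence limits (equations 61-23 to 61-34)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xx = sum(v * v for v in x)
    sum_xy = sum(a * b for a, b in zip(x, y))
    denom = n * sum_xx - sum_x ** 2
    k1 = (n * sum_xy - sum_x * sum_y) / denom        # equation 61-23
    k0 = (sum_xx * sum_y - sum_x * sum_xy) / denom   # equation 61-24
    x_mean = sum_x / n
    ss_x = sum((v - x_mean) ** 2 for v in x)
    ss_res = sum((b - (k1 * a + k0)) ** 2 for a, b in zip(x, y))
    s_yx = math.sqrt(ss_res / (n - 2))               # equation 61-28
    s_b = s_yx / math.sqrt(ss_x)                     # equation 61-29
    s_a = s_yx * math.sqrt(sum_xx / (n * ss_x))      # equation 61-33
    return {
        "slope": k1,
        "intercept": k0,
        "slope_limits": (k1 - t_crit * s_b, k1 + t_crit * s_b),
        "intercept_limits": (k0 - t_crit * s_a, k0 + t_crit * s_a),
    }

result = slope_intercept_limits(X, Y, T_CRIT)
for name, value in result.items():
    print(name, value)
```

Because the t-value is entered to only five significant figures, the limits printed by this sketch agree with the Worksheet values to roughly four or five decimal places rather than to all displayed digits.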

REFERENCES

1. Workman, J. and Mark, H., “Chemometrics in Spectroscopy: Comparison of Goodness of Fit Statistics for Linear Regression – Part 1,” Spectroscopy 19(4), 32–35 (2004).
2. Workman, J. and Mark, H., “Chemometrics in Spectroscopy: Comparison of Goodness of Fit Statistics for Linear Regression – Part 2, The Correlation Coefficient,” Spectroscopy 19(6), 29–33 (2004).
3. Workman, J. and Mark, H., “Chemometrics in Spectroscopy: Comparison of Goodness of Fit Statistics for Linear Regression – Part 3, Computing Confidence Limits for the Correlation Coefficient,” Spectroscopy 19(7), 31–33 (2004).
4. Miller, J.C. and Miller, J.N., Statistics for Analytical Chemistry, 2nd ed. (Ellis Horwood, New York, 1992).

Supplement

MathCad Worksheets for Correlation, Slope and Intercept

The attached worksheet from MathCad (© 1986–2001 MathSoft Engineering & Education, Inc., 101 Main Street, Cambridge, MA 02142-1521) is used for computing the statistical parameters and graphics discussed in Chapters 58 through 61, in references [b-1–b-4]. It is recommended that the statistics incorporated into this series of Worksheets be used for evaluations of goodness of fit statistics such as the correlation coefficient, the coefficient of determination, the standard error of estimate and the useful range of calibration standards used in method development. If you would like this Worksheet sent to you, please request this by e-mail from the authors.

R-Squared Study (Y on X)

An Example of Y on X Regression: Y = k1·X + k0

Miller & Miller Data (page 106)

    Index    X      Y
    0        0      2.1
    1        2      5
    2        4      9
    3        6      12.6
    4        8      17.3
    5        10     21
    6        12     24.7

n := rows(X)

Correlation:

$$\frac{cvar(X, Y)}{stdev(X) \cdot stdev(Y)} = 0.99888$$

Methods for computing the Correlation Coefficient (r):

n = rows(X)

Slope:

$$k1x = \frac{n \cdot \sum (Y \cdot X) - \sum Y \cdot \sum X}{n \cdot \sum Y^2 - \left(\sum Y\right)^2}$$

Intercept:

$$k0x = \frac{\sum Y^2 \cdot \sum X - \sum Y \cdot \sum (Y \cdot X)}{n \cdot \sum Y^2 - \left(\sum Y\right)^2}$$

Predicted Values for X:

PredX = k1x · Y + k0x

SEP:

$$SEP = \sqrt{\frac{\sum (PredX - X)^2}{n}}$$

Correlation v1:

r1X = corr(X, Y)

r1X = 0.99887956534852

Correlation v2:

$$r2X = \frac{cvar(X, Y)}{stdev(X) \cdot stdev(Y)}$$

r2X = 0.99887956534852

Correlation v3:

$$r3X = \sqrt{\frac{\sum (PredX - mean(X))^2}{\sum (X - mean(X))^2}}$$

r3X = 0.99887956534852

Correlation v4:

$$r4X = \sqrt{1 - \left(\frac{\sum (PredX - X)^2}{\sum (X - mean(X))^2}\right)}$$

r4X = 0.99887956534852

Correlation v5:

$$r5X = \sqrt{1 - \frac{SEP^2}{stdev(X)^2}}$$

r5X = 0.99887956534852

Correlation v6:

$$r6X = \sqrt{1 - \left(\frac{SEP}{stdev(X)}\right)^2}$$

r6X = 0.99887956534852

Correlation v7:

$$r7X = \frac{\sum \left[(X - mean(X)) \cdot (Y - mean(Y))\right]}{\sqrt{\sum (X - mean(X))^2 \cdot \sum (Y - mean(Y))^2}}$$

r7X = 0.99887956534852
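
The equivalence of these ways of computing r can be checked with a short Python sketch. It is an illustration only: the variable names are assumptions, and the standard deviations and SEP are computed with a divisor of n, which is the assumption needed for the SEP-based forms to agree with the others (it is also consistent with the Worksheet value stdev(X) = 4 for these data).

```python
import math

X = [0, 2, 4, 6, 8, 10, 12]
Y = [2.1, 5, 9, 12.6, 17.3, 21, 24.7]
n = len(X)

mean_x = sum(X) / n
mean_y = sum(Y) / n
ss_x = sum((x - mean_x) ** 2 for x in X)
ss_y = sum((y - mean_y) ** 2 for y in Y)
ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))

# Regression of X on Y, as in the Worksheet (PredX = k1x * Y + k0x)
k1x = ss_xy / ss_y
k0x = mean_x - k1x * mean_y
pred_x = [k1x * y + k0x for y in Y]

ss_err = sum((p - x) ** 2 for p, x in zip(pred_x, X))
sep = math.sqrt(ss_err / n)       # divisor n, by assumption
stdev_x = math.sqrt(ss_x / n)     # population standard deviation
stdev_y = math.sqrt(ss_y / n)

r_cov = (ss_xy / n) / (stdev_x * stdev_y)                             # v2
r_pred = math.sqrt(sum((p - mean_x) ** 2 for p in pred_x) / ss_x)     # v3
r_resid = math.sqrt(1.0 - ss_err / ss_x)                              # v4
r_sep = math.sqrt(1.0 - (sep / stdev_x) ** 2)                         # v5, v6
r_sums = ss_xy / math.sqrt(ss_x * ss_y)                               # v7

for label, value in [("v2", r_cov), ("v3", r_pred), ("v4", r_resid),
                     ("v6", r_sep), ("v7", r_sums)]:
    print(label, f"{value:.14f}")
```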

Comparison of Correlation Coefficient (r) and the Standard Deviation of Calibration Data:

Enter Data:

SEE = 0.1 (Data Manually Entered)

Sr = 0.1, 0.2 .. stdev(X)

CALCULATIONS:

stdev(X) = 4

SEE² = 0.01

$$r(Sr) = \sqrt{1 - \frac{SEE^2}{Sr^2}}$$

[Graphic 1A: r versus Sr of data range. Vertical axis: Correlation Coefficient, r(Sr), 0 to 1; horizontal axis: Sr, Standard Deviation of Range, 0 to 4.]

[Graphic 1B: r versus Sr of data range. Vertical axis: Correlation Coefficient, r(Sr), 0.99 to 0.999; horizontal axis: Sr, Standard Deviation of Range, 0 to 4.]

[Graphic 1C: r versus Sr of data range. Vertical axis: Correlation Coefficient, r(Sr), 0.86 to 0.98; horizontal axis: Sr, Standard Deviation of Range, 0.2 to 0.6.]

$$R2(Sr) = 1 - \frac{SEE^2}{Sr^2}$$

[Graphic 2: R2 versus Sr of data range. Vertical axis: Coefficient of Determination, R2(Sr), 0 to 1; horizontal axis: Sr, Standard Deviation of Range, 0 to 4.]

$$R(Sr) = \frac{Sr}{SEE}$$

[Graphic 3: r versus Sr/SEE. Vertical axis: Correlation Coefficient, r(Sr), 0.96 to 1; horizontal axis: R(Sr), Ratio of Sr/SEE, 0 to 40.]
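
The relationship plotted in Graphics 1 through 3 can be tabulated directly. The following Python sketch assumes the Worksheet relation r = sqrt(1 − SEE²/Sr²) with SEE held at 0.1; the function name and the particular Sr values chosen are illustrative assumptions.

```python
import math

def r_from_see(see, sr):
    """Correlation coefficient implied by an error SEE and a data range Sr."""
    return math.sqrt(1.0 - see ** 2 / sr ** 2)

SEE = 0.1
sr_values = [0.2, 0.3, 0.6, 1.0, 2.0, 4.0]  # illustrative range values

print("    Sr   Sr/SEE        r      R^2")
for sr in sr_values:
    r = r_from_see(SEE, sr)
    print(f"{sr:6.2f} {sr / SEE:8.1f} {r:8.5f} {r * r:8.5f}")
```

The table makes the point of the graphics numerically: with the error held fixed, the correlation coefficient is driven toward 1 simply by widening the range of the calibration data.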

Comparison of Correlation Coefficient (r) and SEE:

Enter Data:

Sr = stdev(X) (Data Manually Entered)

SEE = 0.1, 0.2 .. Sr

CALCULATIONS:

Sr = 4

$$r(SEE) = \sqrt{1 - \frac{SEE^2}{Sr^2}}$$

[Graphic 4: r versus SEE. Vertical axis: Correlation Coefficient, r(SEE), 0 to 1; horizontal axis: SEE, Standard Error of Estimate, 0 to 4.]

$$R(Sr) = \frac{Sr}{SEE}$$

[Graphic 5: r versus SEE/Sr. Vertical axis: Correlation Coefficient, r(SEE), 0 to 1; horizontal axis: R(SEE), Ratio of SEE/Sr, 0 to 1.]

Computing Confidence Limits for Correlation Coefficient (at selected confidence limits)

Enter Data:

ρ = 0.80

n = 21 (Minimum n = 5)

Enter Confidence level as α2:

α2 = 0.95

CALCULATIONS:

Calculate z-table value:

$$\alpha_1 = \frac{\alpha_2 + 1}{2}$$

z = qt(α1, 100000)

z-value: z = 1.96

$$Zn = 1.1513 \log\left(\frac{1 + \rho}{1 - \rho}\right) - z \cdot \frac{1}{\sqrt{n - 3}} \qquad Zp = 1.1513 \log\left(\frac{1 + \rho}{1 - \rho}\right) + z \cdot \frac{1}{\sqrt{n - 3}}$$

Zn = 0.6366

Zp = 1.5606

Table of Exact Values for ρ given Z(ρ), as Zp and Zn, at Specified Confidence Limit:

$$Z(\rho) = 1.1513 \log\left(\frac{1 + \rho}{1 - \rho}\right)$$

ρ = 0.00001, 0.00002 .. 2.50000

[Graphic 6a: z-statistic, Z(ρ), versus Correlation Coefficient, ρ, in the neighborhood of Z = 0.6366 and ρ = 0.562575.]

[Graphic 6b: z-statistic, Z(ρ), versus Correlation Coefficient, ρ, in the neighborhood of Z = 1.5606 and ρ = 0.91551.]

Correlation coefficient confidence limits estimates for selected confidence level are:

a = 0.772611892 · Zn^0.710540889

b = 0.76468768 · Zn^0.441013741

c = 0.864765533 · Zn^0.137899811

d = 0.772611892 · Zp^0.710540889

e = 0.76468768 · Zp^0.441013741

f = 0.864765533 · Zp^0.137899811

    LL = a      if 0.50 ≤ |Zn| < 1
         b      if 1 ≤ |Zn| < 1.5
         c      if 1.5 ≤ |Zn| ≤ 2.9
         1.000  if |c| ≥ 1

    UL = d      if 0.50 ≤ |Zp| < 1
         e      if 1 ≤ |Zp| < 1.5
         f      if 1.5 ≤ |Zp| ≤ 2.9
         1.000  if |f| ≥ 1

Correlation coefficient confidence limits estimated for selected confidence level are:

Lower Limit: LL = 0.56

Upper Limit: UL = 0.92

Testing Correlation for Different Size Populations

Are two correlation coefficients (r1 and r2) different based on a difference in the number of observations for each (N)?

Enter Data:

r1 = 0.97    N1 = 28

r2 = 0.99    N2 = 28

Enter Confidence level as α2:

α2 = 0.95

CALCULATIONS:

Calculate Test Statistic:

$$ZN = \frac{1.1513 \log\left(\frac{1 + r1}{1 - r1}\right) - 1.1513 \log\left(\frac{1 + r2}{1 - r2}\right)}{\sqrt{\frac{1}{N1 - 3} + \frac{1}{N2 - 3}}}$$

ZN = −1.95996

NOTE: If the absolute value of Z(N) is greater than the absolute value of the z-statistic (Normal Curve one-tailed) we reject the null hypothesis and state that there is a significant difference in r1 and r2 at the selected significance level.

Calculate the Z-statistic at selected confidence limit:

Calculate z-table value:

$$\alpha_1 = \frac{\alpha_2 + 1}{2}$$

z = qt(α1, 100000)

z-Value statistic: z = 1.96

The hypothesis test conclusion at the specified level of significance:

    Test = 1  if |ZN| < |z|
           0  otherwise

Test = 1

0 = reject hypothesis – there IS a significant difference

1 = accept hypothesis – there is NOT a significant difference

Confidence Limits for Slope and Intercept:

Enter Confidence level as α2:

α2 = 0.95

n = rows(X)

CALCULATIONS:

$$Sx = \sqrt{\frac{\sum (X - mean(X))^2}{n - 2}}$$

Slope and Intercept Calculations:

n = rows(X)    ΣX = 42    ΣY = 91.7    ΣX² = 364    Σ(X·Y) = 766.4

Slope:

$$k1 = \frac{n \cdot \sum (X \cdot Y) - \sum X \cdot \sum Y}{n \cdot \sum X^2 - \left(\sum X\right)^2}$$

k1 = 1.93035714285714

Miller and Miller, p. 109:

$$bX = \frac{\sum \left[(X - mean(X)) \cdot (Y - mean(Y))\right]}{\sum (X - mean(X))^2}$$

bX = 1.93035714285714

Intercept:

$$k0 = \frac{\sum X^2 \cdot \sum Y - \sum X \cdot \sum (X \cdot Y)}{n \cdot \sum X^2 - \left(\sum X\right)^2}$$

k0 = 1.51785714285715

Miller and Miller, p. 109:

aX = mean(Y) − bX · mean(X)

aX = 1.51785714285714

mean(X) = 6    mean(Y) = 13.1    bX = 1.93035714285714

Calculate t-table value:

$$\alpha_1 = \frac{\alpha_2 + 1}{2}$$

t = qt(α1, n − 2)

t-value statistic: t = 2.5706

Ye = k1 · X + k0

Standard Error of Estimate:

$$Syx = \sqrt{\frac{\sum (Y - Ye)^2}{n - 2}}$$

Syx = 0.4328

Slope Confidence Limits: Method 1

$$LLk1 = k1 - \left(\frac{t}{\sqrt{n - 2}} \cdot \sqrt{\frac{\sum (Y - Ye)^2}{\sum (X - mean(X))^2}}\right)$$

LLk1 = 1.82521966597124

$$ULk1 = k1 + \left(\frac{t}{\sqrt{n - 2}} \cdot \sqrt{\frac{\sum (Y - Ye)^2}{\sum (X - mean(X))^2}}\right)$$

ULk1 = 2.03549461974305

Slope Confidence Limits: Method 2

$$LL = k1 - \frac{t}{\sqrt{n - 2}} \cdot \frac{Syx}{Sx} \qquad UL = k1 + \frac{t}{\sqrt{n - 2}} \cdot \frac{Syx}{Sx}$$

Slope confidence limits at selected confidence level are:

Lower Limit: LL = 1.82521966597124

Upper Limit: UL = 2.03549461974305

Using Miller and Miller Formulas (pp. 100–111):

$$syx = \sqrt{\frac{\sum (Y - Ye)^2}{n - 2}}$$

syx = 0.433

$$sb = \frac{syx}{\sqrt{\sum (X - mean(X))^2}}$$

sb = 0.041

Csb = t · sb

Lower Limit: k1 − Csb = 1.82521966597124

Upper Limit: k1 + Csb = 2.03549461974305

Intercept confidence limits at selected confidence level are:

Method 1

$$LLk0 = k0 - t \cdot \sqrt{\frac{\sum (Y - Ye)^2 \cdot \sum X^2}{(n - 2) \cdot n \cdot \sum (X - mean(X))^2}}$$

LLk0 = 0.759700015087087

$$ULk0 = k0 + t \cdot \sqrt{\frac{\sum (Y - Ye)^2 \cdot \sum X^2}{(n - 2) \cdot n \cdot \sum (X - mean(X))^2}}$$

ULk0 = 2.27601427062721

Using Miller and Miller Formulas (pp. 100–111):

$$sa = syx \cdot \sqrt{\frac{\sum X^2}{n \cdot \sum (X - mean(X))^2}}$$

sa = 0.2949

Csa = t · sa

Lower Limit: k0 − Csa = 0.759700015087087

Upper Limit: k0 + Csa = 2.27601427062721

REFERENCES

b-1. Workman, J. and Mark, H., “Chemometrics in Spectroscopy: Comparison of Goodness of Fit Statistics for Linear Regression – Part 1, Introduction,” Spectroscopy 19(4), 32–35 (2004).
b-2. Workman, J. and Mark, H., “Chemometrics in Spectroscopy: Comparison of Goodness of Fit Statistics for Linear Regression – Part 2, The Correlation Coefficient,” Spectroscopy 19(6), 29–33 (2004).
b-3. Workman, J. and Mark, H., “Chemometrics in Spectroscopy: Comparison of Goodness of Fit Statistics for Linear Regression – Part 3, Computing Confidence Limits for the Correlation Coefficient,” Spectroscopy 19(7), 31–33 (2004).
b-4. Workman, J. and Mark, H., “Chemometrics in Spectroscopy: Comparison of Goodness of Fit Statistics for Linear Regression – Part 4, Confidence Limits for Slope and Intercept,” Spectroscopy 19(10), 30–31 (2004).

62 Correction and Discussion Regarding Derivatives

The previous Chapters 54 through 57 dealing with the analysis of derivatives of spectra were first published as [1–4]. It seems that, unfortunately, those columns contained some errors. Although those errors were corrected in Chapter 54, we wanted to include the thought process and comments that went into those corrections. One of the errors was caught early, and we were able to get the correction into the subsequent column [2]. Some of the others were not detected until some time had passed and various people had the opportunity (and time, and inclination) to check the equations in detail. Some of the errors were relatively minor (typographical errors in tables, for example), but some were substantive (and substantial). However, to get a complete set of corrections in one place, we here list all the errors found (and the corrections). Equation numbering follows that of the original chapter numbers and corresponding equations.

First, in going from equation 54-3 to equation 54-4 [1], when we factored the constants from the derivative we should have taken out 1/σ², whereas we factored out 1/σ. Therefore several equations from equation 54-4 on are off by a factor of σ. The correct equations are

$$\frac{dY}{dX} = \frac{1}{\sigma (2\pi)^{1/2}} e^{-\frac{1}{2}\left(\frac{X - \mu}{\sigma}\right)^2} \left(-\frac{1}{2\sigma^2}\right) \frac{d}{dX}(X - \mu)^2 \tag{54-4}$$

$$\frac{dY}{dX} = \frac{1}{\sigma (2\pi)^{1/2}} e^{-\frac{1}{2}\left(\frac{X - \mu}{\sigma}\right)^2} \left(-\frac{1}{2\sigma^2}\right) (2X - 2\mu) \tag{54-5}$$

$$\frac{dY}{dX} = \frac{-(X - \mu)}{\sigma^3 (2\pi)^{1/2}} e^{-\frac{1}{2}\left(\frac{X - \mu}{\sigma}\right)^2} \tag{54-6a}$$

$$\frac{dY}{dX} = \frac{-(X - \mu)}{\sigma^2} e^{-\frac{1}{2}\left(\frac{X - \mu}{\sigma}\right)^2} \tag{54-6b}$$

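Similarly, the correct equations for the second derivative of the Normal distribution are

$$\frac{d^2Y}{dX^2} = \frac{-(X - \mu)}{\sigma^3 (2\pi)^{1/2}} e^{-\frac{1}{2}\left(\frac{X - \mu}{\sigma}\right)^2} \frac{d}{dX}\left[-\frac{1}{2}\left(\frac{X - \mu}{\sigma}\right)^2\right] + e^{-\frac{1}{2}\left(\frac{X - \mu}{\sigma}\right)^2} \frac{d}{dX}\left[\frac{-(X - \mu)}{\sigma^3 (2\pi)^{1/2}}\right] \tag{54-7}$$

$$\frac{d^2Y}{dX^2} = \frac{-(X - \mu)}{\sigma^3 (2\pi)^{1/2}} e^{-\frac{1}{2}\left(\frac{X - \mu}{\sigma}\right)^2} \left(-\frac{1}{2} \cdot \frac{2X - 2\mu}{\sigma^2}\right) + e^{-\frac{1}{2}\left(\frac{X - \mu}{\sigma}\right)^2} \left(\frac{-1}{\sigma^3 (2\pi)^{1/2}}\right) \tag{54-8}$$

$$\frac{d^2Y}{dX^2} = \left(\frac{(X - \mu)^2}{\sigma^5 (2\pi)^{1/2}} - \frac{1}{\sigma^3 (2\pi)^{1/2}}\right) e^{-\frac{1}{2}\left(\frac{X - \mu}{\sigma}\right)^2} \tag{54-9a}$$

$$\frac{d^2Y}{dX^2} = \left(\frac{(X - \mu)^2}{\sigma^4} - \frac{1}{\sigma^2}\right) e^{-\frac{1}{2}\left(\frac{X - \mu}{\sigma}\right)^2} \tag{54-9b}$$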
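
Since the original errors were algebraic slips, a numerical cross-check is a useful safeguard. The following Python sketch compares equations 54-6a and 54-9a against finite-difference derivatives of the Normal distribution. The function names, the test point, and the step sizes are arbitrary assumptions chosen for illustration.

```python
import math

SQRT_2PI = math.sqrt(2.0 * math.pi)

def normal(x, mu, sigma):
    """Normal distribution with unit area."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * SQRT_2PI)

def normal_d1(x, mu, sigma):
    """First derivative, equation 54-6a."""
    return -(x - mu) / (sigma ** 3 * SQRT_2PI) * math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def normal_d2(x, mu, sigma):
    """Second derivative, equation 54-9a."""
    factor = (x - mu) ** 2 / (sigma ** 5 * SQRT_2PI) - 1.0 / (sigma ** 3 * SQRT_2PI)
    return factor * math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def central_d1(f, x, h=1e-4):
    """Central-difference estimate of the first derivative."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def central_d2(f, x, h=1e-3):
    """Central-difference estimate of the second derivative."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h ** 2

mu, sigma, x0 = 0.0, 2.0, 1.3   # arbitrary test values
f = lambda x: normal(x, mu, sigma)

err_d1 = abs(normal_d1(x0, mu, sigma) - central_d1(f, x0))
err_d2 = abs(normal_d2(x0, mu, sigma) - central_d2(f, x0))
print(f"first-derivative discrepancy:  {err_d1:.2e}")
print(f"second-derivative discrepancy: {err_d2:.2e}")
```

Both discrepancies should be at the level of the finite-difference error. Repeating the check with a formula that is off by a factor of σ produces an obvious mismatch for any σ other than 1.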

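Next, going from equation 54-10 to equation 54-11 for the Lorentzian distribution (in the same Chapter 54) there were a couple of errors, including a missed sign change and not correctly bringing σ² inside the brackets containing an expression that was itself squared. Again, all the subsequent equations derived from equation 54-11 were themselves then also in error. The corrected derivation follows. This time, we present the derivation in much smaller and more detailed steps than initially. In doing this, we give intermediate equations letters, so that the equations labeled with pure numbers correspond to the original equation with the same number, and can be compared with it:

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{-1}{\left[1 + \left(\frac{2(\mu - X)}{\sigma}\right)^2\right]^2} \times \frac{d}{dX}\left[1 + \left(\frac{2(\mu - X)}{\sigma}\right)^2\right] \tag{54-10}$$

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{-1}{\left[1 + \left(\frac{2(\mu - X)}{\sigma}\right)^2\right]^2} \times \left[0 + \frac{d}{dX}\left(\frac{2(\mu - X)}{\sigma}\right)^2\right] \tag{54-10a}$$

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{-1}{\left[1 + \left(\frac{2(\mu - X)}{\sigma}\right)^2\right]^2} \times 2 \times \frac{2(\mu - X)}{\sigma} \times \frac{d}{dX}\left(\frac{2(\mu - X)}{\sigma}\right) \tag{54-10b}$$

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{-1}{\left[1 + \left(\frac{2(\mu - X)}{\sigma}\right)^2\right]^2} \times \frac{4(\mu - X)}{\sigma} \times \frac{2}{\sigma} \times \frac{d}{dX}(\mu - X) \tag{54-10c}$$

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{-1}{\left[1 + \left(\frac{2(\mu - X)}{\sigma}\right)^2\right]^2} \times \frac{8(\mu - X)}{\sigma^2} \times (0 - 1) \tag{54-10d}$$

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{-1}{\left[1 + \left(\frac{2(\mu - X)}{\sigma}\right)^2\right]^2} \times \frac{-8(\mu - X)}{\sigma^2} \tag{54-10e}$$

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{8(\mu - X)}{\left[1 + \left(\frac{2(\mu - X)}{\sigma}\right)^2\right]^2 \times \sigma^2} \tag{54-10f}$$

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{8(\mu - X)}{\left\{\sigma\left[1 + \left(\frac{2(\mu - X)}{\sigma}\right)^2\right]\right\}^2} \tag{54-10g}$$

(Step 54-10g is where the error crept in previously; you can't be too careful.)

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{8(\mu - X)}{\left[\sigma + \sigma\left(\frac{2(\mu - X)}{\sigma}\right)^2\right]^2} \tag{54-10h}$$

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{8(\mu - X)}{\left[\sigma + \frac{4\sigma(\mu - X)^2}{\sigma^2}\right]^2} \tag{54-10i}$$

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{8(\mu - X)}{\left[\sigma + \frac{4(\mu - X)^2}{\sigma}\right]^2} \tag{54-10j}$$

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{\sigma^2}{\sigma^2} \times \frac{8(\mu - X)}{\left[\sigma + \frac{4(\mu - X)^2}{\sigma}\right]^2} \tag{54-10k}$$

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{8\sigma^2(\mu - X)}{\left[\sigma^2 + 4(\mu - X)^2\right]^2} \tag{54-11}$$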

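The error in equation 54-11 then propagated through to the rest of the equations for the Lorentzian distribution. The correct formulas are as follows:

$$\frac{d^2Y}{dX^2} = \frac{2}{\pi\sigma} \times \left\{\frac{\left[\sigma^2 + 4(\mu - X)^2\right]^2 \frac{d}{dX}\left[8\sigma^2(\mu - X)\right]}{\left[\sigma^2 + 4(\mu - X)^2\right]^4} - \frac{8\sigma^2(\mu - X) \frac{d}{dX}\left[\sigma^2 + 4(\mu - X)^2\right]^2}{\left[\sigma^2 + 4(\mu - X)^2\right]^4}\right\} \tag{54-12}$$

$$\frac{d^2Y}{dX^2} = \frac{2}{\pi\sigma} \times \left\{\frac{\left[\sigma^2 + 4(\mu - X)^2\right]^2 \times 8\sigma^2 \times \frac{d}{dX}(\mu - X)}{\left[\sigma^2 + 4(\mu - X)^2\right]^4} - \frac{8\sigma^2(\mu - X) \times 2\left[\sigma^2 + 4(\mu - X)^2\right] \frac{d}{dX}\left[\sigma^2 + 4(\mu - X)^2\right]}{\left[\sigma^2 + 4(\mu - X)^2\right]^4}\right\} \tag{54-12a}$$

$$\frac{d^2Y}{dX^2} = \frac{2}{\pi\sigma} \times \left\{\frac{\left[\sigma^2 + 4(\mu - X)^2\right]^2 \times 8\sigma^2 \times (0 - 1)}{\left[\sigma^2 + 4(\mu - X)^2\right]^4} - \frac{8\sigma^2(\mu - X) \times 2\left[\sigma^2 + 4(\mu - X)^2\right] \frac{d}{dX}\left[\sigma^2 + 4(\mu - X)^2\right]}{\left[\sigma^2 + 4(\mu - X)^2\right]^4}\right\} \tag{54-12b}$$

$$\frac{d^2Y}{dX^2} = \frac{2}{\pi\sigma} \times \left\{\frac{-8\sigma^2\left[\sigma^2 + 4(\mu - X)^2\right]^2}{\left[\sigma^2 + 4(\mu - X)^2\right]^4} - \frac{8\sigma^2(\mu - X) \times 2\left[\sigma^2 + 4(\mu - X)^2\right] \frac{d}{dX}\left[\sigma^2 + 4(\mu - X)^2\right]}{\left[\sigma^2 + 4(\mu - X)^2\right]^4}\right\} \tag{54-13}$$

$$\frac{d^2Y}{dX^2} = \frac{2}{\pi\sigma} \times \left(\frac{-8\sigma^2\left[\sigma^2 + 4(\mu - X)^2\right]^2}{\left[\sigma^2 + 4(\mu - X)^2\right]^4} - \frac{16\sigma^2(\mu - X)\left[\sigma^2 + 4(\mu - X)^2\right]\left(\frac{d}{dX}\sigma^2 + \frac{d}{dX}4(\mu - X)^2\right)}{\left[\sigma^2 + 4(\mu - X)^2\right]^4}\right) \tag{54-14}$$

$$\frac{d^2Y}{dX^2} = \frac{2}{\pi\sigma} \times \left(\frac{-8\sigma^2\left[\sigma^2 + 4(\mu - X)^2\right]^2}{\left[\sigma^2 + 4(\mu - X)^2\right]^4} - \frac{16\sigma^2(\mu - X)\left[\sigma^2 + 4(\mu - X)^2\right]\left(0 + 4 \times 2(\mu - X)\frac{d}{dX}(\mu - X)\right)}{\left[\sigma^2 + 4(\mu - X)^2\right]^4}\right) \tag{54-14a}$$

$$\frac{d^2Y}{dX^2} = \frac{2}{\pi\sigma} \times \left\{\frac{-8\sigma^2\left[\sigma^2 + 4(\mu - X)^2\right]^2}{\left[\sigma^2 + 4(\mu - X)^2\right]^4} - \frac{16\sigma^2(\mu - X)\left[\sigma^2 + 4(\mu - X)^2\right] 8(\mu - X)\frac{d}{dX}(\mu - X)}{\left[\sigma^2 + 4(\mu - X)^2\right]^4}\right\} \tag{54-14b}$$

$$\frac{d^2Y}{dX^2} = \frac{2}{\pi\sigma} \times \left\{\frac{-8\sigma^2\left[\sigma^2 + 4(\mu - X)^2\right]^2}{\left[\sigma^2 + 4(\mu - X)^2\right]^4} - \frac{16\sigma^2(\mu - X)\left[\sigma^2 + 4(\mu - X)^2\right] 8(\mu - X)(0 - 1)}{\left[\sigma^2 + 4(\mu - X)^2\right]^4}\right\} \tag{54-14c}$$

$$\frac{d^2Y}{dX^2} = \frac{2}{\pi\sigma} \times \left\{\frac{-8\sigma^2\left[\sigma^2 + 4(\mu - X)^2\right]^2}{\left[\sigma^2 + 4(\mu - X)^2\right]^4} - \frac{16\sigma^2(\mu - X)\left[\sigma^2 + 4(\mu - X)^2\right]\left(-8(\mu - X)\right)}{\left[\sigma^2 + 4(\mu - X)^2\right]^4}\right\} \tag{54-15}$$

$$\frac{d^2Y}{dX^2} = \frac{2}{\pi\sigma} \times \left\{\frac{-8\sigma^2\left[\sigma^2 + 4(\mu - X)^2\right]^2}{\left[\sigma^2 + 4(\mu - X)^2\right]^4} - \frac{-128\sigma^2(\mu - X)^2\left[\sigma^2 + 4(\mu - X)^2\right]}{\left[\sigma^2 + 4(\mu - X)^2\right]^4}\right\} \tag{54-15a}$$

$$\frac{d^2Y}{dX^2} = \frac{2}{\pi\sigma} \times \left\{\frac{-8\sigma^2\left[\sigma^2 + 4(\mu - X)^2\right]^2}{\left[\sigma^2 + 4(\mu - X)^2\right]^4} + \frac{128\sigma^2(\mu - X)^2\left[\sigma^2 + 4(\mu - X)^2\right]}{\left[\sigma^2 + 4(\mu - X)^2\right]^4}\right\} \tag{54-15b}$$

$$\frac{d^2Y}{dX^2} = \frac{2}{\pi\sigma} \times \left\{\frac{-8\sigma^2\left[\sigma^2 + 4(\mu - X)^2\right]}{\left[\sigma^2 + 4(\mu - X)^2\right]^3} + \frac{128\sigma^2(\mu - X)^2}{\left[\sigma^2 + 4(\mu - X)^2\right]^3}\right\} \tag{54-15c}$$

$$\frac{d^2Y}{dX^2} = \frac{16}{\pi\sigma} \times \left\{\frac{-\sigma^2\left[\sigma^2 + 4(\mu - X)^2\right]}{\left[\sigma^2 + 4(\mu - X)^2\right]^3} + \frac{16\sigma^2(\mu - X)^2}{\left[\sigma^2 + 4(\mu - X)^2\right]^3}\right\} \tag{54-15d}$$

$$\frac{d^2Y}{dX^2} = \frac{16}{\pi} \times \left\{\frac{-\sigma\left[\sigma^2 + 4(\mu - X)^2\right] + 16\sigma(\mu - X)^2}{\left[\sigma^2 + 4(\mu - X)^2\right]^3}\right\} \tag{54-15e}$$

$$\frac{d^2Y}{dX^2} = \frac{16}{\pi} \times \left\{\frac{-\sigma^3 - 4\sigma(\mu - X)^2 + 16\sigma(\mu - X)^2}{\left[\sigma^2 + 4(\mu - X)^2\right]^3}\right\} \tag{54-15f}$$

$$\frac{d^2Y}{dX^2} = \frac{16}{\pi} \times \left\{\frac{12\sigma(\mu - X)^2 - \sigma^3}{\left[\sigma^2 + 4(\mu - X)^2\right]^3}\right\} \tag{54-16}$$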

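This correction also propagates to equation 54-18 when we set (μ − X) equal to σ:

$$\frac{dY}{dX} = \frac{2}{\pi\sigma} \times \frac{8\sigma^2 \cdot \sigma}{\left(\sigma^2 + 4\sigma^2\right)^2} = \frac{2}{\pi\sigma} \times \frac{8\sigma^3}{\left(5\sigma^2\right)^2} = \frac{16}{25\pi\sigma^2} \tag{54-18}$$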
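
As with the Normal distribution, the corrected Lorentzian results can be checked numerically. The Python sketch below assumes the Lorentzian form implied by equation 54-10, Y = (2/πσ)/[1 + (2(μ − X)/σ)²], and compares equations 54-11 and 54-16 with finite-difference derivatives. The function names, test values, and step sizes are illustrative assumptions.

```python
import math

def lorentzian(x, mu, sigma):
    """Lorentzian band with unit area and full width sigma at half height."""
    return (2.0 / (math.pi * sigma)) / (1.0 + (2.0 * (mu - x) / sigma) ** 2)

def lorentzian_d1(x, mu, sigma):
    """First derivative, equation 54-11."""
    denom = (sigma ** 2 + 4.0 * (mu - x) ** 2) ** 2
    return (2.0 / (math.pi * sigma)) * 8.0 * sigma ** 2 * (mu - x) / denom

def lorentzian_d2(x, mu, sigma):
    """Second derivative, equation 54-16."""
    numer = 12.0 * sigma * (mu - x) ** 2 - sigma ** 3
    return (16.0 / math.pi) * numer / (sigma ** 2 + 4.0 * (mu - x) ** 2) ** 3

def central_d1(f, x, h=1e-4):
    """Central-difference estimate of the first derivative."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def central_d2(f, x, h=1e-3):
    """Central-difference estimate of the second derivative."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h ** 2

mu, sigma, x0 = 0.0, 10.0, 3.0   # arbitrary test values
f = lambda x: lorentzian(x, mu, sigma)

err_d1 = abs(lorentzian_d1(x0, mu, sigma) - central_d1(f, x0))
err_d2 = abs(lorentzian_d2(x0, mu, sigma) - central_d2(f, x0))
d1_at_sigma = lorentzian_d1(mu - sigma, mu, sigma)   # (mu - X) = sigma

print(f"first-derivative discrepancy:  {err_d1:.2e}")
print(f"second-derivative discrepancy: {err_d2:.2e}")
print(f"dY/dX at (mu - X) = sigma:     {d1_at_sigma:.6e}")
print(f"16 / (25 pi sigma^2):          {16.0 / (25.0 * math.pi * sigma ** 2):.6e}")
```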

Third, an error in evaluating the exponential in equation 54-19 led to the incorrect constant multiplier. The corrected expression is

$$\frac{d^2Y}{dX^2_{MAX}} = \left(\frac{(\mu - \mu)^2}{\sigma^5 (2\pi)^{1/2}} - \frac{1}{\sigma^3 (2\pi)^{1/2}}\right) e^{-\frac{1}{2}\left(\frac{\mu - \mu}{\sigma}\right)^2} = \left(0 - \frac{1}{\sigma^3 (2\pi)^{1/2}}\right) e^0 = \frac{-1}{\sigma^3 (2\pi)^{1/2}} \tag{54-19}$$

We see, therefore, that the derivative decreases with the third power of σ, the same rate as the derivative of the Normal distribution.

Next, the matrices in Chapter 56 [3] contain several erroneous entries. There are a number of sign errors, and some errors in values, mostly resulting from formatting problems in the manuscript. Here we present the corrected matrices for those. For equation 56-25, the fourth entry on the fourth line had a formatting problem; the correct value is 1588.

$$(M^T M)^{-1} = \begin{bmatrix} 0.333333 & 0 & -0.0476190 & 0 \\ 0 & 0.262566 & 0 & -0.0324074 \\ -0.0476190 & 0 & 0.01190476 & 0 \\ 0 & -0.0324074 & 0 & 0.00462962 \end{bmatrix} \tag{56-26}$$

$$(M^T M)^{-1} M^T = \begin{bmatrix} -0.095238 & 0.14285714 & 0.28571428 & 0.333333 & 0.28571428 & 0.14285714 & -0.0952381 \\ 0.087301 & -0.2658730 & -0.2301587 & 0 & 0.23015873 & 0.2658730 & -0.0873015 \\ 0.059523 & 0 & -0.0357143 & -0.047619 & -0.0357143 & 0 & 0.05952381 \\ -0.027777 & 0.0277777 & 0.02777777 & 0 & -0.0277777 & -0.0277777 & 0.02777777 \end{bmatrix} \tag{56-27}$$

$(M^T M)^{-1} M^T$ (corrected for scaling):

$$\begin{bmatrix} -0.095238 & 0.1428571 & 0.2857143 & 0.333333 & 0.285714 & 0.142857 & -0.095238 \\ 0.0873016 & -0.265873 & -0.230158 & 0 & 0.230158 & 0.265873 & -0.087301 \\ 0.1190476 & 0 & -0.071428 & -0.09523 & -0.071428 & 0 & 0.1190476 \\ -0.166666 & 0.166666 & 0.1666666 & 0 & -0.166666 & -0.166666 & 0.1666666 \end{bmatrix} \tag{56-28}$$
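
The corrected matrix entries can be regenerated independently. The Python sketch below is an illustration under stated assumptions: it takes M to be the 7 × 4 matrix whose rows are (1, x, x², x³) for x = −3 to 3, which reproduces the value 1588 quoted for equation 56-25 as well as the entries of the matrices above. It interprets the "corrected for scaling" step as multiplying the row for the k-th derivative by k!, an interpretation inferred from the tabulated values rather than stated in the text. Exact rational arithmetic is used so that round-off does not obscure the comparison.

```python
from fractions import Fraction
from math import factorial

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def inverse(a):
    """Gauss-Jordan inversion of a square matrix of Fractions."""
    n = len(a)
    aug = [list(row) + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

# Assumed design matrix: 7 points (x = -3..3), cubic polynomial
M = [[Fraction(x) ** p for p in range(4)] for x in range(-3, 4)]
MtM = matmul(transpose(M), M)
MtM_inv = inverse(MtM)
coeffs = matmul(MtM_inv, transpose(M))                        # (M^T M)^-1 M^T
scaled = [[c * factorial(k) for c in row] for k, row in enumerate(coeffs)]

for row in scaled:
    print("  ".join(f"{float(c):10.6f}" for c in row))
```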

The next (and final) item is, perhaps, not so much an error as a question of possible differences in interpretation of the results and meanings of some of the derivative computations presented. One of our respondents pointed out that the magnitudes of the various derivatives, and especially the relative magnitudes of derivatives of different orders, depend on the units used, particularly the units used to describe the X-axis. Now, while in fact we did not specify any units in our discussion (see, e.g., Figure 54-1 in Chapter 54 [1], where the X-axis contains only the label “Wavelength”), given our backgrounds, it is true enough that we implicitly had nanometers in mind for our X-units. In the case of real spectra, however, if spectra were measured using, say, microns as the units for the X-axis, the same spectrum would have a calculated value for the first derivative that was 1000 times what would be calculated for a “nm-based” derivative. In that case, the first derivative (for a 10 nm wide band, which would be a 0.01 micron wide band) would be 100 times greater than the maximum spectral value, rather than 1/10 of it, which is what the value computed using nanometers for the X-scale came out to. The second derivative would then be 10^6 times what we calculated and therefore 10,000 times greater than the maximum spectral value, instead of being 1/100 of it, the value we showed. In principle this is all correct. In practice, however, if we ignore FTIR and specialty technologies such as AOTF, then the vast majority of instruments in use today for modern NIR spectroscopy (still primarily diffraction grating based instruments) use nanometers as their wavelength unit, and usually collect data at some small integer number of nanometers. Furthermore, the vast majority of those have a 10 nm bandpass, so that 10 nm is the minimum bandwidth that would be measured. Also, even for instruments with higher resolution, the natural bandwidths of many, or even most, absorbance bands of materials that are commonly measured are greater than 10 nm in the NIR. 
Given all this, the use of a 10 nm figure to represent a “typical” NIR absorbance band is not unrealistic, and gives the reader a realistic assessment of what a “typical” user can expect from the NIR spectra he measures, and their derivatives. The choice of units, of course, does not affect the instrumental characteristic of signal-to-noise, which is what is important, and which we discuss in part IV of the sub-series [4]. If we consider FTIR instrumentation, then the situation is trickier, since the equivalent resolution in nm varies across the spectrum. But even keeping the spectrum in its “natural” wavenumber units, we again find that except for rotational fine structure of gases, the natural bandwidth of many (most) absorbance bands is greater than 10 wavenumbers. So again, using that figure shows the “typical” user how he can expect his own measured spectra to behave. We thank Todd Sauke, Peter Watson and (again) Colin Christy for pointing out the errors and for general comments and discussion.

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 18(4), 32–37 (2003).
2. Mark, H. and Workman, J., Spectroscopy 18(9), 25–28 (2003).
3. Mark, H. and Workman, J., Spectroscopy 18(12), 106–111 (2003).
4. Mark, H. and Workman, J., Spectroscopy 19(1), 44–51 (2004).


63 Linearity in Calibration: Act III Scene I – Importance of Nonlinearity

Here we go again. We seem to keep coming up with the same themes. There are two reasons for that: first, there is so much to say, and second, the format of these chapters, which is an open-ended discussion of all manner of things chemometric, gives us the opportunity to expand on a topic to any extent we consider necessary and desirable, sometimes after having discussed it in lesser detail previously, or not having discussed a particular aspect. Having previously discussed linearity in Chapters 27 and 29–33 to a considerable extent [1–6], you might think that there was little more to say. Hardly! In this chapter we will discuss where linearity considerations fit into the larger scheme of calibration theory, then we will discuss methods of testing data for linearity (or, more accurately, nonlinearity) and what can be done about it. This is not the first time we have addressed nonlinearity. In fact, the first time either of us addressed the issue was quite a long time ago, although from a purely qualitative point of view [7]. More recently others, particularly in the NIR community, have been starting to take an interest as well, mainly from the point of view of detecting nonlinearity in the data. Chuck Miller described some of the sources of nonlinearity in an article in NIR News [8]. Our good colleague and friend Tom Fearn, who writes a column in the British journal NIR News, has recently tackled this somewhat thorny topic [9]. A bit farther back, Tony Davies also addressed this topic, although in a more general context [10].

WHY IS NONLINEARITY IMPORTANT? Discussions dealing with quantitative spectroscopic analysis often list many sources of error. This is particularly true in the case of NIR analysis, where the error sources are often categorized into subheadings, such as errors due to the instrument (e.g., noise, drift, etc.), errors due to the sample (inhomogeneity, etc.), errors due to chemistry/physics (interactions, etc.) and data handling (outliers, intercorrelation, etc.). Indeed, we have often done this ourselves. Breaking down the error sources into the smallest pieces that contribute to the total error of the analysis and categorizing them is an exercise of great importance, since it is only through identifying and classifying errors this way can we devise methods to reduce them and so improve our analyses. However, for our current purposes we want to approach the situation somewhat differently. What we want to do here is to consider that, after all the samples are prepared, after all the experiments are performed, after all the data is collected, what we end up with is a table (or maybe more than one table) of numbers – even if that table exists only in a computer memory somewhere. Everything that affects the data, for good or bad, is

reflected one way or another in that table. All the dozens of individual effects that are described in the more detailed tables of error sources are, in the end, only effective by the way they are manifested in the spectrum and therefore in the spectral data. Therefore, everything that affects the performance of our spectroscopic analyses can be distilled into the effect that it has on the data, and the effects that are manifested in the calibration results. There are surprisingly few of these, if considered generally enough. This is essentially the opposite of the detailed breakdowns described above; it is the lumping together of effects into a very few categories. While some may disagree, all the effects described in the detailed listings can be classified into one of the following categories, and shown to be manifested in the data through one (or more) of these characteristics (or at least, this is one way to categorize them):

(1) Characteristics that act on the X data or the Y data alone:
    a. Random error
    b. Drift & other systematic error
(2) Characteristics that affect the relationship between X and Y:
    a. Poor choice of algorithm and/or data transformation
    b. Incorrect choice of factors/wavelengths
    c. Nonlinearity.

As indicated, the first two items on this condensed listing include those aspects of measurement that contribute error to measurements of the spectral data or of the constituent information, while the last group includes all those aspects that affect the relationships between them. From this list we see that nonlinearity is one of the fundamental limiting characteristics that makes it through this (rather brutal) screening process. For a long time, however, the contribution of nonlinearity to the error of spectroscopic calibrations was not generally recognized by the spectroscopic (or the chemometric) community.
Much attention was given to issues of random noise, choice of factors (for PCR and PLS calibration) and wavelengths (for MLR calibrations) and investigations into the “best” data transform. Innumerable papers were written, and presentations were made concerning empirical methods of trying to improve calibration performance, but to a large extent they only addressed these three characteristics. These efforts could largely be summarized by the following template that can be applied to specific cases by replacing the terms in angle brackets with the specific term used in a given paper: “Calibration for

Linearity in Calibration: Act III Scene I

methods used (i.e., those that have obtained the approval of the regulatory agency, a term which essentially means the FDA, in this writing) are mostly inherently univariate. There are publications available that provide the official specifications for characteristics that an analytical method must meet in order to be accepted by the regulatory agency; these specifications are all designed to accommodate the characteristics of the univariate methods. The US Pharmacopoeia provides the official specifications for the United States, and the FDA requires that all analytical methods used for products under their supervision meet those specifications. Other countries have equivalent agencies. In order to reduce the burden on the many pharmaceutical companies that are international in scope, there exists an organization called the International Conference on Harmonisation (ICH) that advises individual countries’ agencies with a view toward having uniform requirements. (We are grateful to Gary Ritchie for verifying the accuracy of statements regarding the structure, mechanisms and meaning of the regulatory processes (G. Ritchie, 2002, personal communication).) The FDA is very conservative, and for good reason. And we, at least, are very glad of that whenever we go to the drug store to buy some antibiotics, painkillers, anticholesterol drugs or any other medicine. Reading the required specifications for analytical methods makes it abundantly clear that they were written with univariate analytical methods in mind. The conservatism of the regulatory agencies means that it will be difficult to make the sweeping changes that we might like to see happen, changes that would permit NIR and other analytical methods based on multivariate methods of analysis to be used. Nevertheless, by the time you read this chapter, the FDA will have convened several meetings of interested scientists, to advise them on whether, and how, these methods can become approved.
But in order to understand what needs to be changed, we first need to understand the current situation. In order for a pharmaceutical company to use any analytical method for certifying the properties (efficacy, potency, etc.) of their products, the analytical method has to be validated. “Validation”, in the parlance of the FDA, is a far cry from what we usually call “validation” when developing a multivariate spectroscopic method. In fact, what we call “validation” in spectroscopic calibration (which usually means calculating an SEP, or an SECV) is a far cry from the dictionary definition of “validate”, which is “to make legally valid”, where “valid” is defined as “having legal efficacy or force” [11]. The meaning of “validation” as used by the FDA is much closer to the dictionary definition (not surprising, since the FDA is an entity very much concerned with the legal as well as the technical issues concerning validation of analytical methods) than it is to the spectroscopic concept of validation, but still differs considerably even from that. While still very general, the FDA’s definition of “validation” is much more specific than the dictionary definition. The bottom line of the FDA meaning of “validation” is essentially to thoroughly demonstrate scientifically (meaning: to “prove” in a manner that is both scientifically and legally defensible) that the method is “suitable for its intended purpose”. In the world of the FDA, anything having to do with the manufacture of pharmaceutical products (equipment, chemicals, processes, etc.) must be validated in the described sense, including the analytical methods used for testing them. When developing an analytical method to meet the requirements of being validatable, the burden is on the developer of the method to show that it is, in fact, “suitable for its intended purpose”. The Pharmacopoeia and ICH specifications include a “laundry

424

Chemometrics in Spectroscopy

list” of characteristics or “validation parameters” that must be tested. In this chapter we are not going to discuss the general topic of validating an analytical method for FDA approval; among other reasons, not all of these parameters fall under the umbrella of “chemometrics in spectroscopy”. We are only interested in the more limited topic of nonlinearity; therefore it suffices for us to simply point out that one of the parameters that must be tested and demonstrated for an analytical method is its linearity. The burden is on the developer of a method to demonstrate linearity between the response of the method and the concentration of the analyte that the method purports to measure. What does that mean? Any analytical method, whether based on wet chemistry, chromatography, or spectroscopy (or other technology: electrochemistry, for example) provides, as its final, ultimate output, a number. This number, which we claim represents the amount of the analyte in the sample (whether that is a concentration, total amount, or some other characteristic), we can call the response of the method to the analyte. The guidelines provide variant descriptions of the meaning of the term “linearity”. One definition is, “ability (within a given range) to obtain test results which are directly proportional to the concentration (amount) of analyte in the sample” [12]. This is an extremely strict definition, one which in practice would be unattainable when noise and error are taken into account. Figure 63-1a schematically illustrates the problem. While there is a line that meets the criterion that “test results are directly proportional to the concentration of analyte in the sample”, none of the data points fall on that line; therefore, in the strictest sense of the phrase, none of the data representing the test results can be said to be proportional to the analyte concentration.
In the face of nonlinearity of response, there are systematic departures from the line as well as random departures, but in neither case is any data point strictly proportional to the concentration. Less strict descriptions of linearity are also provided. One recommendation is visual examination of a plot (unspecified, but presumably also of the method response versus the analyte concentration). Another recommendation is to use “statistical methods”; calculation of a regression line is advised. If regression is performed, the correlation

Figure 63-1 Linear and nonlinear data; both panels plot test results against analyte concentration. Figure 63-1a: Even when the overall trend of the data is to follow a straight line, none of the data points meet the strict criterion of having the test results strictly proportional to the analyte concentration. Figure 63-1b shows that for nonlinear data there are systematic departures from the straight line as well as random departures.

coefficient, slope, y-intercept and residual sum of squares are to be reported. These requirements are all in keeping with their background of being applied to univariate methods of analysis. There is no indication given as to how these quantities are to be related to linearity, only that they be reported. The recommendations all have difficulties. In the first place, there is a specification that a minimum of five concentrations are to be used. However, reflecting the background of the guidelines in a world of univariate analyses, the different concentrations are to be created using dilution techniques. This method of creating samples is generally unsuitable for spectroscopic (especially NIR) analysis. Visual examination of the plot is fraught with possible errors of interpretation. Since visual examination of a plot is inherently subjective, different analysts might come to different conclusions from the same data plot. The recommended statistical quantities to be reported from the regression analysis have nothing to do with linearity (or much of anything else, for that matter). R² is rather strongly recommended, but the problem with using R² to assess linearity was nicely illustrated by Tom Fearn [13], who showed that random error can cause linear data to have a lower value of R² than nonlinear data with less random error, making the test actively misleading. Furthermore, there is a problem with all the statistics mentioned. This problem is demonstrated by the work of Anscombe [14] in a fascinating paper that everyone doing any sort of statistical calibration work should read. Anscombe’s work was also the basis of a more recent paper dealing with how misunderstanding the statistics can cause someone to become misled [15]. We will not repeat Anscombe’s presentation, but we will describe what he did, and strongly recommend that the original paper be obtained and perused (or alternatively, the paper by Fearn [15]).
In his classic paper, Anscombe provides four sets of (synthetic, to be sure) univariate data, with obviously different characteristics. The data are arranged so as to permit univariate regression to be applied to each set. The defining characteristic of one of the sets is severe nonlinearity. But when you do the regression calculations, all four sets of data are found to have identical calibration statistics: the slope, y-intercept, SEE, R², F-test and residual sum of squares are the same for all four sets of data. Since the numeric values that are calculated are the same for all data sets, it is clearly impossible to use these numeric values to identify any of the characteristics that make each set unique. In the case that is of interest to us, those statistics provide no clue as to the presence or absence of nonlinearity. So the fact of the matter is that the reason the recommended statistics do not tell us about linearity is that, as Anscombe shows, they cannot tell us about linearity. In fact, the recommendations in the official guidelines, while well-intended, are themselves not suitable for their intended purpose in this regard, not even for univariate methods of analysis. For starters, they do not provide a good definition of linearity that can be used as the basis for deciding whether a given set conforms to the desired criterion of being linear. Therefore, let us start by proposing a definition, one that can at least serve as a basis for our own discussions. Let us define linearity as “The property of data comparing test results to actual concentrations, such that a straight line provides as good a fit (using the least-squares criterion) as any other mathematical function.” We continue in our next chapter with a discussion of using the Durbin-Watson Statistic for testing for nonlinearity.
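Anscombe’s observation can be checked numerically. The following is a minimal sketch in Python, offered as an illustration rather than as part of Anscombe’s paper; the function name `line_fit_stats` and the dictionary layout are hypothetical choices made for this example, and the data values are those of Anscombe’s published quartet [14]. It fits an ordinary least-squares line to each of the four sets and prints the usual summary statistics, which agree to about two or three significant figures.

```python
import math

# Anscombe's four synthetic data sets (The American Statistician, 1973).
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def line_fit_stats(x, y):
    """Least-squares straight-line fit; returns the common summary statistics."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return {
        "slope": slope,
        "intercept": intercept,
        "r2": 1.0 - rss / syy,          # coefficient of determination
        "see": math.sqrt(rss / (n - 2)),  # standard error of estimate
        "rss": rss,                      # residual sum of squares
    }


STATS = {name: line_fit_stats(x, y) for name, (x, y) in QUARTET.items()}
for name, s in STATS.items():
    print(name, {key: round(value, 3) for key, value in s.items()})
```

Nothing in the printed numbers distinguishes the severely nonlinear set (II) from the well-behaved one (I); only a plot of the data, or a statistic that is sensitive to the ordering of the residuals, can do that.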

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 13(6), 19–21 (1998).
2. Mark, H. and Workman, J., Spectroscopy 13(11), 18–21 (1998).
3. Mark, H. and Workman, J., Spectroscopy 14(1), 16–17 (1998).
4. Mark, H. and Workman, J., Spectroscopy 14(2), 16–27, 80 (1999).
5. Mark, H. and Workman, J., Spectroscopy 14(5), 12–14 (1999).
6. Mark, H. and Workman, J., Spectroscopy 14(6), 12–14 (1999).
7. Mark, H., Applied Spectroscopy 42(5), 832–844 (1988).
8. Miller, C.E., NIR News 4(6), 3–5 (1999).
9. Fearn, T., NIR News 12(6), 14–15 (2001).
10. Davies, T., Spectroscopy Europe 10(4), 28–31 (1998).
11. Webster’s Seventh New Collegiate Dictionary (G. & C. Merriam Co., Springfield, MA, 1970).
12. ICH-Q2A, Food and Drug Administration, March 1, 1995.
13. Fearn, T., NIR News 11(1), 14–15 (2000).
14. Anscombe, F.J., The American Statistician 27, 17–21 (1973).
15. Fearn, T., NIR News 7(1), 3, 5 (1996).

64

Linearity in Calibration: Act III Scene II – A Discussion of the Durbin-Watson Statistic, a Step in the Right Direction

As we left off in Chapter 63, we had proposed a definition of linearity. Now let us start by delving into the ins and outs of the Durbin-Watson statistic [1–6] and looking at how to use it to test for nonlinearity. In fact, we have talked about the Durbin-Watson statistic in previous chapters, although a long time ago and under a different name. Quite a while ago we published a column titled “Alternative Ways to Calculate Standard Deviation” [7]. One of the alternative ways described was the calculation by Successive Differences. As we shall see, that calculation is very closely related indeed to the Durbin-Watson Statistic. More recently we described this statistic (more directly named) in a sidebar to an article in the American Pharmaceutical Review [8]. To relate the Durbin-Watson Statistic to our current concerns, we go back to the basics of statistical analysis and remind ourselves how statisticians think about Statistics. Here we get into the deep thickets of statistical theory and meaning and philosophy. We will try to keep it as simple as possible, though. Let us start with two of the formulas for Standard Deviation presented in earlier chapters and columns [7]. One of the formulas is the “ordinary” formula for standard deviation:

$$SD_1 = \sqrt{\frac{\sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2}{n - 1}} \tag{64-1}$$

The other formula is the formula for calculating Standard Deviation by Successive Differences:

$$SD_2 = \sqrt{\frac{\sum_{i=1}^{n-1} \left( X_{i+1} - X_i \right)^2}{2(n - 1)}} \tag{64-2}$$

Now we ask ourselves the question: “If we calculate the standard deviation for a set of data (or errors) from these two formulas, will they give us the same answer?” And the answer to that question is that they will, IF (that’s a very big “if”, you see) the data and the errors have the characteristics that statisticians consider “good” statistical properties: random, independent (uncorrelated), constant variance, and in this case, a Normal distribution, and for errors, a mean of zero, as well.
For a set of data that meets all these criteria, we can expect the two computations to produce the same answer (within the limits of what is sometimes loosely called “Statistical variability”).
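The agreement between the two formulas, and its failure when the data are not independent, can be illustrated with a short simulation. This is a minimal sketch under stated assumptions: the function names `sd_ordinary` and `sd_successive`, the sample size, the random seed and the sine-shaped trend are all illustrative choices, not taken from the earlier columns. The first data set is pure Normal noise; the second is the same noise riding on a slow curved trend, so that adjacent values are correlated.

```python
import math
import random


def sd_ordinary(x):
    """Equation 64-1: the usual standard deviation about the mean."""
    n = len(x)
    mean = sum(x) / n
    return math.sqrt(sum((xi - mean) ** 2 for xi in x) / (n - 1))


def sd_successive(x):
    """Equation 64-2: standard deviation from successive differences."""
    n = len(x)
    squared_diffs = sum((b - a) ** 2 for a, b in zip(x, x[1:]))
    return math.sqrt(squared_diffs / (2 * (n - 1)))


rng = random.Random(0)  # fixed seed so the run is repeatable
noise = [rng.gauss(0.0, 1.0) for _ in range(5000)]

# Same noise plus a slow systematic trend: no longer independent values.
trended = [e + 5.0 * math.sin(math.pi * i / len(noise)) for i, e in enumerate(noise)]

ratio_random = sd_successive(noise) / sd_ordinary(noise)
ratio_trended = sd_successive(trended) / sd_ordinary(trended)

print(f"SD2/SD1, independent noise: {ratio_random:.3f}")
print(f"SD2/SD1, noise plus trend:  {ratio_trended:.3f}")
```

For the independent noise the ratio comes out close to unity; for the trended data the successive-difference estimate still reflects mainly the noise, while the ordinary estimate also picks up the systematic variation, so the ratio drops well below one.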

So under conditions where we expect the same answer from both computations, we expect the ratio of the computations to equal 1 (unity). Basically, this is a general description of how statisticians think about problems: compare the results of two computations of what is nominally the same quantity when all conditions meet the specified assumptions. Then if the comparison fails, this constitutes evidence that something about the data is not conforming to the expected characteristic (i.e., is not random, is correlated, is heteroscedastic, is not Normal, etc.). The Durbin-Watson statistic is that type of computation, stripped to its barest essentials. Dividing the square of equation 64-2 by the square of equation 64-1, canceling similar terms, noting that the mean error is zero and ignoring the constant factor of 2, we arrive at

$$DW = \frac{\sum_{i=1}^{n-1} \left( e_{i+1} - e_i \right)^2}{\sum_{i=1}^{n} e_i^2} \tag{64-3}$$

where the e_i are the errors (residuals). Because of the way it is calculated, particularly the way the constant factor is ignored, the expected value of DW is two when the data does in fact meet all the specified criteria: random, independent errors, etc. Nonlinearity will cause the computed value of DW to be statistically significantly less than two. (Homework assignment for the reader: what characteristic will make DW be statistically significantly greater than two?) Figures 64-1 and 64-2 illustrate graphically what happens when you inspect the residuals from a calibration. When you plot linear data, the data are evenly spread out around the calibration line as shown in Figure 64-1a. When plotting the residuals, the line representing the calibration line is brought into coincidence with the X-axis, so that the residuals are evenly spread out around the X-axis, as shown in Figure 64-1b. For nonlinear data, shown in Figure 64-2a, a plot of the residuals shows that although the calibration line still coincides with the X-axis, the data does not follow that line.
Therefore, although the residuals still have equal positive and negative values, they are no longer spread out evenly around the zero line because the actual function is no longer a straight line. Instead the residuals are evenly spread out around some hypothetical curved line (shown) representing the actual (nonlinear) function describing the data. In both the linear and the nonlinear cases the total variation of the residuals is the sum of the random error, plus the departure from linearity. When the data is linear, the variance due to the departure from linearity is effectively zero. For a nonlinear set of data, since the X-difference between adjacent data points is small, the nonlinearity of the function makes minimal contribution to the total difference between adjacent residuals, and most of the difference contributing to the successive differences in the numerator of the DW calculation is due to the random noise of the data. The denominator term, on the other hand, is dependent almost entirely on the systematic variation due to the curvature, and for nonlinear data this is much larger than the random noise contribution. Therefore the denominator variance of the residuals is much larger than the numerator variance when nonlinearity is present, and the Durbin-Watson statistic reflects this by assuming a value less than 2. The problem we all have is that we all want answers to be in clear, unambiguous terms: yes/no, black/white, is/isn’t linear, and so on, while Statistics deals in probabilities. It is certainly true that there is no single statistic (not SEE, not R², not DW, nor any other) that is going to answer the question of whether a given set of data, or residuals, has a linear relation. If we wanted to be REALLY ornery, we could even argue that “linearity” is, as with most mathematical concepts, an idealization of a property that

Figure 64-1 A graphic illustration of the behavior of linear data. Figure 64-1a (test value versus concentration): linear data spread out around a straight line. Figure 64-1b (residual versus concentration): the residuals are spread evenly around zero.

NEVER exists in real data. But that is not productive, and does not address the real-world issues that confront us. What are some of these real-world issues? Well, you might want to check out the following paper: Anscombe, F.J., “Graphs in Statistical Analysis” [9]. I will describe his results again, but it really is worth getting hold of and reading the original paper anyway; it is quite an eye-opener. What Anscombe presents are four sets of synthetic data, representing four simple (single X-variable) regression situations. One of the data sets represents a reasonably well-behaved set of data: the data are uniformly distributed along the X-axis, the errors are random, independent and Normally distributed, and in all respects the set has all the properties that statisticians consider “good”. The other three sets show very gross departures, of varying kinds (including one that is severely nonlinear),

Figure 64-2 A graphic illustration of the behavior of nonlinear data. Figure 64-2a (test value versus concentration): nonlinear data does not surround a straight line evenly. Figure 64-2b (residual plot, with the horizontal axis labeled “Wavelength” and annotations marking the operative difference for the numerator and the operative difference for the denominator): the residuals from nonlinear data are not spread out around zero.

from this well-behaved data set. So what is the big deal about that? The big deal is that, by design, all four sets of data have identical values of all the common regression statistics: coefficients, SEE, R², and so on. The intent is, of course, to show that no set of statistics can unambiguously diagnose all possible problems in all situations. When you look at the graphs of the four data sets, on the other hand, it is immediately clear which is the “good” one and which ones have the problems, and what the problems are. Any statistician worth his salt will tell you that if you are doing calibration work, you should examine the residual plots, and any others that might be informative.

But the FDA/ICH guidelines do not promote that approach, even though examination of plots is mentioned. To the contrary, they emphasize calculating and submitting the numerical results from the line fitting process. Under ordinary circumstances, that is really not too bad, as long as you understand what it is you are doing, which usually means going back to basic statistical theory. This theory says that IF data meets certain criteria, criteria that (always) include the requirement that the errors are random and independent, and (usually) Normally distributed, then certain calculations can be done and PROBABILISTIC statements made about the results of those calculations. If you make the calculation and the value turns out to be one of low probability, then that is taken as evidence that your data fail to meet one or more of the criteria that they are assumed to meet. Note that the calculation alone does not tell you which criterion is not met; the criterion that it does not meet may or may not be the one you are concerned with. The converse, however, is, strictly speaking, not true. If your calculated result turns out to be a high-probability value, that does NOT “prove” that the data meet the criteria. That is what Anscombe’s paper is demonstrating, because there is a (natural) tendency to forget that point, and assume that a “good” statistic means “good” data. So where does that leave us? Does it mean that statistics are useless, or that the FDA is clueless? No, but it means that all these things have to be done with an eye to knowing what can go wrong. I strongly suspect that the FDA has taken the position it does because it has found that, even though numerical statistics are not perfect, they provide an objective measure of calibration performance, and it has found through hard experience that the subjective interpretation of graphs is even more fraught with problems than the use of admittedly imperfect statistics.
For similar reasons, the statement “If the Durbin-Watson test demonstrates a correlation, then the relationship between the two assays is not linear” is not exactly correct, either. Under some circumstances, a linear correlation can also give rise to a statistically significant value of DW. In fact, for any statistic, it is always possible to construct a data set that gives a high-probability value for the statistic, yet the data clearly and obviously fail to meet the pertinent criteria (again, Anscombe is a good example of this for a few common statistics). So what should we do? Well, different statistics show different sensitivities to particular departures from the ideal, and this is where DW comes in. The key to calculating the Durbin-Watson statistic is that prior to performing the calculation, the data must be put into a suitable order. The Durbin-Watson statistic is then sensitive to serial correlations of the ordered data. While serial correlation is often thought of in connection with time series, that is only one of its applications. Draper and Smith [1] discuss the application of DW to the analysis of residuals from a calibration; their discussion is based on the fundamental work of Durbin and Watson, in the references listed at the beginning of this chapter. While we cannot reproduce their entire discussion here, at the heart of it is the fact that there are many kinds of serial correlation, including linear, quadratic and higher order. As Draper and Smith show (on p. 64), the linear correlation between the residuals from the calibration data and the predicted values from that calibration model is zero. Therefore if the sample data is ordered according to the analyte values predicted from the calibration model, a statistically significant value of the Durbin-Watson statistic for the residuals is indicative of higher-order serial correlation, that is, nonlinearity.
Draper and Smith point out that you need a minimum of fifteen samples in order to get meaningful results from the calculation of the Durbin-Watson statistic [1]. Since the

Anscombe data set contains only eleven readings, statistically meaningful statements cannot be made; nevertheless, it is interesting to see the results of the Durbin-Watson statistic applied to the nonlinear set of Anscombe data: the value of the statistic is 1.5073. For comparison, the Durbin-Watson statistic for the data set representing normal “good” data is 2.4816. Is DW perfect? Not at all. The way it is calculated, the highest-probability value (the “expected” value) for DW is, as we saw above, 2. Yet it is possible to construct a data set that has a DW value of 2, and is clearly and obviously not linear, as well as being non-random. That data set is

0 1 0 −1 0 1 0 −1 0 1 0 −1 0 1 0 −1 0     (Data set 1)

But for ordinary data, we would not expect such a sequence to happen. This is the reason most statistics work as general indicators of data performance: the special cases that cause them to fail are themselves low-probability occurrences. In this case the problem is not whether or not the data are nonlinear; the problem is that they are nonrandom. This is a perfect example of the data failing to meet a criterion other than the one you are concerned with. Therefore the Durbin-Watson test fails, as would any statistical test for such data; they are simply not amenable to meaningful statistical calculations. Nevertheless, a “blind” computation of the Durbin-Watson statistic would give an apparently satisfactory value. This is a warning that other characteristics of the data can cause it to appear to meet the criteria. And you have to know what CAN occur. But the mechanics of calculating DW for testing linearity are relatively simple, once you have gone through all the above: sort the data set according to the values predicted from the calibration model, then do the calculation specified in Equation 64-3. Note that, while the sorting is done using the predicted values from the model, the DW calculations are done using the residuals.
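The mechanics just described can be sketched in a few lines of Python. This is a hedged illustration, not the algorithm as given by Draper and Smith: the function names `durbin_watson` and `dw_linearity_check`, the simulated calibration data, the noise level and the size of the quadratic term are all assumptions made for the example. The sketch fits a straight line, sorts the residuals by predicted value, and applies Equation 64-3; it also confirms the value of DW for Data set 1.

```python
import random


def durbin_watson(residuals):
    """Equation 64-3: squared successive differences over the sum of squares."""
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


def dw_linearity_check(x, y):
    """Fit y = intercept + slope*x, order residuals by predicted value, return DW."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    intercept = mean_y - slope * mean_x
    predicted = [intercept + slope * xi for xi in x]
    residuals = [yi - pi for yi, pi in zip(y, predicted)]
    # The sorting key is the predicted value; the statistic uses the residuals.
    ordered = [e for _, e in sorted(zip(predicted, residuals), key=lambda pair: pair[0])]
    return durbin_watson(ordered)


# Simulated calibration data (illustrative values only).
rng = random.Random(1)
x = [0.1 * i for i in range(60)]
noise = [rng.gauss(0.0, 0.1) for _ in x]
linear_y = [2.0 + xi + ei for xi, ei in zip(x, noise)]
curved_y = [yi - 0.08 * xi ** 2 for xi, yi in zip(x, linear_y)]  # adds curvature

dw_linear = dw_linearity_check(x, linear_y)
dw_curved = dw_linearity_check(x, curved_y)
print(f"DW, linear data:    {dw_linear:.3f}")
print(f"DW, nonlinear data: {dw_curved:.3f}")

# Data set 1 from the text, treated directly as a sequence of residuals.
DATA_SET_1 = [0, 1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0]
dw_set_1 = durbin_watson(DATA_SET_1)
print(f"DW, Data set 1:     {dw_set_1}")  # 16 / 8 = 2.0
```

With these simulated data the linear case gives a value in the neighborhood of 2, while the curved case gives a value far below 2. Whether a particular value is statistically significant still has to be judged against tabulated critical values for the number of samples in hand.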
But anyone doing calibration work should read Draper and Smith anyway; it is the “bible” of regression analysis. The full reference is given in the reference list [1]. The discussions of DW are on pp. 69 and 181–192 of Draper and Smith (third edition; the second edition contains a similar but somewhat less extensive discussion). They also include an algorithm and tables of critical values for deciding whether the correlation is statistically significant or not. You might also want to check out page 64 for the proof that the linear correlation between residuals and predicted values from the calibration is zero. So DW and R² test different things. As a specific test for nonlinearity, what is the relative utility of DW versus R²? Basically, the answer is that when done in the way Draper and Smith (and I) described, DW is specifically sensitive to nonlinearity in the predictions. So, for example, in the case of the Anscombe data, all the other statistics (including R²) might be considered satisfactory, and since they are the same for all four sets of data, all four sets would be considered satisfactory. But if you do the DW test on the data showing nonlinearity, it will flag it as having a low value of the statistic. Anscombe did not provide enough samples’ worth of synthetic data in his sets, however, for the calculated statistics to be statistically meaningful. We also note that as a practical matter, meaningful calculation of the Durbin-Watson Statistic requires many samples’ worth of data. We noted above that for fewer than

fifteen samples, critical values for this statistic are not listed in the tables. The reason for requiring so many samples is that we are essentially comparing two variances (or, at least, two measures of the same variance). Since variances are distributed as χ², for small numbers of samples this statistic has a very wide range of values indeed, so that comparisons become virtually meaningless because almost anything will fall within the confidence interval, giving this test low statistical power. On the other hand, characterizing R² as a general measure of how good the fit is does not make us flinch, either; it is one of the standard statistics for doing that evaluation. Quite the contrary, when we saw it being specified as a way to test linearity, we wondered why that was chosen by the FDA and ICH, since it is so NON-specific. We still do not know why, except for the obvious guess that they did not know about DW. We are in favor of keeping the other statistics as measures of the general “goodness of fit” of the model to the data, but in the specific context of trying to assess linearity, we still have to promote DW over R² as being more suited for that special purpose, although we will eventually discuss in our next few chapters an even better method for assessing linearity; after all, it was the section on “Linearity” where this all came up. As for testing other characteristics of a univariate calibration, there are also ways to test for statistical significance of the slope, to see whether unity slope adequately describes the relationship between test results and analyte concentration. These are described in the book Principles and Practice of Spectroscopic Calibration [10]. The Statistics are described there, and are called the “Data Significance t” test and the “Slope Significance t” test (or DST and SST tests!). Unless the DST is statistically significant, the SST is meaningless, though. In principle, there is also a test for the intercept.
But since the expected value for the intercept depends on the slope, it gets a bit hairy. It also makes the confidence interval so large that the test is nigh on useless – few statisticians recommend it. But let us add this coda to the discussion of DW: the fact that DW is specifically sen sitive to nonlinearity does not mean that it is perfect. There may be cases of nonlinearity that will not be detected (especially if it is marginal amount), linear data will occasion ally be flagged as nonlinear (% of the time, in the long run) and other types of defects in the data may show up by giving a statistically significant value to DW. But all this is true for any and all statistics. The existence of at least one data set that is known to fool the calculation is a warning that the Durbin-Watson statistic, while a (large) step in the right direction, is not the ultimate answer. Some further comments here: there does seem to be some confusion between the usage of the statistics recommended by the guidelines, which are excellent for their intended purpose of testing the general “goodness of fit” of a model, and the specific testing of a particular model characteristic, such as linearity. A good deal of this confusion is probably due to the fact that the guidelines recommend those general statistics for the specific task of testing linearity. As Anscombe shows, however, and as we referred to previously, those generalized statistics are not up to the task. In our next chapter we will discuss other methods of testing for linearity that have appeared in the literature. Afterward, we will then turn our attention to a new test that has been devised. In fact, it turns out that while DW has much to recommend it, it is not the final or best answer. The new method, however, is much more direct and specific even than DW. It is the correct way to test for linearity. We will discuss it all in due course, in this same place.
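As a rough illustration of how DW responds to curvature, the following Python sketch computes the statistic from the residuals of a straight-line fit. It uses the usual form of the statistic (the sum of squared successive differences of the residuals, taken in order of increasing X, divided by the sum of squared residuals). The data are synthetic, and the function names, noise level and random seed are our own assumptions, chosen only for illustration.

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences of the residuals,
    divided by the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

def straight_line_residuals(x, y):
    """Residuals from a least-squares straight line, ordered by x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x)
    x, y = x[order], y[order]
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Synthetic data (hypothetical): same noise, with and without curvature.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200)
noise = rng.normal(0.0, 0.05, x.size)

dw_linear = durbin_watson(straight_line_residuals(x, 2.0 * x + noise))
dw_curved = durbin_watson(
    straight_line_residuals(x, 2.0 * x + 0.05 * x ** 2 + noise))

print(f"DW for linear data: {dw_linear:.2f}")
print(f"DW for curved data: {dw_curved:.2f}")
```

For random, uncorrelated residuals the statistic is expected to lie near 2; the smooth, systematic residuals left by unmodeled curvature drive it toward zero. Deciding whether a given value is statistically significant still requires the published tables.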


REFERENCES

1. Draper, N. and Smith, H., Applied Regression Analysis, 3rd ed. (John Wiley & Sons, New York, 1998).
2. Durbin, J. and Watson, G.S., Biometrika 37, 409–428 (1950).
3. Durbin, J. and Watson, G.S., Biometrika 38, 159–178 (1951).
4. Durbin, J., Biometrika 56, 1–15 (1969).
5. Durbin, J., Econometrica 38, 422–429 (1970).
6. Durbin, J. and Watson, G.S., Biometrika 58, 1–19 (1971).
7. Mark, H. and Workman, J., Spectroscopy 2(11), 38–42 (1987).
8. Ritchie, G. and Ciurczak, E., American Pharmaceutical Review 3(3), 34–40 (2000).
9. Anscombe, F.J., The American Statistician 27, 17–21 (1973).
10. Mark, H., Principles and Practice of Spectroscopic Calibration (John Wiley & Sons, New York, 1991).

65

Linearity in Calibration: Act III Scene III – Other Tests for Nonlinearity

We continue the subject matter of the previous chapter: a discussion of other ways to test data for nonlinearity. Let us begin by reviewing what we want to test. The FDA/ICH guidelines, starting from a univariate perspective, consider the relationship between the actual analyte concentration and what they generically call the “test result”, a term that is independent of the technology used to ascertain the analyte concentration. This term therefore holds good for every analytical methodology from manual wet chemistry to the latest high-tech instrument. In the end, even the latest instrumental methods have to produce a number, representing the final answer for that instrument’s quantitative assessment of the concentration, and that is the test result from that instrument. This is a univariate concept to be sure, but it is the same concept that applies to all other analytical methods. Things may change in the future, but this is currently the way analytical results are reported and evaluated. So the question to be answered, for any given method of analysis, is: is the relationship between the instrument readings (test results) and the actual concentration linear?

Three tests of this characteristic were discussed in the previous chapters: the FDA/ICH recommendation of linear regression with a report of various regression statistics, visual inspection of a plot of test results versus the actual concentrations, and use of the Durbin-Watson statistic. Since we previously analyzed these tests we will not discuss them further here, but will summarize them in Table 65-1, along with other tests for nonlinearity that we explain and discuss in this chapter. We now proceed to present various linearity tests that can be found in the statistical literature:

F-TEST

Figure 65-1 shows a schematic representation of the F-test for linearity. Note that there are some similarities to the Durbin-Watson test. The key difference between this test and the Durbin-Watson test is that in order to use the F-test as a test for (non)linearity, you must have measured many repeat samples at each value of the analyte. The variabilities of the readings for each sample are pooled, providing an estimate of the within-sample variance. This is indicated by the label “Operative difference for denominator”. By Analysis of Variance, we know that the total variation of residuals around the calibration line is the sum of the within-sample variance $S^2_{\mathrm{within}}$ plus the variance of the means around the calibration line. Now, if the residuals are truly random, unbiased, and in particular the model is linear, then we know that the means for each sample will cluster

Table 65-1 Various tests for (non)linearity that have been proposed and a summary of their characteristics

| Test method | Advantages | Disadvantages |
| --- | --- | --- |
| Visual inspection of plot | Works | Cannot be automated; cannot be tested statistically; subjective |
| Durbin-Watson statistic | Works; objective; is statistically testable; can be computerized | Has “fatal flaw”; requires large number of samples; low statistical power |
| FDA/ICH recommendation: linear regression with report of slope, intercept, correlation coefficient, and residual sum of squares | Objective; can be computerized; uses standard statistics | Doesn’t work as a test of linearity |
| F-test | Objective; can be computerized; uses standard statistics | Requires large number of samples; low statistical power; usually not applicable to historical data; not specific for nonlinearity (other defects in the data may be flagged as nonlinearity) |
| Normal distribution of residuals | Objective; can be computerized; uses standard statistics | Very insensitive; very low statistical power; not specific for nonlinearity |

randomly around the calibration line, and their variance will equal $S^2_{\mathrm{within}}/n$, where $n$ is the number of repeat readings per sample (indicated by the label “Operative difference for numerator”). The ratio of these two estimates of the variance (with the variance of the means scaled by $n$, so that both estimate $S^2_{\mathrm{within}}$) will be distributed as the F-distribution, with an expected value of unity. If there is nonlinearity, such as is shown in Figure 65-1, then the variance corresponding to the means will be inflated by the systematic offset of each sample, and the computed F-ratio will be statistically significantly larger than unity.

This test thus shares several characteristics with the Durbin-Watson test. It is based on well-known and rigorously sound statistics. It is amenable to automated computerized calculation, and suitable for automatic operation in an automated process situation. It does not have the “fatal flaw” of the Durbin-Watson statistic. On the other hand, it also shares some of the disadvantages of the Durbin-Watson statistic. It is also based on a comparison of variances, so that it is of low statistical power. It requires many more samples and readings than the Durbin-Watson statistic does, since each sample must be measured many times. In general, it is not applicable

Figure 65-1 Schematic representation of the residuals for the F-test. (Plot of residuals versus predicted values about the zero line; labels: “Mean”, “Operative difference for numerator”, “Operative difference for denominator”.)

to historical data, since the data must have been collected using the proper protocols, and rarely are so many readings taken for each sample as this test requires. It is also not specific for nonlinearity. Outliers, poorly fitting models, bias or error in the reference values or other defects of the data may appear to be nonlinearity.
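The calculation just described can be sketched in a few lines of Python. We assume here that the test corresponds to the standard textbook "lack of fit" F-test, in which the scatter of the sample means about the fitted line is compared with the pooled within-sample scatter; the function name, the synthetic data and the random seed are hypothetical.

```python
import numpy as np

def lack_of_fit_f(x, y):
    """Return (F, df_num, df_den) for a straight-line calibration.

    x holds the analyte value for every reading; repeated x values
    mark the repeat readings of one sample.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)

    levels = np.unique(x)
    ss_within = 0.0   # readings about their own sample mean (denominator)
    ss_means = 0.0    # sample means about the fitted line (numerator)
    for level in levels:
        readings = y[x == level]
        mean = readings.mean()
        ss_within += np.sum((readings - mean) ** 2)
        ss_means += readings.size * (mean - (slope * level + intercept)) ** 2

    df_num = levels.size - 2        # samples minus two fitted parameters
    df_den = x.size - levels.size   # readings minus number of samples
    f_ratio = (ss_means / df_num) / (ss_within / df_den)
    return f_ratio, df_num, df_den

# Synthetic example (hypothetical): 8 samples, 5 repeat readings each.
rng = np.random.default_rng(1)
x = np.repeat(np.arange(1.0, 9.0), 5)
noise = rng.normal(0.0, 0.05, x.size)

f_linear, df_num, df_den = lack_of_fit_f(x, 1.0 + 0.5 * x + noise)
f_curved, _, _ = lack_of_fit_f(x, 1.0 + 0.5 * x + 0.05 * x ** 2 + noise)
print(f"F (linear data) = {f_linear:.2f} on {df_num} and {df_den} d.f.")
print(f"F (curved data) = {f_curved:.2f}")
```

The computed ratio would then be compared with the tabled critical value of F for the stated degrees of freedom. Note how many readings are needed: forty in this small example.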

NORMALITY OF RESIDUALS

In a well-behaved calibration model, residuals will have a Normal (i.e., Gaussian) distribution. In fact, as we have previously discussed, least-squares regression analysis is also a Maximum Likelihood method, but only when the errors are Normally distributed. If the data does not follow the straight-line model, then there will be an excessive number of residuals with too-large values, and the residuals will then not follow the Normal distribution. It follows, then, that a test for Normality of residuals will also detect nonlinearity.

Over time, statisticians have devised many tests for the distributions of data, including one that relies on visual inspection of a particular type of graph. Of course, this is no more than the direct visual inspection of the data or of the calibration residuals themselves. However, a statistical test is also available: the $\chi^2$ test for distributions, which we have previously described. This test could be applied to the question, but it shares many of the disadvantages of the F-test and other tests. The main difficulty is the practical one: this test is very insensitive, and therefore requires both a large number of samples and a large departure from linearity in order to detect the nonlinearity. Also, like the F-test, it is not specific for nonlinearity; a false positive indication can also be triggered by other types of defects in the data.

We continue in our next chapter with an explanation of a new test that has been devised, one that overcomes the limitations of the various tests we have described.
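For readers who wish to experiment with the $\chi^2$ test for Normality of residuals, the following Python sketch shows one way to set it up. The choice of ten bins of equal expected probability, the function name and the two artificial sets of residuals are our own assumptions, not a prescribed procedure.

```python
import numpy as np
from statistics import NormalDist

def chi_square_normality(residuals, n_bins=10):
    """Chi-square goodness-of-fit statistic of residuals against a
    Normal distribution, using bins of equal expected probability."""
    r = np.asarray(residuals, dtype=float)
    z = (r - r.mean()) / r.std(ddof=1)            # standardize
    # Interior bin edges at the Normal quantiles 1/n_bins, 2/n_bins, ...
    edges = [NormalDist().inv_cdf(k / n_bins) for k in range(1, n_bins)]
    observed = np.bincount(np.searchsorted(edges, z), minlength=n_bins)
    expected = r.size / n_bins
    chi2 = np.sum((observed - expected) ** 2 / expected)
    dof = n_bins - 3      # bins - 1, less the two estimated parameters
    return chi2, dof

n = 100
# Idealized "Normal" residuals: evenly spaced Normal quantiles.
normal_like = np.array(
    [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)])

# Residuals left by fitting a straight line to noise-free curved data.
x = np.linspace(0.0, 1.0, n)
y_curved = x ** 2
slope, intercept = np.polyfit(x, y_curved, 1)
curved_resid = y_curved - (slope * x + intercept)

chi2_normal, dof = chi_square_normality(normal_like)
chi2_curved, _ = chi_square_normality(curved_resid)
print(f"chi-square, Normal-like residuals: {chi2_normal:.1f} ({dof} d.f.)")
print(f"chi-square, curved-data residuals: {chi2_curved:.1f}")
```

The curved example is deliberately extreme: it has no noise at all, so the residuals are pure curvature. Once realistic noise is added the distribution of the residuals moves back toward Normal, which is one way of seeing why the test is so insensitive in practice.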


66

Linearity in Calibration: Act III Scene IV – How to Test for Nonlinearity

In Chapter 65, dealing with linearity [1], we promised we would present a description of what we believe is the best way to test for linearity (or nonlinearity, depending on your point of view). In Chapters 63 through 65 [1–3], we examined the Durbin-Watson statistic along with other methods of testing for nonlinearity. We found that while the Durbin-Watson statistic is a step in the right direction, it has shortcomings, including the fact that it can be fooled by data that has the right (or wrong!) characteristics. The method we now present is mathematically sound, more subject to statistical validity testing, based on well-known mathematical principles, of much higher statistical power than DW, and able to distinguish different types of nonlinearity from each other. This new method has also been recently described in the literature [4].

But let us begin by discussing what we want to test. The FDA/ICH guidelines, starting from a univariate perspective, consider the relationship between the actual analyte concentration and what they generically call the “test result”, a term that is independent of the technology used to ascertain the analyte concentration. This term therefore holds good for every analytical methodology from manual wet chemistry to the latest high-tech instrument. In the end, even the latest instrumental methods have to produce a number, representing the final answer for that instrument’s quantitative assessment of the concentration, and that is the test result from that instrument. This is a univariate concept to be sure, but it is the same concept that applies to all other analytical methods. Things may change in the future, but this is currently the way analytical results are reported and evaluated. So the question to be answered, for any given method of analysis, is: is the relationship between the instrument readings (test results) and the actual concentration linear?
This method of determining nonlinearity can be viewed from a number of different perspectives, and can be considered as coming from several sources. One way to view it is as having a pedigree as a method of numerical analysis [5]. Our new method of determining nonlinearity (or showing linearity) is also related to our discussion of derivatives, particularly when using the Savitzky-Golay method of convolution functions, as we discussed recently [6]. This last is not very surprising, once you consider that the Savitzky-Golay convolution functions are also (ultimately) derived from considerations of numerical analysis. In some ways it also bears a resemblance to the current method of assessing linearity that the FDA and ICH guidelines recommend, that of fitting a straight line to the data, and assessing the goodness of the fit. As we showed [2, 3], based on the work of Anscombe [7], the currently recommended method for assessing linearity is faulty because it cannot distinguish linear from nonlinear data, nor can it distinguish between nonlinearity and other types of defects in the data. But an extension of that method can.


In our recent chapter we proposed a definition of linearity [2]. We defined linearity as “The property of data comparing test results to actual concentrations, such that a straight line provides as good a fit (using the least-squares criterion) as any other mathematical function.” This almost seems to be the same as the FDA/ICH approach, which we just discredited. But there is a difference. The difference is the question of fitting other possible functions to the data; the FDA/ICH guidelines only specify trying to fit a straight line to the data. Fitting other functions is more in line with our own proposed definition of linearity. We can try to fit functions other than a straight line to the data, and if we cannot obtain an improved fit, we can conclude that the data is linear.

It is indeed possible to fit other functions to a set of data, using least-squares mathematics. In fact, this is what the Savitzky-Golay method does. The Savitzky-Golay algorithm, however, does a whole bunch of things, and lumps all those things together in a single set of convolution coefficients: it includes smoothing, differentiation, curve-fitting of polynomials of various degrees and least-squares calculations, does not include interpolation (although it could), and finally combines all those operations into a single set of numbers by which you can multiply your measured data to directly get the desired final answer.

For our purposes, though, we do not want to lump all those operations together. Rather, we want to separate them and retain only those operations that are useful for our own purposes. For starters, we discard the smoothing, the derivatives and the successive (running) fit over different portions of the data set, and keep only the curve-fitting. Texts dealing with numerical analysis tell us what to do and how to do it. Many texts exist dealing with this subject, but we will follow the presentation of Arden [5].
Arden points out, and discusses in detail, many applications of numerical analysis: fitting data, determining derivatives and integrals, interpolation (and extrapolation), solving systems of equations and solving differential equations. These methods are all based on using a Taylor series to form an approximation to a function describing a set of data. The nature of the data and the nature of the approximation considered differ from what we are used to thinking about, however. The data is assumed to be univariate (which is why this is of interest to us here) and to follow the form of some mathematical function, although we may not know what the function is. So all the applications mentioned are based on the concept that since a function exists, our task is to estimate the nature of that function, using a Taylor series, and then evaluate the parameters of the function by imposing the condition that our approximating function must pass through all the data points available, since those data points are all described exactly by that function. Using a Taylor series implies that the approximating function that we wind up with will be a polynomial, and perhaps one of very high degree (the “degree” of a polynomial being the highest power to which the variable is raised in that polynomial). If we have chosen the wrong function, then there may be some error in the estimate of data between the known data points, but at the data points the error must be zero. A good deal of mathematical analysis goes into estimating the error that can occur between the data points.

The concepts of interest to us are contained in Arden’s book in a chapter entitled “Approximation”. This chapter takes a slightly different tack than the rest of the discussion, but one that goes exactly in the direction that we want to go. In this chapter, the scenario described above is changed very slightly.
There is still the assumption that there is a single (univariate) mathematical system (corresponding to “analyte concentration” and “test reading”), and that there is a functional relationship between the two variables of interest, although again the nature of the relationship may be unknown. The difference, however, is the recognition that data may have error; therefore we no longer impose the condition that the function we arrive at must pass through every data point. We replace that criterion with a different one, which allows us to say that the function we use to describe the data “follows” the data in some sense. While other criteria can be used, a common criterion used for this purpose is the “least squares” principle: to find parameters for any given function that minimize the sum of the squares of the differences between the data and the corresponding points of the function.

Similarly, many different types of functions can be used. Arden discusses, for example, the use of Chebyshev polynomials, which are based on trigonometric functions (sines and cosines). But these polynomials have a major limitation: they require the data to be collected at uniform X-intervals throughout the range of X, and real data will seldom meet that criterion. Therefore, since they are also by far the simplest to deal with, the most widely used approximating functions are simple polynomials; they are also convenient in that they are the direct result of applying Taylor’s theorem, since Taylor’s theorem produces a description of a polynomial that estimates the function being reproduced. Also, as we shall see, they lead to a procedure that can be applied to data having any distribution of the X-values.

$$ Y = a_0 + a_1 X + a_2 X^2 + a_3 X^3 + \cdots + a_n X^n \tag{66-4} $$
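The distinction between the two criteria (a function forced through every data point versus one that merely "follows" the data in the least-squares sense) can be seen in a small numerical sketch. The six data values below are invented for illustration, and the code is ours, not Arden's.

```python
import numpy as np

# Hypothetical data: concentrations (x) and slightly noisy test results (y).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 5.8])

# Exact interpolation: a polynomial of degree N-1 through all N points.
vandermonde = np.vander(x, N=x.size, increasing=True)  # columns 1, x, x^2, ...
a_exact = np.linalg.solve(vandermonde, y)
exact_errors = vandermonde @ a_exact - y

# Least-squares approximation by a polynomial of degree 1 (a straight line).
design = np.vander(x, N=2, increasing=True)            # columns 1, x
a_ls, *_ = np.linalg.lstsq(design, y, rcond=None)
ls_errors = design @ a_ls - y

print("largest error at the data points, interpolation:",
      np.max(np.abs(exact_errors)))
print("largest error at the data points, least squares:",
      np.max(np.abs(ls_errors)))
print("least-squares intercept and slope:", a_ls)
```

The interpolating polynomial reproduces every point (to within round-off) but needs six coefficients to do so, so it treats the noise as if it were part of the function. The straight line uses two coefficients and leaves small residuals at the data points.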

Note that here again we continue our usual practice of continuing equation and figure numbering through a set of related chapters. While discussing derivatives, we have noted in a previous chapter that for certain data a polynomial can provide a better fit to that data than can a straight line (see Figure 66-6B of [8]). In fact, we reproduce that Figure 66-6B here again as Figure 66-3 in this chapter, for ease of reference. Higher degree polynomials may provide an even better fit, if the data requires it. Arden points this out, and also points out that, for example in the non-approximation case (assuming exact functionality), if the underlying function is

Figure 66-3 A quadratic polynomial can provide a better fit to a nonlinear function over a given region than a straight line can; in this case the second derivative of a Normal absorbance band. (Plot of response versus wavelength; curves labeled “Second derivative” and “Parabola”.)

in fact itself a polynomial of degree n, then no higher degree polynomial is needed in that case, and in fact, it is impossible to fit a higher polynomial to the data. Even if an attempt is made to do so, the coefficients of any higher-degree terms will be zero. For functions other than polynomials the “best” fit may not be clear, but as we shall see, that will not affect us.

The mathematics of fitting a polynomial by least squares are relatively straightforward, and we present a derivation here, one that follows Arden, but is rather generic, as we shall see. Starting from equation 66-4, we want to find coefficients (the $a_i$) that minimize the sum-squared difference between the data and the function’s estimate of that data, given a set of values of X. Therefore we first form the differences:

$$ D = a_0 + a_1 X + a_2 X^2 + a_3 X^3 + \cdots + a_n X^n - Y \tag{66-5} $$

Then we square those differences and sum those squares over all the sets of data (corresponding to the samples used to generate the data):

$$ \sum_i D^2 = \sum_i \left( a_0 + a_1 X + a_2 X^2 + a_3 X^3 + \cdots + a_n X^n - Y \right)^2 \tag{66-6} $$

The problem now is to find a set of values for the $a_i$ that minimizes $\sum_i D^2$ with respect to each $a_i$. We do this by the usual procedure of taking the derivative of $\sum_i D^2$ with respect to each $a_i$ and setting each of those derivatives equal to zero. Note that since there are n + 1 different $a_i$, we wind up with n + 1 equations, although we only show the first three of the set:

$$ \frac{\partial \left( \sum_i D^2 \right)}{\partial a_0} = \frac{\partial}{\partial a_0} \sum_i \left( a_0 + a_1 X + a_2 X^2 + a_3 X^3 + \cdots + a_n X^n - Y \right)^2 = 0 \tag{66-7a} $$

$$ \frac{\partial \left( \sum_i D^2 \right)}{\partial a_1} = \frac{\partial}{\partial a_1} \sum_i \left( a_0 + a_1 X + a_2 X^2 + a_3 X^3 + \cdots + a_n X^n - Y \right)^2 = 0 \tag{66-7b} $$

$$ \frac{\partial \left( \sum_i D^2 \right)}{\partial a_2} = \frac{\partial}{\partial a_2} \sum_i \left( a_0 + a_1 X + a_2 X^2 + a_3 X^3 + \cdots + a_n X^n - Y \right)^2 = 0 \tag{66-7c} $$

and so on. Now we actually take the indicated derivative of each term and separate the summations, noting that $\partial \left( \sum_i F^2 \right) = 2 \sum_i F \, \partial F$ (where F is the inner summation of the $a_i X^i$):

$$ 2 a_0 \sum_i (1) + 2 a_1 \sum_i X + 2 a_2 \sum_i X^2 + 2 a_3 \sum_i X^3 + \cdots + 2 a_n \sum_i X^n - 2 \sum_i Y = 0 \tag{66-8a} $$

$$ 2 a_0 \sum_i X + 2 a_1 \sum_i X^2 + 2 a_2 \sum_i X^3 + 2 a_3 \sum_i X^4 + \cdots + 2 a_n \sum_i X^{n+1} - 2 \sum_i X Y = 0 \tag{66-8b} $$

$$ 2 a_0 \sum_i X^2 + 2 a_1 \sum_i X^3 + 2 a_2 \sum_i X^4 + 2 a_3 \sum_i X^5 + \cdots + 2 a_n \sum_i X^{n+2} - 2 \sum_i X^2 Y = 0 \tag{66-8c} $$

and so on.


Dividing both sides of equations 66-8 (a–c) by two eliminates the constant factor, and moving the term involving Y to the right-hand side of each equation puts the equations in their final form:

$$ a_0 \sum_i (1) + a_1 \sum_i X + a_2 \sum_i X^2 + a_3 \sum_i X^3 + \cdots + a_n \sum_i X^n = \sum_i Y \tag{66-9a} $$

$$ a_0 \sum_i X + a_1 \sum_i X^2 + a_2 \sum_i X^3 + a_3 \sum_i X^4 + \cdots + a_n \sum_i X^{n+1} = \sum_i X Y \tag{66-9b} $$

$$ a_0 \sum_i X^2 + a_1 \sum_i X^3 + a_2 \sum_i X^4 + a_3 \sum_i X^5 + \cdots + a_n \sum_i X^{n+2} = \sum_i X^2 Y \tag{66-9c} $$

and so on. The values of X and Y are known, since they constitute the data. Therefore equations 66-9 (a–c) comprise a set of n + 1 equations in n + 1 unknowns, the unknowns being the various values of the $a_i$, since the summations, once evaluated, are constants. Therefore, solving equations 66-9 (a–c) as simultaneous equations for the $a_i$ results in the calculation of the coefficients that describe the polynomial (of degree n) that best fits the data.

In principle, the relationships described by equations 66-9 (a–c) could be used directly to construct a function that relates test results to sample concentrations. In practice, there are some important considerations that must be taken into account. The major consideration is the possibility of correlation between the various powers of X. We find, for example, that the correlation coefficient of the integers from 1 to 10 with their squares is 0.974, a rather high value. Arden describes this mathematically and shows how the determinant of the matrix formed by equations 66-9 (a–c) becomes smaller and smaller as the number of terms included in equation 66-4 increases, due to correlation between the various powers of X. Arden is concerned with computational issues, and his concern is that the determinant will become so small that operations such as matrix inversion will become impossible to perform because of truncation error in the computer used. Our concerns are not so severe; as we shall see, we are not likely to run into such drastic problems.
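Equations 66-9 translate almost directly into code. The following Python sketch builds the sums of powers of X, solves the simultaneous equations for a quadratic, and also evaluates the correlation between the integers 1 to 10 and their squares. The function name and the noise-free test data are hypothetical, chosen so that the correct answer is known in advance.

```python
import numpy as np

def polyfit_normal_equations(x, y, degree):
    """Solve equations 66-9 for a_0 ... a_n from sums of powers of X."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = degree
    # Left-hand side: element (r, c) is sum_i X^(r+c); sum_i X^0 is sum_i (1).
    lhs = np.array([[np.sum(x ** (r + c)) for c in range(n + 1)]
                    for r in range(n + 1)])
    # Right-hand side: element r is sum_i X^r Y.
    rhs = np.array([np.sum(x ** r * y) for r in range(n + 1)])
    return np.linalg.solve(lhs, rhs)

x = np.arange(1.0, 11.0)              # the integers 1 to 10
y = 2.0 + 0.5 * x + 0.1 * x ** 2      # hypothetical, noise-free quadratic

coeffs = polyfit_normal_equations(x, y, degree=2)
print("a0, a1, a2 =", coeffs)

# Correlation between X and X^2 for the integers 1 to 10.
r = np.corrcoef(x, x ** 2)[0, 1]
print("correlation of X with X^2:", r)
```

With noise-free quadratic data the three generating coefficients are recovered, and the correlation comes out at about 0.9746, in line with the value quoted in the text. For routine work a library least-squares routine is preferable to forming these sums explicitly, for the numerical reasons Arden discusses.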
Nevertheless, correlation effects are still of concern for us, for another reason. Our goal, recall, is to formulate a method of testing linearity in such a way that the results can be justified statistically. Ultimately we will want to perform statistical testing on the coefficients of the fitting function that we use. In fact, we will want to use a t-test to see whether any given coefficient is statistically significant, compared to the standard error of that coefficient. We do not need to solve the general problem, however, just as we do not need to create the general solution implied by equation 66-4. In the broadest sense, equation 66-4 is the basis for computing the best-fitting function to a given set of data, but that is not our goal. Our goal is only to determine whether the data represent a linear function or not. To this end it suffices only to ascertain whether the data can be fitted better by any polynomial of degree greater than 1 than it can by a straight line (which is a polynomial of degree 1). To this end we need to test a polynomial of any higher degree. While in some cases the use of more terms may be warranted, in the limit we need test only the ability to fit the data using only one term of degree greater than one. Hence, while in general we may wish to try fitting equations of degrees 2, 3, ..., m (where m is some upper limit less than n), we can begin by using polynomials of degree 2, that is, quadratic fits.


A complication arises. We learn from considerations of multiple regression analysis that when two (or more) variables are correlated, the standard errors of the coefficients of both variables are increased over what would be obtained if equivalent but uncorrelated variables were used. This is discussed by Daniel and Wood (see p. 55 in [9]), who show that the variance of the estimates of coefficients (the square of their standard errors) is increased by a factor of

$$ \mathrm{VIF} = \frac{1}{1 - R^2} \tag{66-10} $$

when there is correlation between the variables, where R represents the correlation coefficient between the variables and we use the term VIF, as is sometimes done, to mean Variance Inflation Factor. Thus we would like to use uncorrelated variables. Arden describes a general method for removing the correlation between the various powers of X in a polynomial, based on the use of orthogonal Chebyshev polynomials, as we briefly mentioned above. But this method is unnecessarily complicated for our current purposes, and in any case has limitations of its own. When applied to actual data, Chebyshev and other types of orthogonal polynomials (Legendre, Jacobi and others) that could be used will be orthogonal only if the data is uniformly, or at least symmetrically, distributed; real data will not always meet that requirement.

Since, as we shall see, we do not need to deal with the general case, we can use a simpler method to orthogonalize the variables, based on Daniel and Wood, who showed how a variable can be transformed so that the square of that variable is uncorrelated with the variable. This is a matter of creating a new variable by simply calculating a quantity Z and subtracting that from each of the original values of X. A symmetric distribution of the data is not required, since that is taken into account in the formula. Z is calculated using the following expression (see p. 121 in [9]); in Appendix A, we present the derivation of this formula:

$$ Z = \frac{\sum_{j=1}^{N} X_j^2 \left( X_j - \bar{X} \right)}{2 \sum_{j=1}^{N} \left( X_j - \bar{X} \right)^2} \tag{66-11} $$
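Equations 66-10 and 66-11 are easily checked numerically. The sketch below computes Z for a deliberately asymmetric set of X values, confirms that X is then uncorrelated with the square of (X - Z), and evaluates the variance inflation factor for the correlation of 0.974 mentioned earlier. The X values and the function names are made up for illustration.

```python
import numpy as np

def orthogonalizing_z(x):
    """Z of equation 66-11: makes (X - Z)**2 uncorrelated with X."""
    x = np.asarray(x, dtype=float)
    dev = x - x.mean()
    return np.sum(x ** 2 * dev) / (2.0 * np.sum(dev ** 2))

def vif(r):
    """Variance inflation factor of equation 66-10."""
    return 1.0 / (1.0 - r ** 2)

# Hypothetical, deliberately asymmetric X values.
x = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 9.0, 15.0])
z = orthogonalizing_z(x)

r_raw = np.corrcoef(x, x ** 2)[0, 1]           # X against X^2
r_orth = np.corrcoef(x, (x - z) ** 2)[0, 1]    # X against (X - Z)^2

print(f"mean of X = {x.mean():.3f}, Z = {z:.3f}")
print(f"correlation of X with X^2:       {r_raw:.4f}")
print(f"correlation of X with (X - Z)^2: {r_orth:.2e}")
print(f"VIF for R = 0.974: {vif(0.974):.2f}")
```

For symmetrically distributed X (the integers 1 to 10, say) Z reduces to the mean of X; for asymmetric data such as these it does not, which is why simply subtracting the mean is not sufficient in general.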

Then the set of values $(X - Z)^2$ will be uncorrelated with X, and estimates of the coefficients will have the minimum possible variance, making them suitable for statistical testing. In Appendix A, we also present formulas for making the cubes, quartics and, by induction, higher powers of X orthogonal to the set of values of the variable itself.

In his discussion of using these approximating polynomials, Arden presents a computationally efficient method of setting up and solving the pertinent equations. But we are less concerned with abstract concepts of efficiency than we are with achieving our goal of determining linearity. To this end, we point out that the equations 66-9, and indeed the whole derivation of them, are familiar to us, although in a different context. We are all familiar with using a relationship similar to equation 66-4; in using spectroscopy to do quantitative analysis, one of the representations of the equation involved is

$$ C = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_n X_n \tag{66-12} $$


which is the form we commonly use to represent the equations needed for doing quantitative spectroscopic analysis using the MLR algorithm. The various $X_i$ in equation 66-12 represent entirely different variables. Nevertheless, starting from equation 66-12, we can derive the set of equations for calculating the MLR calibration coefficients, in exactly the same way we derived equations 66-9 (a–c) from equation 66-4. An example of this derivation is presented in [10]. Because of this parallelism we can set up the equivalencies:

$$ \begin{aligned} a_0 &= b_0 \\ a_1 &= b_1, & X_1 &= X \\ a_2 &= b_2, & X_2 &= X^2 \\ a_3 &= b_3, & X_3 &= X^3 \end{aligned} $$

and so on, and we see that by replacing our usual MLR-oriented variables $X_1$, $X_2$, $X_3$, and so on with $X$, $X^2$, $X^3$, and so on, respectively, we can use our common and well-understood mathematical methods (and computer programs) to perform the necessary calculations. Furthermore, along with the values of the coefficients, we can obtain all the usual statistical estimates of variances, standard errors, goodness of fit, and so on that MLR programs produce for us. Of special interest is the fact that MLR programs compute estimates of the standard errors of the coefficients, as described by Draper and Smith (see, for example, p. 129 in [11]). This allows testing the statistical significance of each of the coefficients, which, as we recall, are now the coefficients of the various powers of X that comprise the polynomial we are fitting to the data.

This is the basis of our tests for nonlinearity. We need not use polynomials of high degree, since our goal is not necessarily to fit the data as well as possible. Especially since we expect that well-behaved methods of chemical analysis will produce results that are already close to linearly related to the analyte concentrations, we expect nonlinear terms to decrease as the degree of the fitting equation used increases. Thus we need only fit a quadratic, or at most a cubic, equation to our data to test for linearity, although there is nothing to stop us from using equations of higher degree if we choose. Data well described by a linear equation will produce a set of coefficients with a statistically significant value for the coefficient of $X^1$ (which is X, of course) and non-significant values for the coefficients of $X^2$ or higher degree.
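For the quadratic case, the calculation that an MLR program performs can be written out compactly. What follows is a sketch in standard least-squares matrix notation, not a quotation from [11]: the symbols $\mathbf{A}$, $\mathbf{b}$, $\hat{Y}_j$ and $s^2$ are introduced here for illustration, $N$ is the number of samples, and $Z$ is the orthogonalizing constant of equation 66-11.

```latex
\begin{aligned}
\mathbf{A} &=
\begin{bmatrix}
1 & X_1 & (X_1 - Z)^2 \\
\vdots & \vdots & \vdots \\
1 & X_N & (X_N - Z)^2
\end{bmatrix},
&
\mathbf{b} &= \left( \mathbf{A}^{T} \mathbf{A} \right)^{-1} \mathbf{A}^{T} \mathbf{Y} \\[6pt]
s^2 &= \frac{\sum_{j=1}^{N} \left( Y_j - \hat{Y}_j \right)^2}{N - 3},
&
\mathrm{SE}(b_k) &= \sqrt{ s^2 \left[ \left( \mathbf{A}^{T} \mathbf{A} \right)^{-1} \right]_{kk} } \\[6pt]
t_k &= \frac{b_k}{\mathrm{SE}(b_k)}
\end{aligned}
```

Here the subscripts on $X$ number the samples, not different variables as in equation 66-12. The value of $t_k$ for the squared term is compared with the critical value of Student's t for $N - 3$ degrees of freedom; a non-significant result is what we expect from linear data.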

CONCLUSION

This is the basis for our new test of linearity. It has all the advantages we described: it gives an unambiguous determination of whether any nonlinearity is affecting the relationship between the test results and the analyte concentration. It provides a means of distinguishing between different types of nonlinearity, if they are present, since only those that have statistically significant coefficients are active. It also is more sensitive than any other statistical linearity test, including the Durbin-Watson statistic. The tables in Draper and Smith for the thresholds of the Durbin-Watson statistic only give values for more than ten samples. As we shall shortly see, however, this method of linearity testing is quite satisfactory for much smaller numbers of samples.

As an example, we applied these concepts to the Anscombe data [7]. Table 66-1 shows the results of applying this to both the “normal” data (Anscombe’s X1, Y1 set) and the data showing nonlinearity. We computed the nature of the fit using only a straight-line (linear) fit, as was done originally by Anscombe, and also fitted a polynomial using the quadratic term as well. It is interesting to compare results both ways. We find that in all four cases, the coefficient of the linear term is 0.5. In Anscombe’s original paper, the straight-line fit is all he did, and he obtained the same result, but this was by design: the synthetic data he generated was designed and intended to give this result for all the data sets. The fact that we obtained the same coefficient (for X) using the polynomial demonstrates that the quadratic term was indeed uncorrelated with the linear term.

The improvement in the fit from the quadratic polynomial applied to the nonlinear data indicated that the square term was indeed an important factor in fitting that data. In fact, including the quadratic term gives well-nigh a perfect fit to that data set, limited only by the computer truncation precision. The coefficient obtained for the quadratic term is comparable in magnitude to the one for the linear term, as we might expect from the amount of curvature of the line we see in Anscombe’s plot [7]. The coefficient of the quadratic term for the “normal” data, on the other hand, is much smaller than that of the linear term.

Table 66-1 The results of applying the new method of detecting nonlinearity to Anscombe’s data sets, both the linear and the nonlinear, as described in the text

| Parameter | Coefficient when using only linear term | t-value when using only linear term | Coefficient using square term | t-value using square term |
|---|---|---|---|---|
| Results for nonlinear data | | | | |
| Constant | 3.000 | | 4.268 | |
| Linear term | 0.500 | 4.24 | 0.5000 | 3135.5 |
| Square term | --- | --- | −0.1267 | −2219.2 |
| SEE | 1.237 | | 0.0017 | |
| R | 0.816 | | 1.0 | |
| Results for normal data | | | | |
| Constant | 3.000 | | 3.316 | |
| Linear term | 0.500 | 4.24 | 0.500 | 4.1 |
| Square term | --- | --- | −0.0316 | −0.729 |
| SEE | 1.237 | | 1.27 | |
| R | 0.816 | | 0.8291 | |


As we expected, furthermore, the t-value for the quadratic term fitted to the “normal”, linear data is not statistically significant. This demonstrates our contention that this method of testing linearity is indeed capable of distinguishing the two cases, in a manner that is statistically justifiable. The performance statistics, the SEE and the correlation coefficient, show that including the square term in the fitting function for Anscombe’s nonlinear data set gives, as we noted above, essentially a perfect fit. It is clear that the values of the coefficients obtained are the ones he used to generate the data in the first place. The very large t-values of the coefficients indicate that essentially only computer round-off error remains in the difference between the data he provided and the values calculated from the polynomial that included the second-degree term. Thus this new test also provides all the statistical tests that the current FDA/ICH test procedure recommends, and it also provides information as to whether, and how well, the analytical method gives a good fit of the test results to the actual concentration values. It can distinguish between different types of nonlinearities, if necessary, while simultaneously evaluating the overall goodness of the fitting function. As the results from applying it to the Anscombe data show, it is eminently suited to evaluating the linearity characteristics of small data sets as well as large ones.
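The comparison summarized in Table 66-1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the program used for the table: the function names `orthogonalizing_offset` and `linearity_test` are our own, the data are the commonly published values of Anscombe's first two sets, and the t-values are computed from the ordinary least-squares covariance matrix, which we assume corresponds to the calculation described in the text.

```python
import numpy as np

# Anscombe's sets I ("normal") and II (curved); both share the same X values.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y_normal = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                     7.24, 4.26, 10.84, 4.82, 5.68])
y_curved = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                     6.13, 3.10, 9.13, 7.26, 4.74])


def orthogonalizing_offset(x):
    """Return Z such that (x - Z)**2 is uncorrelated with x (equation 66-A6)."""
    d = x - x.mean()
    return np.sum(x**2 * d) / (2.0 * np.sum(x * d))


def linearity_test(x, y):
    """Fit y = b0 + b1*x + b2*(x - Z)**2; return (coefficients, t-values)."""
    z = orthogonalizing_offset(x)
    design = np.column_stack([np.ones_like(x), x, (x - z) ** 2])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    dof = len(y) - design.shape[1]              # residual degrees of freedom
    cov = (resid @ resid / dof) * np.linalg.inv(design.T @ design)
    return coef, coef / np.sqrt(np.diag(cov))


coef_n, t_n = linearity_test(x, y_normal)
coef_c, t_c = linearity_test(x, y_curved)
print("normal data: b =", np.round(coef_n, 4), " t =", np.round(t_n, 3))
print("curved data: b =", np.round(coef_c, 4), " t =", np.round(t_c, 1))
```

The quantity of interest is the t-value of the last coefficient: small for the “normal” set and very large for the curved set.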

APPENDIX A: DERIVATION AND DISCUSSION OF THE FORMULA IN EQUATION 66–11

Starting with a set of data values $X_i$, we want to create a set of other values from these $X_i$ such that the squares of those values are uncorrelated with the $X_i$ themselves. We do this by subtracting a value $Z$ from each of the $X_i$ and finding a suitable value of $Z$, so that the set of values $(X_i - Z)^2$ is uncorrelated with the $X_i$. From the definition of the correlation coefficient, then, this means that the following must hold:

$$\frac{\sum_i \left(X_i - \bar{X}\right)\left(X_i - Z\right)^2}{\sqrt{\sum_i \left(X_i - \bar{X}\right)^2 \, \sum_i \left[\left(X_i - Z\right)^2 - \overline{\left(X_i - Z\right)^2}\right]^2}} = 0 \tag{66-A1}$$

Multiplying both sides of equation 66-A1 by the denominator of the LHS of equation 66-A1 results in the much-simplified expression:

$$\sum_i \left(X_i - \bar{X}\right)\left(X_i - Z\right)^2 = 0 \tag{66-A2}$$

We now need to solve this expression for $Z$. We begin by expanding the square term:

$$\sum_i \left(X_i - \bar{X}\right)\left(X_i^2 - 2X_i Z + Z^2\right) = 0 \tag{66-A3}$$

We then multiply through:

$$\sum_i \left[X_i^2\left(X_i - \bar{X}\right) - 2X_i Z\left(X_i - \bar{X}\right) + Z^2\left(X_i - \bar{X}\right)\right] = 0 \tag{66-A4a}$$


distributing the summations and bringing constants outside the summations:

$$\sum_i X_i^2\left(X_i - \bar{X}\right) - 2Z\sum_i X_i\left(X_i - \bar{X}\right) + Z^2\sum_i \left(X_i - \bar{X}\right) = 0 \tag{66-A4b}$$

Since $\sum_i \left(X_i - \bar{X}\right) = 0$, the last term in equation 66-A4b vanishes, leaving

$$\sum_i X_i^2\left(X_i - \bar{X}\right) - 2Z\sum_i X_i\left(X_i - \bar{X}\right) = 0 \tag{66-A5}$$

Equation 66-A5 is now readily rearranged to solve for $Z$:

$$Z = \frac{\sum_i X_i^2\left(X_i - \bar{X}\right)}{2\sum_i X_i\left(X_i - \bar{X}\right)} \tag{66-A6}$$

Equation 66-A6 appears to differ from the expression in Daniel and Wood [9], in that the denominator expressions differ. To show that they are equivalent, we start with the denominator term of the expression on p. 121 of [9]:

$$\sum_i \left(X_i - \bar{X}\right)^2 \tag{66-A7}$$

Again, we expand this expression:

$$\sum_i X_i^2 - 2\sum_i X_i\bar{X} + \sum_i \bar{X}^2 \tag{66-A8}$$

and separating and collecting terms:

$$\sum_i X_i^2 - \sum_i X_i\bar{X} - \sum_i \left(X_i\bar{X} - \bar{X}^2\right) \tag{66-A9}$$

Rearranging the last term in the expression:

$$\sum_i X_i^2 - \sum_i X_i\bar{X} - \bar{X}\sum_i \left(X_i - \bar{X}\right) \tag{66-A10}$$

And we find that again, the last term in equation 66-A10 vanishes since $\sum_i \left(X_i - \bar{X}\right) = 0$, leaving:

$$\sum_i X_i^2 - \sum_i X_i\bar{X} \tag{66-A11}$$

And upon combining the summations and factoring out $X_i$:

$$\sum_i X_i\left(X_i - \bar{X}\right) \tag{66-A12}$$

which is thus seen to be the same as the summation in the denominator term we derived in equation 66-A6: QED
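Both results of this appendix are easy to check numerically. The following sketch uses an arbitrary, deliberately unevenly spaced set of X-values of our own choosing (not data from the text), computes Z from equation 66-A6, and confirms that the correlation between X and (X − Z)² is zero to within round-off, and that the two forms of the denominator summation (equations 66-A7 and 66-A12) agree.

```python
import numpy as np


def orthogonalizing_offset(x):
    """Z from equation 66-A6: makes (x - Z)**2 uncorrelated with x."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return np.sum(x**2 * d) / (2.0 * np.sum(x * d))


# Hypothetical, unevenly spaced X-values chosen only for illustration.
x = np.array([0.3, 1.1, 1.9, 2.4, 4.0, 7.5, 9.2])

z = orthogonalizing_offset(x)
r = np.corrcoef(x, (x - z) ** 2)[0, 1]      # should be zero within round-off

d = x - x.mean()
denominators_agree = np.isclose(np.sum(x * d), np.sum(d**2))

print(f"Z = {z:.6f}, correlation = {r:.2e}, denominators agree: {denominators_agree}")
```

For symmetrically distributed X-values Z reduces to the mean; for skewed data such as these it does not, which is why the general formula is needed.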


By similar means we can derive expressions that will create transformations of other powers of the X-variable that make the corresponding power uncorrelated with the X-variable itself. Thus, analogously to equation 66-A2, if we wish to find a quantity $Z_3$ that will make $(X_i - Z_3)^3$ be uncorrelated with X, we set up the expression:

$$\sum_i \left(X_i - \bar{X}\right)\left(X_i - Z_3\right)^3 = 0 \tag{66-A13}$$

which, after expanding the cube, dropping the term in $\sum_i (X_i - \bar{X})$ and dividing through by 3, provides the following polynomial in $Z_3$:

$$\frac{1}{3}\sum_i X_i^3\left(X_i - \bar{X}\right) - Z_3\sum_i X_i^2\left(X_i - \bar{X}\right) + Z_3^2\sum_i X_i\left(X_i - \bar{X}\right) = 0 \tag{66-A14}$$

Equation 66-A14 is quadratic in $Z_3$, and thus, after evaluating the summations, is easily solved through use of the Quadratic Formula. Similarly, for fourth powers we set up the expression:

$$\sum_i \left(X_i - \bar{X}\right)\left(X_i - Z_4\right)^4 = 0 \tag{66-A15}$$

which gives

$$\frac{1}{4}\sum_i X_i^4\left(X_i - \bar{X}\right) - Z_4\sum_i X_i^3\left(X_i - \bar{X}\right) + \frac{6}{4}Z_4^2\sum_i X_i^2\left(X_i - \bar{X}\right) - Z_4^3\sum_i X_i\left(X_i - \bar{X}\right) = 0 \tag{66-A16}$$

Equation 66-A16 is cubic in $Z_4$ and can again be solved by algebraic methods. For higher powers of the variable we can derive similar expressions. Beyond the fifth power, for which the polynomial in Z is of fourth degree, general algebraic methods are no longer available to solve for the $Z_n$, but after evaluating the summations, computerized approximation methods can be used. Thus the contribution of any power of the X-variable to the nonlinearity of the data can be similarly tested by these means.
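A numerical route to the $Z_n$ can be sketched as follows. This is a hedged illustration rather than a prescribed procedure: the function name `uncorrelating_offsets` and the X-values are our own, the polynomial coefficients come from the binomial expansion of $(X_i - Z)^n$, and `numpy.roots` is used in place of the closed-form formulas. Because `numpy.roots` may return complex roots, only the real ones are kept, so the returned array can be empty for some powers and data sets.

```python
import numpy as np
from math import comb


def uncorrelating_offsets(x, power):
    """Real roots Z of sum_i (x_i - mean) * (x_i - Z)**power = 0."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    # Coefficient of Z**k is C(n, k) * (-1)**k * sum(x**(n - k) * d), k = 0..n-1;
    # the Z**n term drops out because sum(d) = 0.
    ascending = [comb(power, k) * (-1) ** k * np.sum(x ** (power - k) * d)
                 for k in range(power)]
    roots = np.roots(ascending[::-1])       # numpy wants highest degree first
    roots = np.atleast_1d(roots)
    return roots[np.abs(roots.imag) < 1e-9].real


# Hypothetical X-values chosen only for illustration.
x = np.array([1.0, 2.0, 4.0, 7.0, 11.0, 12.0])

z2 = uncorrelating_offsets(x, 2)            # reproduces equation 66-A6
z4 = uncorrelating_offsets(x, 4)            # real root(s) of the cubic 66-A16
print("Z for squares:", z2)
print("Z for fourth powers:", z4)
for z in z4:
    print("correlation check:", np.corrcoef(x, (x - z) ** 4)[0, 1])
```

It is prudent to confirm each candidate root by recomputing the correlation, as the last loop does, before using it in a linearity test.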

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 20(4), 38–39 (2005).
2. Mark, H. and Workman, J., Spectroscopy 20(1), 56–59 (2005).
3. Mark, H. and Workman, J., Spectroscopy 20(3), 34–39 (2005).
4. Mark, H., Journal of Pharmaceutical and Biomedical Analysis 33, 7–20 (2003).
5. Arden, B.W., An Introduction to Digital Computing, 1st ed. (Addison-Wesley Publishing Co., Inc., Reading, MA, 1963).
6. Mark, H. and Workman, J., Spectroscopy 18(12), 106–111 (2003).
7. Anscombe, F.J., The American Statistician 27, 17–21 (1973).
8. Mark, H. and Workman, J., Spectroscopy 18(9), 25–28 (2003).
9. Daniel, C. and Wood, F., Fitting Equations to Data – Computer Analysis of Multifactor Data for Scientists and Engineers, 1st ed. (John Wiley & Sons, 1971).
10. Mark, H., Principles and Practice of Spectroscopic Calibration (John Wiley & Sons, New York, 1991).
11. Draper, N. and Smith, H., Applied Regression Analysis, 3rd ed. (John Wiley & Sons, New York, 1998).


67

Linearity in Calibration: Act III Scene V – Quantifying Nonlinearity

In Chapters 63–66 [1–4], we discussed shortcomings of current methods used to assess the presence of nonlinearity in data, and presented a new method that addresses those shortcomings. This new method is statistically sound, provides an objective means to determine if nonlinearity is present in the relationship between two sets of data, and is inherently suitable for implementation as a computer program. A shortcoming of the method presented is one that it has in common with virtually all statistical tests: while it provides a means of unambiguously and objectively determining the presence of nonlinearity, if we find that nonlinearity is present, it does not address the question of how much nonlinearity is present. This chapter therefore presents results from some computer experiments designed to assess a method of quantifying the amount of nonlinearity present in a data set, assuming that the test for the presence of nonlinearity has already been applied and has found that a measurable, statistically significant degree of nonlinearity exists.

The spectroscopic community, and indeed the chemical community at large, are not the only groups of scientists concerned with these issues. Other scientific disciplines also are concerned with ways to evaluate methods of chemical analysis. Notable among them are the pharmaceutical and the clinical chemistry communities. In those communities, considerations of the sort we are addressing are even more important, for at least two reasons:

1) These disciplines are regulated by governmental agencies, especially the Food and Drug Administration. In fact, it was considerations of the requirements of a regulatory agency that created the impetus for this series of chapters in the first place [1].

2) The second reason is what drives the whole effort of ensuring that everything that is done, is done “right”: an error in an analytical result can conceivably, in literal fact, cause illness or even death to occur.

Thus the clinical chemistry community has also investigated issues such as the linearity of the relationship between test results and actual chemical composition, and an interesting article provides the impetus for creating a method of assessing the degree of nonlinearity present in the relationship between two sets of data [5]. The basis for this calculation of the amount of nonlinearity is illustrated in Figure 67-1. In Figure 67-1a we see a set of data showing some nonlinearity between the test results and the actual values. If a straight line and a quadratic polynomial are both fit to the data, then the difference between the predicted values from the two curves gives a measure of the amount of nonlinearity. Figure 67-1a shows data subject to both random error and nonlinearity, and the different ways linear and quadratic polynomials fit the data.

Figure 67-1(a) An illustration of the method of measuring the amount of nonlinearity, showing hypothetical synthetic data to which each of the functions is fit. [Plot of Result versus Concentration, showing the data with the linear fit and the quadratic fit.]

As shown in Figure 67-1a, at any given point, there is a difference between the two functions which represents the difference between the Y-values corresponding to a given X-value. Figure 67-1b shows that irrespective of the random error of the data, the difference between the two functions depends only on the nature of the functions and can be calculated from the difference between the Y-values corresponding to each X-value. If there is no nonlinearity at all, then the two functions will coincide, and all the differences

Figure 67-1(b) The functions, without the data, showing the differences between the functions at two values of X. The circles show the value of the straight line, the crosses show the value of the quadratic function at the given values of X. [Plot of Result versus Concentration, showing the linear fit and the quadratic fit, with the values Xi and Xn marked on the Concentration axis.]

will be zero. Increasing amounts of nonlinearity will cause increasingly large differences between the values of the two functions corresponding to each X-value, and these can be used to calculate the nonlinearity. The calculation used is the calculation of the sum of squares of the differences [5]. This calculation is normally applied to situations where random variations are affecting the data, and, indeed, is the basis for many of the statistical tests that are applied to random data. However, the formalism of partitioning the sums of squares, which we have previously discussed [6] (also in [7], p. 81 in the first edition or p. 83 in the second edition), can be applied to data where the variations are due to systematic effects rather than random effects. The difference is that the usual statistical tests (t, χ², F, etc.) do not apply to variations from systematic causes because they do not follow the required statistical distributions. Therefore it is legitimate to perform the calculation, as long as we are careful how we interpret the results.

Performing the calculations on functions fitted to the raw data has another ramification: the differences, and therefore the sums of squares, will depend on the units that the Y-values are expressed in. It is preferable that functions with similar appearances give the same computed value of nonlinearity regardless of the scale. Therefore the sum-of-squares of the differences between the linear and the quadratic functions fitted to the data is divided by the sum-of-squares of the Y-values that fall on the straight line fitted to the data. This cancels the units, and therefore the dependency of the calculation on the scale.

A further consideration is that the value of the calculated nonlinearity will depend not only on the function that fits the data; we suspect that it will also depend on the distribution of the data along the X-axis. Therefore, for pedagogical purposes, here we will consider the situation for two common data distributions: the uniform distribution and the Normal (Gaussian) distribution.

Figure 67-2 presents some quadratic curves containing various amounts of nonlinearity. These curves represent data that were, of course, created synthetically. The purpose of generating these curves was for us to be able to compare the visual appearance of curves containing known amounts of nonlinearity with the numerical values for the various test parameters that describe the curves. Figure 67-2 represents data having a uniform distribution of X-values, although, of course, data with a different distribution of X-values would follow the same curves. The curves were generated as follows: 101 values of a uniformly distributed variable (used as the X-variable) were generated by creating a set of numbers from 0 to 1 at steps of 0.01. The Y-values for each curve were generated by calculating the Y-value from the corresponding X-value according to the following formula:

$$Y = X - kX^2 + kX \tag{67-1}$$

The parameter k in equation 67-1 induces a varying amount of nonlinearity in the curve. For the curves in Figure 67-2, k varied from 0 to 2 in steps of 0.2. The subtraction of the quadratic term in equation 67-1 gives the curves their characteristic of being convex upward, while adding the term kX back in ensures that all the curves, and the straight line, meet at zero and at unity. Table 67-1 presents the results of computing the linearity evaluation results for the curves shown in Figure 67-2, for the case of a uniform distribution of data along the

Figure 67-2 Curves illustrating varying amounts of nonlinearity. [Plot of the curves of equation 67-1 for X from 0 to 1, running from k = 0 (straight line) to k = 2.0; vertical axis from 0 to 1.2.]

X-axis. It presents the coefficients of the linear models (straight lines) fitted to the several curves of Figure 67-2, the coefficients of the quadratic model, the sum-of-squares of the differences between the fitted points from the two models, and the ratio of the sum-of-squares of the differences to the sum-of-squares of the Y-values that fall on the fitted straight line, which, as we said above, is the measure of nonlinearity. Table 67-1 also shows the value of the correlation coefficient between the linear fit and the quadratic fit to the data, and the square of the correlation coefficient.

In Table 67-1 we see an interesting result: the ratio of sums of squares we are using for the linearity measure is equal to 1 (unity) minus the square of the computed correlation coefficient value between the linear fit and the quadratic fit to the data. This should not surprise us. As noted above, the same formalisms that apply to random data can also be applied to data where the differences are systematic. Therefore, the equality we see here corresponds to the well-known property of sums of squares from any regression analysis, that from the analysis of variance of the regression results, the correlation coefficient is related to the sum-squared-error of the analysis in a similar way (see, for example, p. 17 in [8]).

It is also interesting to note that the coefficients of the models resulting from the calculations on the data (shown in Figure 67-2) are not the same as the original generating functions for the data. This is because the generating functions (from equation 67-1) are not the best-fitting functions (nor, as we shall see, are they the orthogonalized functions), which are what is used to create the models, and the predicted values from the models.

Since the correlation coefficient is an already-existing and known statistical function, why is there a need to create a new calculation for the purpose of assessing nonlinearity?
First, the correlation coefficient’s roots in Statistics direct the mind to the random aspects of the data that it is normally used for. In contrast, therefore, using the ratio of the sum of squares helps keep us reminded that we are dealing with a systematic effect whose magnitude we are trying to measure, rather than a random effect for which we want to ascertain statistical significance.
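The calculation described above can be sketched as follows for the uniformly distributed case. This is a minimal illustration under stated assumptions: the function name `nonlinearity_measure` is our own, and the denominator is taken as the sum of squares, about their mean, of the Y-values on the fitted straight line, which is our reading of the description in the text. The printed values are therefore not claimed to match every digit of Table 67-1.

```python
import numpy as np


def nonlinearity_measure(x, y):
    """Return (linear coeffs, quadratic coeffs, SS of differences, ratio)."""
    lin = np.polyfit(x, y, 1)               # [slope, intercept]
    quad = np.polyfit(x, y, 2)              # [b2, b1, b0]
    y_lin = np.polyval(lin, x)
    y_quad = np.polyval(quad, x)
    ss_diff = np.sum((y_quad - y_lin) ** 2)
    # Assumed normalisation: spread of the straight-line values about their mean.
    ss_line = np.sum((y_lin - y_lin.mean()) ** 2)
    return lin, quad, ss_diff, ss_diff / ss_line


x = np.linspace(0.0, 1.0, 101)              # 101 values from 0 to 1 in steps of 0.01
for k in np.linspace(0.0, 2.0, 11):         # k = 0, 0.2, ..., 2.0
    y = x - k * x**2 + k * x                # equation 67-1
    lin, quad, ss_diff, ratio = nonlinearity_measure(x, y)
    print(f"k={k:.1f}  slope={lin[0]:.3f}  intercept={lin[1]:.4f}  "
          f"b2={quad[0]:.2f}  SSdiff={ss_diff:.4f}  measure={ratio:.4f}")
```

Because the data contain no random error, the quadratic fit recovers the generating curve exactly, so the measure reflects only the systematic departure from the straight line.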

Table 67-1 Uniform data distribution

| k (in equation 67-1) | b1 (slope) for linear fit | b0 (intercept) for linear fit | b2 (quadratic term) for quadratic fit | b1 (linear slope) for quadratic fit | b0 (intercept) for quadratic fit | Sum-of-squares of diffs | Linearity measure (ratio of sums of squares) | Corr. coeff. | Square of corr. coeff. |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 0.2 | 1 | 0.033 | −0.2 | 1.2 | 0 | 0.0334 | 0.0026 | 0.9986 | 0.9972 |
| 0.4 | 1 | 0.066 | −0.4 | 1.4 | 0 | 0.0337 | 0.0107 | 0.9946 | 0.9892 |
| 0.6 | 1 | 0.099 | −0.6 | 1.6 | 0 | 0.2100 | 0.0242 | 0.9879 | 0.9761 |
| 0.8 | 1 | 0.132 | −0.8 | 1.8 | 0 | 0.3735 | 0.0430 | 0.9789 | 0.9583 |
| 1.0 | 1 | 0.165 | −1.0 | 2.0 | 0 | 0.5836 | 0.0673 | 0.9676 | 0.9363 |
| 1.2 | 1 | 0.198 | −1.2 | 2.2 | 0 | 0.8403 | 0.0969 | 0.9543 | 0.9108 |
| 1.4 | 1 | 0.231 | −1.4 | 2.4 | 0 | 1.1438 | 0.1319 | 0.9393 | 0.8824 |
| 1.6 | 1 | 0.264 | −1.6 | 2.6 | 0 | 1.4940 | 0.1723 | 0.9229 | 0.8517 |
| 1.8 | 1 | 0.297 | −1.8 | 2.8 | 0 | 1.8908 | 0.2180 | 0.9052 | 0.8195 |
| 2.0 | 1 | 0.33 | −2.0 | 3.0 | 0 | 2.3344 | 0.2692 | 0.8866 | 0.7862 |

Secondly, as a measure of nonlinearity, the calculation conforms more closely to that concept than the correlation coefficient does. As a contrast, we can consider terms such as precision and accuracy, where “high precision” and “high accuracy” mean data with small values of whatever measure is used (such as the standard deviation), while “low precision” and “low accuracy” mean large values of the measure. Thus, for those two characteristics, the measured value changes in opposition to the concept. If we were to use the correlation coefficient calculation as the measure of nonlinearity, we would have the same situation. However, by defining the “linearity” calculation the way we did, the calculation now runs parallel to the concept: a calculated value of zero means “no nonlinearity” while increasing values of the calculation correspond to increasing nonlinearity.

Another interesting comparison is between the coefficients for the functions representing the best-fitting models for the data and the coefficients for the functions that result from performing the linearity test as described in the previous chapter [4]. We have not looked at these before since they are not directly involved in the linearity test. Now, however, we consider them for their pedagogic interest. These coefficients, for the case of testing a quadratic nonlinearity of the data from Figure 67-2, are listed in Table 67-2. We note that the coefficients for the quadratic terms are the same in both cases. However, the best-fitting functions have a constant intercept and varying slopes, while the functions based on the orthogonalized quadratic term have a constant slope and varying intercepts.

We now take a look at the linearity values obtained when the X-data is Normally distributed. The nonlinearity used is the same as we used above for the case of uniformly distributed data, and the same diagram (Figure 67-2) applies, so we need not reproduce it. The difference is that the X-data is Normally distributed, so that there are more samples at X = 0.5 than at the extremes of the range of Figure 67-2, the falloff varying appropriately. The standard deviation of the X-values used was 0.2, so that the ends of the range corresponded to ±2.5 standard deviations. Again, synthetic data at the same 101 values of X were generated. In this case, however, multiple data at each X-value were created, the number of data at each X-value being proportional to the value of the Normal distribution corresponding to that X-value. The total number of data points generated, therefore, was 5,468.

Table 67-2 Coefficients for orthogonalized functions

| k (in equation 67-1) | b2 (quadratic term) | b1 (linear slope) | b0 (intercept) |
|---|---|---|---|
| 0 | 0 | 1 | 0 |
| 0.2 | −0.2 | 1 | 0.05 |
| 0.4 | −0.4 | 1 | 0.1 |
| 0.6 | −0.6 | 1 | 0.15 |
| 0.8 | −0.8 | 1 | 0.2 |
| 1.0 | −1.0 | 1 | 0.25 |
| 1.2 | −1.2 | 1 | 0.3 |
| 1.4 | −1.4 | 1 | 0.35 |
| 1.6 | −1.6 | 1 | 0.4 |
| 1.8 | −1.8 | 1 | 0.45 |
| 2.0 | −2.0 | 1 | 0.5 |
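The pattern in Table 67-2 (constant slope, intercept growing with k) can be reproduced with a short sketch. The names `z` and `orthogonalized_fit` are our own, and the fit is an ordinary least-squares fit to the terms 1, X and (X − Z)², with Z computed from equation 66-A6 of the previous chapter; for these symmetric X-values Z is 0.5.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 101)              # same uniform X-values as before
d = x - x.mean()
z = np.sum(x**2 * d) / (2.0 * np.sum(x * d))    # equation 66-A6


def orthogonalized_fit(k):
    """Fit Y = b0 + b1*X + b2*(X - Z)**2 to the curve of equation 67-1."""
    y = x - k * x**2 + k * x
    design = np.column_stack([np.ones_like(x), x, (x - z) ** 2])
    b0, b1, b2 = np.linalg.lstsq(design, y, rcond=None)[0]
    return b0, b1, b2


for k in np.linspace(0.0, 2.0, 11):
    b0, b1, b2 = orthogonalized_fit(k)
    print(f"k={k:.1f}  b2={b2:+.2f}  b1={b1:.2f}  b0={b0:.3f}")
```

Algebraically, equation 67-1 can be rewritten as Y = 0.25k + X − k(X − 0.5)², which is why the intercept equals k/4 while the slope stays at unity.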


We can compare the values in Table 67-3 with those in Table 67-1: the coefficients of the models are almost the same. The coefficients for the quadratic model are, unsurprisingly, identical in all cases, since the data values are identical and there is no random error. The main difference in the linear model is the value of the intercept, reflecting the higher average value of the Y-data resulting from the center of the curves being more heavily weighted. The sums-of-squares are of necessity larger, simply because there are more data points contributing to this sum.

The interesting (and important) difference is in the values for the ratio of sums-of-squares, which is the nonlinearity measure. As we see, at small values of nonlinearity (i.e., small values of k) the values for the nonlinearity are almost the same. As k increases, however, the value of the nonlinearity measure decreases for the case of Normally distributed data, as compared to the uniformly distributed data, and the discrepancy between the two gets greater as k continues to increase. In retrospect, this should also not be surprising, since in the Normally distributed case, more data is near the center of the plot, and therefore in a region where the local nonlinearity is smaller than the nonlinearity over the full range. Therefore the Normally distributed data is less subject to the effects of the nonlinearity at the wings, since less of the data is there.

As a quantification of the amount of nonlinearity, we see that when we compare the values of the nonlinearity measure between Tables 67-1 and 67-3, they differ. This indicates that the test is sensitive to the distribution of the data. Furthermore, the disparity increases as the amount of curvature increases. Thus this test, as it stands, is not completely satisfactory, since the test value does not depend solely on the amount of nonlinearity, but also on the data distribution.
In our next chapter we will consider a modification of the test that will address this issue.
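The dependence on the data distribution can be illustrated with a sketch that repeats the calculation for Normally weighted X-values. Several details here are our own assumptions and are not taken from the text: the function names, the normalisation of the measure (sum of squares of the straight-line values about their mean), and in particular the scale factor of 100 used to turn the Normal density into integer replicate counts, so the total number of points will not equal the 5,468 quoted above and the values are not claimed to reproduce Table 67-3.

```python
import numpy as np


def curve(x, k):
    """Equation 67-1."""
    return x - k * x**2 + k * x


def linear_intercept_and_measure(x, y):
    """Return (intercept of the straight-line fit, nonlinearity measure)."""
    lin = np.polyfit(x, y, 1)
    quad = np.polyfit(x, y, 2)
    y_lin = np.polyval(lin, x)
    y_quad = np.polyval(quad, x)
    ss_diff = np.sum((y_quad - y_lin) ** 2)
    ss_line = np.sum((y_lin - y_lin.mean()) ** 2)   # assumed normalisation
    return lin[1], ss_diff / ss_line


x_uniform = np.linspace(0.0, 1.0, 101)

# Replicate each X-value in proportion to a Normal density (mean 0.5, SD 0.2).
density = np.exp(-0.5 * ((x_uniform - 0.5) / 0.2) ** 2)
counts = np.rint(100 * density).astype(int)         # scale of 100 is an assumption
x_normal = np.repeat(x_uniform, counts)

k = 2.0
uniform_result = linear_intercept_and_measure(x_uniform, curve(x_uniform, k))
normal_result = linear_intercept_and_measure(x_normal, curve(x_normal, k))
print("uniform: intercept=%.4f  measure=%.4f" % uniform_result)
print("normal:  intercept=%.4f  measure=%.4f" % normal_result)
```

With the center of the range weighted more heavily, the intercept of the straight line rises and the measure falls, which is the qualitative behavior described in the text.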

Table 67-3 Normal data distribution

| k (in equation 67-1) | b1 (slope) for linear fit | b0 (intercept) for linear fit | b2 (quadratic term) for quadratic fit | b1 (linear slope) for quadratic fit | b0 (intercept) for quadratic fit | Sum-of-squares of diffs | Linearity measure (ratio of sums of squares) |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 0.2 | 1 | 0.0414 | −0.2 | 1.2 | 0 | 0.6000 | 0.0025 |
| 0.4 | 1 | 0.0829 | −0.4 | 1.4 | 0 | 2.3996 | 0.0102 |
| 0.6 | 1 | 0.1243 | −0.6 | 1.6 | 0 | 5.3991 | 0.0230 |
| 0.8 | 1 | 0.1658 | −0.8 | 1.8 | 0 | 9.5984 | 0.0410 |
| 1.0 | 1 | 0.2072 | −1.0 | 2.0 | 0 | 14.997 | 0.0641 |
| 1.2 | 1 | 0.2487 | −1.2 | 2.2 | 0 | 21.596 | 0.0923 |
| 1.4 | 1 | 0.2901 | −1.4 | 2.4 | 0 | 29.395 | 0.1257 |
| 1.6 | 1 | 0.3316 | −1.6 | 2.6 | 0 | 38.393 | 0.1642 |
| 1.8 | 1 | 0.3730 | −1.8 | 2.8 | 0 | 48.592 | 0.2078 |
| 2.0 | 1 | 0.4145 | −2.0 | 3.0 | 0 | 59.990 | 0.2566 |


REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 20(1), 56–59 (2005).
2. Mark, H. and Workman, J., Spectroscopy 20(3), 34–39 (2005).
3. Mark, H. and Workman, J., Spectroscopy 20(4), 38–39 (2005).
4. Mark, H. and Workman, J., Spectroscopy 20(9), 26–35 (2005).
5. Kroll, M.H. and Emancipator, K., Clinical Chemistry 39(3), 405–413 (1993).
6. Workman, J. and Mark, H., Spectroscopy 3(3), 40–42 (1988).
7. Mark, H. and Workman, J., Statistics in Spectroscopy, 1st ed. (Academic Press, New York, 1991).
8. Daniel, C. and Wood, F., Fitting Equations to Data – Computer Analysis of Multifactor Data for Scientists and Engineers, 1st ed. (John Wiley & Sons, New York, 1971).

68

Linearity in Calibration: Act III Scene VI – Quantifying Nonlinearity, Part II, and a News Flash

In Chapters 63 through 67 [1–5], we devised a test for the amount of nonlinearity present in a set of comparative data (e.g., as are created by any of the standard methods of calibration for spectroscopic analysis), and then discovered a flaw in the method. The concept of a measure of nonlinearity that is independent of the units that the X and Y data have is a good one. The flaw is that the nonlinearity measurement depends on the distribution of the data; uniformly distributed data will provide one value, Normally distributed data will provide a different value, randomly distributed data (i.e., what is commonly found in “real” data sets) will give still a different value, and so forth, even if the underlying relationship between the pairs of values is the same in all cases. “Real” data, in fact, may not follow any particular describable distribution at all. Or the data may not be sufficient to determine what distribution it does follow, if any.

But does that matter? At the point we have reached in our discussion, we have already determined that the data under investigation does indeed show a statistically significant amount of nonlinearity, and we have developed a way of characterizing that nonlinearity in terms of the coefficients of the linear and quadratic contributions to the functional form that describes the relationship between the X and Y values. Our task now is to come up with a way to quantify the amount of nonlinearity the data exhibits, independent of the scale (i.e., units) of either variable, and even independent of the data itself. Our method of addressing this task is not unique; there are other ways to reach the goal. But we will base our solution on the methodology we have already developed.

We do this by noting that the first condition is met by converting the nonlinear component of the data to a dimensionless number (i.e., a statistic), akin to but different from the correlation coefficient, as we showed in our previous chapter, first published as [5]. The second condition can be met by simply ignoring the data itself, once we have reached this point. What we need is a standard way to express the data so that when the statistic is computed, the standard data expression will give rise to a given value of the statistic, regardless of the nature of the original data.

For this purpose, then, it would suffice to replace the original data with a set of synthetic data with the necessary properties. What are those properties? The key properties comprise the number of data values, the range of the data values and their distribution. The range of the synthetic data we want to generate should be such that the X-values have the same range as the original data. The reason for this is obvious: when we apply the empirically derived quadratic function (found from the regression) to the data, to compute the Y-values, those should fall on the same line, and in the same relationship to the X-values, as the original data did. Choosing the distribution is a little more nebulous. However, a uniform distribution is not only easy to compute, but it also will neither go outside the specified range nor will


the range change with the number of samples, as data following other distributions might (see, for example, reference [6], or Chapter 6 in [7], where we discussed the relationship between the range and the standard deviation for the Normal distribution when the number of data differ, although our discussion was in a different context). Therefore, in the interest of having the range and the nonlinearity measure be independent of the number of readings, we should generate data following a uniform distribution. The number of data points to generate in order to get an optimum value for the statistic is not obvious. Intuition indicates that the value of the statistic may very well be a function of the number of data points used in its calculation. At first glance, this would also seem to be a “showstopper” in the use of this statistic for the purpose of quantifying nonlinearity. However, intuition also indicates that even so, use of “sufficiently many” data points will give a stable value, since “sufficiently many” eventually becomes an approximation to “infinity”, and therefore even in such a case will at least tend toward an asymptotic value, as more and more data points are used. Since we have already extracted the necessary information from the actual data itself, computations from this point onward are simply a computer exercise, needing no further input from the original data set. Therefore, in fact, the number of points to generate is a consideration that itself needs to be investigated. We do so by generating data with controlled amounts of nonlinearity as we did previously [5] and filling in the range of the X-values with varying numbers of data points (uniformly spaced), computing the corresponding Y -values (according to the computed values for the coefficients of the quadratic equation) and then the statistic we described [5]. 
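The procedure just described, filling the range with N uniformly spaced points and recomputing the statistic, can be sketched as follows. The function names are our own, the curve of equation 67-1 stands in for the fitted quadratic, and the statistic uses the same assumed normalisation as before (sum of squares of the straight-line values about their mean). The absolute values therefore depend on that assumption and are not claimed to reproduce any tabulated values; what the sketch shows is the approach to an asymptotic value as N grows.

```python
import numpy as np


def nonlinearity_measure(x, y):
    """Ratio of SS(quadratic fit - linear fit) to SS of the linear fit about its mean."""
    lin = np.polyfit(x, y, 1)
    quad = np.polyfit(x, y, 2)
    y_lin = np.polyval(lin, x)
    y_quad = np.polyval(quad, x)
    return np.sum((y_quad - y_lin) ** 2) / np.sum((y_lin - y_lin.mean()) ** 2)


def measure_for_n_points(k, n, x_low=0.0, x_high=1.0):
    """Statistic for n uniformly spaced synthetic points spanning the data range."""
    x = np.linspace(x_low, x_high, n)
    y = x - k * x**2 + k * x                # equation 67-1 of Chapter 67
    return nonlinearity_measure(x, y)


sizes = (10, 100, 1000, 10000, 100000)
for k in (0.2, 1.0, 2.0):
    values = [measure_for_n_points(k, n) for n in sizes]
    print(f"k={k}: " + "  ".join(f"N={n}: {v:.5f}" for n, v in zip(sizes, values)))
```

Under these assumptions the value for k = 1 on the range 0 to 1 tends toward 1/15 as N increases, and the change between successive columns shrinks steadily.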
We performed this computation for several different combinations of the number of data points generated and the value of k, using the nonlinearity term generator from equation 67-1 found in Chapter 67, and present the results in Table 68-1. Although not shown, similar computations were performed for 200,000 and 1,000,000 points. There was no further change in any of the entries, compared to the column corresponding to 100,000 points. As we can see, the nonlinearity value converges to a limit for each value of k as the number of points used to calculate it increases. Furthermore, it converges more

Table 68-1 Table of computed nonlinearity values for varying numbers of simulated samples

| k (from equation 67-1) | N = 10 | N = 100 | N = 500 | N = 1000 | N = 2000 | N = 5000 | N = 10000 | N = 100000 |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0.1 | 0.0045 | 0.0036 | 0.0035 | 0.0035 | 0.0035 | 0.0035 | 0.0035 | 0.0035 |
| 0.4 | 0.0181 | 0.0142 | 0.0139 | 0.0139 | 0.0138 | 0.0138 | 0.0138 | 0.0138 |
| 0.6 | 0.0408 | 0.0320 | 0.0313 | 0.0312 | 0.0312 | 0.0311 | 0.0311 | 0.0311 |
| 0.8 | 0.0725 | 0.0570 | 0.0556 | 0.0555 | 0.0554 | 0.0553 | 0.0553 | 0.0553 |
| 1.0 | 0.1133 | 0.0890 | 0.0869 | 0.0867 | 0.0866 | 0.0865 | 0.0865 | 0.0865 |
| 1.2 | 0.1632 | 0.1282 | 0.1252 | 0.1248 | 0.1246 | 0.1245 | 0.1245 | 0.1245 |
| 1.4 | 0.2221 | 0.1745 | 0.1704 | 0.1699 | 0.1697 | 0.1695 | 0.1695 | 0.1695 |
| 1.6 | 0.2901 | 0.2279 | 0.2226 | 0.2219 | 0.2216 | 0.2214 | 0.2213 | 0.2213 |
| 1.8 | 0.3671 | 0.2284 | 0.2817 | 0.2809 | 0.2805 | 0.2802 | 0.2801 | 0.2800 |
| 2.0 | 0.4532 | 0.3560 | 0.3478 | 0.3468 | 0.3462 | 0.3459 | 0.3458 | 0.3457 |

slowly when the amount of nonlinearity in the data increases. The results in Table 68-1 are presented to four figures, and to require that degree of convergence means that fully 10,000 points must be generated if the value of k approaches two (or more). Of course, if k is much above two, it might require even more points to achieve this degree of exactness in the convergence. For k = 0.1, however, this same degree of convergence is achieved with only 500 points. Thus, the user must make a trade-off between the amount of computation performed and the exactness of the calculated nonlinearity measure, taking into account the actual amount of nonlinearity in the data. However, if sufficient points are used, the results are stable and depend only on the amount of nonlinearity in the original data set.

Or need the user do anything of the sort? In fact, our computer exercise is just an advanced form of a procedure that we all learned to do in second-term calculus: evaluate a definite integral by successively better approximations, the improvement coming via exactly the route we took, using smaller and smaller intervals at which to perform the numerical integration. By computing the value of a definite integral, we are essentially taking the computation to the limit of an infinite number of data points. Generating the definite integral to evaluate is in fact a relatively simple exercise at this point, since the underlying functions are algebraic. We recall that the pertinent quantities are

1) The sum of squares of the differences between the linear and the quadratic lines fit to the data

2) The sum of squares of the Y-data linearly related to the X-data.

As we recall from the previous chapter [5], the nonlinearity measure we devised equals the first divided by the second. Let us now develop the formula for this. We will use a subscripted small “a” for the coefficients of the quadratic equation, and a subscripted small “k” for those of the linear equation.
Thus the equation describing the quadratic function fitted to the data is

$$Y_Q = a_0 + a_1 X + a_2 X^2 \tag{68-4}$$

The equation describing the linear function fitted to the data is

$$Y_L = k_0 + k_1 X \tag{68-13}$$

where the $a_i$ and the $k_i$ are values obtained by the least-squares fitting of the quadratic and linear fitting functions, respectively. The differences, then, are represented by

$$D = Y_Q - Y_L = a_0 + a_1 X + a_2 X^2 - (k_0 + k_1 X) \tag{68-14}$$

$$D = (a_0 - k_0) + (a_1 - k_1) X + a_2 X^2 \tag{68-15}$$

and the squares of the differences are

$$D^2 = \left[ (a_0 - k_0) + (a_1 - k_1) X + a_2 X^2 \right]^2 \tag{68-16}$$

which expands to

$$D^2 = (a_0 - k_0)^2 + 2(a_0 - k_0)(a_1 - k_1) X + (a_1 - k_1)^2 X^2 + 2 a_2 (a_0 - k_0) X^2 + 2 a_2 (a_1 - k_1) X^3 + a_2^2 X^4 \tag{68-17}$$

We can simplify it slightly to a regular polynomial in X:

$$D^2 = (a_0 - k_0)^2 + 2(a_0 - k_0)(a_1 - k_1) X + \left[ 2 a_2 (a_0 - k_0) + (a_1 - k_1)^2 \right] X^2 + 2 a_2 (a_1 - k_1) X^3 + a_2^2 X^4 \tag{68-18}$$
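As a quick check on the algebra, the collected coefficients of equation 68-18 can be compared against a brute-force polynomial multiplication of equation 68-15 by itself. The following is only a minimal Python sketch: the function name and the numerical values of the a and k coefficients are invented for illustration and do not come from the text.

```python
import numpy as np
from numpy.polynomial import polynomial as P


def squared_difference_coeffs(a, k):
    """Coefficients of D^2 in equation 68-18, lowest power of X first.

    a = (a0, a1, a2) are the quadratic-fit coefficients,
    k = (k0, k1) are the linear-fit coefficients.
    """
    a0, a1, a2 = a
    k0, k1 = k
    p = a0 - k0          # constant term of D
    q = a1 - k1          # coefficient of X in D
    return np.array([
        p * p,               # X^0
        2 * p * q,           # X^1
        2 * a2 * p + q * q,  # X^2
        2 * a2 * q,          # X^3
        a2 * a2,             # X^4
    ])


# Hypothetical coefficient values, chosen only to exercise the formula.
a = (0.20, 1.10, -0.30)
k = (0.25, 0.80)

# D itself (equation 68-15), squared by direct polynomial multiplication.
d = np.array([a[0] - k[0], a[1] - k[1], a[2]])
direct = P.polymul(d, d)

print(np.allclose(squared_difference_coeffs(a, k), direct))
```

The five returned values play the role of the constant coefficients of the successive powers of X in the squared difference.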

The denominator term of the required ratio is the square of the linear Y term, according to equation 68-15. The square involved is then:

$$Y_L^2 = (Y - \bar{Y})^2 \tag{68-19}$$

and substituting for each Y the expression in X:

$$Y_L^2 = \left[ (k_0 + k_1 X) - (k_0 + k_1 \bar{X}) \right]^2 \tag{68-20}$$

With a little algebra this can also be put into the form of a regular polynomial in X:

$$Y_L^2 = (k_1 X - k_1 \bar{X})^2 \tag{68-21}$$

$$Y_L^2 = k_1^2 X^2 - 2 k_1^2 X \bar{X} + k_1^2 \bar{X}^2 \tag{68-22}$$

which, unsurprisingly, equals

$$Y_L^2 = k_1^2 (X - \bar{X})^2 \tag{68-23}$$

although we will find equation 68-22 more convenient. Equations 68-18 and 68-22 represent the quantities whose sums form the required measurement. They are each a simple polynomial in X, whose definite integral is to be evaluated between $X_{low}$ and $X_{high}$, the ends of the range of the data values, in order to calculate the respective sums-of-squares. Despite the apparently complicated form of the coefficients of the various powers of X in equation 68-18, once the $a_i$ and $k_i$ have been determined as described in our previous chapter, they are constants. Therefore the various coefficients of the powers of X are also constants, and may be replaced by a new label. We can use a subscripted small “c” for these; then equation 68-18 becomes

$$D^2 = c_0 + c_1 X + c_2 X^2 + c_3 X^3 + c_4 X^4 \tag{68-24}$$

Put into this form it is clear that forming the definite integral of this function (to form the sum of squares) is relatively straightforward; we merely need to apply the formula for the integral of a power of a variable to each term in equation 68-24. We recall from elementary calculus that the integral of a power of a variable is

$$\int X^n \, dX = \frac{X^{n+1}}{n+1} \tag{68-25}$$

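Applying this formula to equation 68-24, we achieve

$$SS_D = c_0 \int_{X_{low}}^{X_{high}} 1 \, dX + c_1 \int_{X_{low}}^{X_{high}} X \, dX + c_2 \int_{X_{low}}^{X_{high}} X^2 \, dX + c_3 \int_{X_{low}}^{X_{high}} X^3 \, dX + c_4 \int_{X_{low}}^{X_{high}} X^4 \, dX \tag{68-26}$$

$$SS_D = c_0 \Big[ X \Big]_{X_{low}}^{X_{high}} + c_1 \left[ \frac{X^2}{2} \right]_{X_{low}}^{X_{high}} + c_2 \left[ \frac{X^3}{3} \right]_{X_{low}}^{X_{high}} + c_3 \left[ \frac{X^4}{4} \right]_{X_{low}}^{X_{high}} + c_4 \left[ \frac{X^5}{5} \right]_{X_{low}}^{X_{high}} \tag{68-27}$$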

where the various $c_i$ represent the calculation based on the corresponding coefficients of the quadratic and linear fitting functions, as indicated in equation 68-18. The denominator term for the ratio is derived from equation 68-22 in similar fashion; the result is

$$SS_Y = k_1^2 \left[ \frac{X^3}{3} \right]_{X_{low}}^{X_{high}} - 2 k_1^2 \bar{X} \left[ \frac{X^2}{2} \right]_{X_{low}}^{X_{high}} + k_1^2 \bar{X}^2 \Big[ X \Big]_{X_{low}}^{X_{high}} \tag{68-28}$$

And the measure of nonlinearity is then the result of equation 68-27 divided by equation 68-28.
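The closed-form ratio can be compared with the point-by-point summation it replaces. The sketch below is a hedged illustration rather than the authors' program: the function names, the synthetic noise-free test curve, and the choice of 20,001 evenly spaced points are all assumptions made here. It fits the quadratic and linear functions with `np.polyfit`, evaluates equation 68-27 divided by equation 68-28, and prints that alongside the ratio of the two discrete sums of squares.

```python
import numpy as np


def integrate_polynomial(coeffs, x_low, x_high):
    """Definite integral of sum(c_n * X**n), term by term (equation 68-25)."""
    total = 0.0
    for n, c in enumerate(coeffs):
        total += c * (x_high ** (n + 1) - x_low ** (n + 1)) / (n + 1)
    return total


def nonlinearity_measure(a, k, x_low, x_high, x_bar):
    """Equation 68-27 divided by equation 68-28."""
    a0, a1, a2 = a
    k0, k1 = k
    p, q = a0 - k0, a1 - k1
    # Coefficients c0..c4 of equation 68-24 (from equation 68-18).
    c = [p * p, 2 * p * q, 2 * a2 * p + q * q, 2 * a2 * q, a2 * a2]
    ss_d = integrate_polynomial(c, x_low, x_high)
    # Equation 68-22 as a polynomial in X, lowest power first.
    y = [k1**2 * x_bar**2, -2 * k1**2 * x_bar, k1**2]
    ss_y = integrate_polynomial(y, x_low, x_high)
    return ss_d / ss_y


# Synthetic, noise-free curved data (hypothetical, for illustration only).
x = np.linspace(0.0, 1.0, 20001)
y = 0.2 + x + 0.5 * x**2

a2, a1, a0 = np.polyfit(x, y, 2)   # np.polyfit returns highest power first
k1, k0 = np.polyfit(x, y, 1)

closed_form = nonlinearity_measure((a0, a1, a2), (k0, k1), x[0], x[-1], x.mean())

# The same ratio formed from discrete sums over the generated points.
y_quad = a0 + a1 * x + a2 * x**2
y_lin = k0 + k1 * x
summed = np.sum((y_quad - y_lin) ** 2) / np.sum((y_lin - y_lin.mean()) ** 2)

print(closed_form, summed)
```

Because both sums of squares scale with the number of points in the same way, that factor cancels in the ratio, which is why the integral form can stand in for the summation.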

NEWS FLASH!!

It will be helpful at this point to again review the background of why (non)linearity is important, in order to understand why we bring up the “News Flash”. In the context of multivariate spectroscopic calibration, for many years most of the attention was on the issues of noise effects (noise and error in both the X (spectral) and the Y (constituent values) variables). The only attention paid to the relation between them was the effect of the calibration algorithm used, and how it affected and responded to the noise content of the data. There is another key relationship between the X and the Y data, and that is the question of whether the relationship is linear, but that is not addressed. In fact, hardly anybody talks (or writes) about it even though it is probably the only remaining major effect that influences the behavior of calibration models. A thorough understanding of it would probably allow us to solve most of the remaining problems in creating calibrations. A nonlinear relation can potentially cause larger errors than any random phenomenon affecting a data set (see, for example, reference [8]). The question of linearity inevitably interacts with the distribution of constituent values in the samples (not only of the analyte but of the interferences as well; see the referenced Applied Spectroscopy article).

I first got my attention turned onto this issue back when MLR was king of the NIR hill, and we could not understand how the wavelength selection process worked, and why it picked certain wavelengths that appeared to have no special character. The Y-error was the same for all sets of wavelengths. The X-error might vary somewhat from wavelength to wavelength, but the precision of the NIR instruments was so good that the maximum differences in random absorbance error simply could not account for the variations in the wavelengths chosen. Eventually the realization arose that the only explanation that
was never investigated was that a wavelength selection algorithm would find those wavelengths where the fit (in terms of linearity of the absorbance versus constituent concentrations) of the data could change far more than any possible change in random error. Considerations of nonlinearity potentially explain lots of things: the inability to extrapolate models, the “unknown” differences between instruments that prevent calibration transfer, and so on. Recently we wrote some chapters that showed that PCR and PLS are also subject to the effects of nonlinearity and are not simply correctable (see Chapters 29–33 in this book, as well as references [9–14]). So there is a big effect here that hardly anybody is paying attention to, at least not insofar as they are quantitatively evaluating the effect on calibration models. I think this is key, because it is inevitably one of the major causes of error in the X variable (at least, as long as the X-variable represents instrument readings).

Now here is the news flash: we recently became aware that Philip Brown has written a paper [15] nominally dealing with wavelength selection methods for calibration using the MLR algorithm (more about this paper later). We are old MLR advocates (since 1976, when we first got involved with NIR and MLR was the only calibration algorithm used in the NIR world). But what has happened is that until fairly recently the role of nonlinearity in the selection of wavelengths for MLR, as well as other effects on the modeling process, have been mostly ignored (and only partly because MLR itself has been mostly ignored until fairly recently).
For a long time, however, there was much confusion in the NIR world over the question of why computerized wavelength searches would often select wavelengths on the side of absorption bands instead of at the peaks (or in other unexpected places), and manual selection of wavelengths at absorption peaks would produce models that did not perform as well as when the wavelengths on the side of the peaks were used. This difference existed in calibration, validation, and in long-term usage. It also was (and still is, for that matter) independent of the methods of wavelength selection used. This behavior puzzled the NIR community for a long time, especially since it was well-known that a wavelength on the side of an absorbance band would be far more sensitive to small changes in the actual wavelength measured by an instrument (due to non-repeatability of the wavelength selection portion of the instrument) than a wavelength at or near the peak, and we expected that random error from that source should dominate the wavelength selection process. In hindsight, of course, we recognize that if a nonlinear effect exists in the data, it will implicitly affect the modeling process, regardless of whether the nonlinearity is recognized or not. There are other “mysteries” in NIR (and other applications of chemometrics) that nonlinearity can also explain. For example, as indicated above, one is the difficulty of transferring calibration models between instruments, even of the same type. Where would our technological world be if a manufacturer of, say, rulers could not reliably transfer the calibration of the unit of length from one ruler to the next? But here is what Philip Brown did: He took a different tack on the question. He set up and performed an experiment wherein he took different sugars (fructose, glucose, and sucrose) and made up solutions by dissolving them in water, each at five different concentration levels, and made solutions using all combinations of concentrations. 
That gave an experimental design with 125 samples. He then measured the spectra of all of those samples. Since the samples were all clear solutions there were no extraneous effects due to optical scatter. The nifty thing he then did was this: he applied an ANOVA to the data, to determine which wavelengths were minimally affected by nonlinearity. We have discussed
ANOVA in these chapters also, back when the series was still called “Statistics in Spectroscopy” [16–19] although, to be sure, our discussions were at a fairly elementary level. The experiment that Philip Brown did is eminently suitable for that type of computation. The experiment was formally a three-factor multilevel full-factorial design. Any nonlinearity in the data will show up in the analysis as what statisticians call an “interaction” term, which can even be tested for statistical significance. He then used the wavelengths of maximum linearity to perform calibrations for the various sugars. We will discuss the results below, since they are at the heart of what makes this paper important.

This paper by Brown is very welcome. It deals with four-component sugar solutions (water being one of the components, even though it is ignored in the analysis, which, by the way, may be a mistake; we will also discuss that further below). The use of this experimental design is a good way to analyze the various effects he investigates, but is unfortunately not applicable to the majority of sample types that are of interest in “real” applications, where neither experimental designs nor non-scattering samples are available or can be generated. In fact, it can be argued that the success of NIR as an analytical method is largely due to the fact that it can be applied to all those situations of interest where neither of those characteristics exists (in addition to the reasons usually given about it being non-destructive, etc.). Nevertheless, we must recognize that in trying to uncover new information about a technique, “walking before we run” is necessary and desirable, and this paper should be taken in that spirit.
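To make the idea of an “interaction” term in a three-factor full-factorial design concrete, here is a small Python sketch. It is not Brown's procedure and uses no data from his paper: the concentration levels, the absorptivities, and the saturating response used to mimic a nonlinear instrument are all hypothetical. For a single wavelength it removes the three main effects from the 5 × 5 × 5 table of responses and reports the share of the total sum of squares that is left over, which lumps together all two-factor and three-factor interactions.

```python
import numpy as np


def non_additive_fraction(response):
    """Share of total sum of squares left after removing the three main effects.

    `response` has shape (levels, levels, levels): one value per combination
    of the three factor levels in a full-factorial design.
    """
    grand_mean = response.mean()
    # Main effect of each factor: level means minus the grand mean.
    effect_0 = response.mean(axis=(1, 2)) - grand_mean
    effect_1 = response.mean(axis=(0, 2)) - grand_mean
    effect_2 = response.mean(axis=(0, 1)) - grand_mean
    additive = (grand_mean
                + effect_0[:, None, None]
                + effect_1[None, :, None]
                + effect_2[None, None, :])
    ss_total = np.sum((response - grand_mean) ** 2)
    ss_left = np.sum((response - additive) ** 2)
    return ss_left / ss_total


# Five hypothetical concentration levels for each of three components.
levels = np.linspace(0.0, 0.4, 5)
c1, c2, c3 = np.meshgrid(levels, levels, levels, indexing="ij")

# Additive (Beer's-law-like) response versus a saturating one.
beer_law = 1.0 * c1 + 0.7 * c2 + 0.4 * c3
saturating = 1.0 - np.exp(-beer_law)

print(c1.size)
print(non_additive_fraction(beer_law), non_additive_fraction(saturating))
```

In this sketch the additive response leaves essentially nothing behind, while the saturating response leaves a clearly nonzero interaction share. Note that curvature acting on one factor alone would be absorbed by that factor's main effect in this simple decomposition, so the residual here measures non-additivity among the components.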
Brown's paper is especially welcome because it explicitly considers and directly attacks the question of nonlinearity, which is a favorite topic of ours (in case you couldn’t tell), largely because it has mostly been previously ignored as a contributor to the error in calibration modeling, and because the effects occur in very subtle ways, which is largely what has hidden this phenomenon from our view.

Not that questions of nonlinearity had been completely ignored in the past. Not only had we taken an interest as far back as 1988 [8], but others in the chemometric community have also; see, for example, [20], whose author was able to successfully extrapolate a model despite nonlinearity in the data. The problem with these efforts is that they are idiosyncratic to the data set being analyzed. Whether a particular calibration can be extrapolated or not is beside the point. Missing is a general method to determine whether a model based on a given data set will or will not be extrapolatable. Brown’s paper demonstrates a novel approach to the problem, which shows promise for being the basis of that type of general method, and for that reason is new and exciting.

Overall, Brown’s paper is a wonderful paper, despite the fact that there are some criticisms. The fact that it directly attacks the issue of nonlinearity in NIR is one reason to be so pleased to see it, but the other main reason is that it uses well-known and well-proven statistical methodology to do so. It is delightful to see classical statistical tools used as the primary tool for analyzing this data set.

Since we tend to be rather disagreeable sorts, let us start by disagreeing with a couple of statements Brown makes. First, while discussing the low percentage of variance in the 1900-nm region accounted for by the sugars, he states “ where there is most variability in the spectrum and might wrongly be favored region in some methods of analysis” (at the top of p. 257).
We have to disagree with his decision that using the 1900-nm region is “wrong”. This is a value judgment and not supported by any evidence. To the contrary, he is erroneously treating the water component of the mixtures as though it had no absorbance, despite his recognition that water, and the 1900-nm region in particular, has the strongest absorbance of any component in his samples.

Why say this? Because of the result of combining two facts:

1. The system is closed in that the total concentrations of all four components add to 100%, and also because the total variance due to all four components (and interactions, etc.) adds to 100%.
2. The water not only has absorbance, it is the strongest absorber in the mixtures.

If water had no absorbance, that is, if it were the “perfect non-absorbing solvent” that we like to deal with, then Brown’s statement would be correct: it would not contribute to the variance and the three sugars would be the source of all variance. But in that case the total variance in the 1900-nm region would also be less than it actually is, so we cannot say a priori what would happen “if”.

But we can say the following: since the absorbance of the water in that wavelength region is strong, we can consider the possibility that a measurement there will be a (inverse, to be sure) measure of “total sugar” or some equivalent. However, the way the experiment is set up precludes a determination of the presence of nonlinearity of the water absorbance in that region. If it were linear, then it should be determinable with the least error of all four components, since it has the strongest absorbance and therefore any fixed amount of random error would have the least relative effect. Then it would be a matter of determining which two sugars could be determined most accurately, and then the third by difference. This is essentially what he does for the linear effects he analyzes, so this would not be breaking any new ground, just using the components that are most accurately determined to compute all concentrations. But to get back to where this all came from, this is the reason we disagree with his statement that using the water absorbance is “wrong”.
Now let us do a thought experiment, illustrated in Figure 68-1 (Figure 68-1a is copied from [5]): imagine a univariate spectroscopic calibration (with some random error superimposed) that follows what is essentially a straight line, as shown, over some range of values. Now raise the question: what prevents extrapolating that calibration? We believe it is nonlinearity. For the univariate case it is well-nigh self-evident. At least it is to us; see Figure 68-1b. As Figure 68-1b shows, if the underlying data is linear, there should be no problem extending the calibration line (the extension being shown as a broken line) and using the extended line to perform the analysis with the same accuracy as the original data was analyzed.

[Figure 68-1, panels (a) and (b): test results plotted against analyte concentration; panel (b) marks the end of the original range and the extrapolated data.]

Figure 68-1 (a) Artificial data representing a linear relationship between the two variables. This data represents a linear, one-variable calibration. (b) The same artificial data extended in a linear manner. The extrapolated calibration line (broken line) can predict the data beyond the range of the original calibration set with equivalent accuracy, as long as the data itself is linear.

Yet, to not be able to extrapolate a calibration is something “everyone knows”. What nobody knows, near as we can tell, is why we have to put up with that limitation. There are a couple of other, low-probability, answers that could be brought up, such as some sort of discontinuity in one or the other of the variables, but otherwise, any deviation of the data in the region of extrapolation would ipso facto indicate nonlinearity. Therefore, by far the most common cause of not being able to extrapolate that calibration is nonlinearity (almost by definition: a departure from the straight line is essentially the definition of nonlinearity). Engineers can point to various known physical phenomena of instruments to explain where nonlinearity in spectra can arise: stray light at the high-absorbance end and detector saturation effects at the low-absorbance end of the ranges, for example. Chemists can point to chemical interactions between the components as a source of nonlinearity at any part of the range. But mathematically, if you can make those effects go away, there is no reason left why you could not reliably extrapolate the calibration model.

Now let us consider a two-wavelength model for one of the components in a solution containing two components in a nonabsorbing solvent (hypothetical case, NOT water in the NIR!). Nonlinearity in the relationship of each of the two components to its absorbance will have a different effect. If the component being calibrated for has a nonlinear relationship, that will show up in the plot of the predicted versus actual values, as a more-or-less obvious curvature in the plot, somewhat as we showed as Figure 68-1b in our Chapter 68 [1].
A nonlinear relationship in the “other” component, however, will not show up that way. Let us try to draw a word picture to describe what we are trying to say here (the way we draw, this is by far the easier way): since we could imagine this being plotted in three dimensions, the nonlinear relation will be in the depth dimension, and will be projected on the plane of the predicted-versus-actual plot of the component being calibrated for. In this projection, the nonlinearity will show up as an “extra” error superimposed on the data, and will be in addition to whatever random error exists in the “known” values of the composition. Unless the concentrations of the “other” component are known, there is no way to separate the effects of the nonlinearity from the random error, however. While we cannot actually draw this picture, graphical illustrations of these effects have been previously published [8].

Again, however, if there is perfect linearity in the relationship of the absorbance at both wavelengths with respect to the concentrations of the components, one should equally well be able to extrapolate the model beyond the range of either or both components in the calibration set, just as in the univariate case. The problem is knowing where, and how much, nonlinearity exists in the data. Here is where Brown has made a good start on determining this property in his paper back in 1993, at least for the limited case he is dealing with: a designed experiment with (optically) nonscattering samples.

Now for Philip Brown’s main (by our reckoning) result: when he used the wavelengths of minimum nonlinearity to perform the calibration, he found that he was indeed able to extrapolate the calibration. Repeat: under circumstances where the effects of data nonlinearity (from all sources) are minimized, he was able to extrapolate the calibration.
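The word picture of the “other” component can also be put into numbers. The following Python sketch is purely illustrative: the absorptivities, the amount of curvature, the sample count, and the function name are assumptions made here, not values from the text or from reference [8]. It builds noise-free two-wavelength data for two components, bends the response of the second component at one wavelength only, and fits an inverse least-squares model for the first component.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200
c1 = rng.uniform(0.0, 1.0, n_samples)   # analyte being calibrated
c2 = rng.uniform(0.0, 1.0, n_samples)   # the "other" component


def calibration_rmse(curvature):
    """RMS error of a two-wavelength inverse least-squares model for c1.

    `curvature` bends the response of the second component at wavelength 2
    only; all absorptivities are hypothetical.
    """
    abs_1 = 1.0 * c1 + 0.5 * c2
    abs_2 = 0.3 * c1 + 1.0 * (c2 - curvature * c2**2)
    # Model: c1 = b0 + b1 * abs_1 + b2 * abs_2
    design = np.column_stack([np.ones(n_samples), abs_1, abs_2])
    coeffs, *_ = np.linalg.lstsq(design, c1, rcond=None)
    residuals = c1 - design @ coeffs
    return np.sqrt(np.mean(residuals**2))


print(calibration_rmse(0.0), calibration_rmse(0.2))
```

With no curvature the model reproduces the analyte concentrations essentially exactly, since there is no noise in the simulated data. With curvature in the other component, an error appears in the analyte predictions even though the analyte's own response is perfectly linear, and in a predicted-versus-actual plot it would look like additional scatter rather than curvature.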

In this paper he makes the statement, “One might argue that trying to predict values of composition outside the data used in calibration breaks the cardinal rule of not predicting outside the training data.” He seems almost surprised at being able to do that. But given our discussion above, he should not be. So in this case the only surprise is his own at being able to extrapolate the predictions; we think that it is inevitable, since he has found a way to utilize only those wavelengths where nonlinearity is absent. Now what we need are ways to extend this approach to samples more nearly like “real” ones. And if we can come up with a way to determine the spectral regions where all components are linearly related to their absorbances, the issue of not being able to extrapolate a calibration should go away. Surely it is of scientific as well as practical and commercial interest to understand the reasons we cannot extrapolate calibration models, and then to devise ways to circumvent those limitations.

Chemometricians do not believe that good calibration diagnostics properly interpreted can estimate prediction performance, and insist on a separate validation data set. Statisticians, on the other hand, do believe that. Certainly, it is good practice, and statisticians also prefer to verify the estimates through the use of validation data when that is available, but validation data are not always available. In those cases, having generalized statistics available so that you can know when a model will be a good estimate of prediction performance is a major benefit. Statisticians have a long history of dealing with situations of limited data. In one sense we are “spoiled” by having our data being easy and cheap to acquire, so that asking for another 1,000 data points is usually no problem. But any experienced statistician has been in situations where each experiment, giving only one data point each, might cost upward of $10^6.
Estimating prediction performance from the calibration data becomes VERY important under those circumstances. Especially when, say, an “outlier” could mean a fatality. Under those circumstances you do not get a whole lot of volunteers for just testing the prediction performance of the model; you have got to know you can rely on it before you “predict”!

The problem that statisticians have had regarding linearity is the same one that everybody else has had: they have not had a good statistic for determining linearity any more than anybody else, so they also have been limited to idiosyncratic empirical methods. But Philip Brown’s approach may just form the basis of one. Obviously, however, someone needs to do more research on that topic.

I contacted Philip Brown and asked him about this topic. Unfortunately, linearity per se is not of interest to him; the emphasis of the paper he wrote was on the role of linearity in the wavelength-selection process, not the nonlinearity itself. Furthermore, in the years since that paper appeared, his interests have changed and he is no longer pursuing spectroscopic applications. But how to extend the work to understanding the role of nonlinearity in calibration, how to deal with it when an experimental design is not an option, and what to do when optical scatter is the dominant phenomenon in the measurement of samples’ spectra are still very open questions.

REFERENCES

1. Mark, H. and Workman, J., Spectroscopy 20(1), 56–59 (2005).
2. Mark, H. and Workman, J., Spectroscopy 20(3), 34–39 (2005).
3. Mark, H. and Workman, J., Spectroscopy 20(4), 38–39 (2005).
4. Mark, H. and Workman, J., Spectroscopy 20(9), 26–35 (2005).
5. Mark, H. and Workman, J., Spectroscopy 20(12), 96–100 (2005).
6. Mark, H. and Workman, J., Spectroscopy 2(9), 37–43 (1987).
7. Mark, H. and Workman, J., Statistics in Spectroscopy, 1st ed. (Academic Press, New York, 1991).
8. Mark, H., Applied Spectroscopy 42(5), 832–844 (1988).
9. Mark, H. and Workman, J., Spectroscopy 13(6), 19–21 (1998).
10. Mark, H. and Workman, J., Spectroscopy 13(11), 18–21 (1998).
11. Mark, H. and Workman, J., Spectroscopy 14(1), 16–17 (1998).
12. Mark, H. and Workman, J., Spectroscopy 14(2), 16–27,80 (1999).
13. Mark, H. and Workman, J., Spectroscopy 14(5), 12–14 (1999).
14. Mark, H. and Workman, J., Spectroscopy 14(6), 12–14 (1999).
15. Brown, P., Journal of Chemometrics 7, 255–265 (1993).
16. Mark, H. and Workman, J., Spectroscopy 5(9), 47–50 (1990).
17. Mark, H. and Workman, J., Spectroscopy 6(1), 13–16 (1991).
18. Mark, H. and Workman, J., Spectroscopy 6(4), 52–56 (1991).
19. Mark, H. and Workman, J., Spectroscopy 6(July/August), 40–44 (1991).
20. Kramer, R., Chemometric Techniques for Quantitative Analysis (Marcel Dekker, New York, 1998).

69

Connecting Chemometrics to Statistics: Part 1 – The Chemometrics Side

We have been writing about statistics and chemometrics for a long time. Long-time readers of the column series published in Spectroscopy magazine will recall that the series name changed since its inception. The original name was “Statistics in Spectroscopy” (which was a multiple pun, since it referred to

This definition is convenient because it allows us to then jump directly to what is arguably the simplest chemometric technique in use, and consider that as the prototype for all chemometric methods; that technique is multiple regression analysis. Written out in matrix notation, multiple regression analysis takes the form of a relatively simple matrix equation:

$$B = (A^T A)^{-1} A^T C \tag{69-1}$$

where B represents the vector of coefficients, A represents the matrix of independent variables and C represents the vector of dependent variables. One part of that equation, $(A^T A)^{-1}$, appears so commonly in chemometric equations that it has been given a special name: it is called the pseudoinverse of the matrix A. The uninverted term $A^T A$ is itself fairly commonly found, as well. The pseudoinverse appears as a common component of chemometric equations because it confers the Least Squares property on the results of the computations; that is, for whatever is being modeled, the computations defined by equation 69-1 produce a set of coefficients that give the smallest possible sum of the squares of the errors, compared to any other possible linear model.

HUH?? It does? How do we know that? Well, let us derive equation 69-1 and see. We start by assuming that the relationship between the independent variables and the dependent variable can be described by a linear relationship:

$$C = \beta A \tag{69-2}$$

where $\beta$, as we have noted previously, represents the “true”, or Population, values of the coefficients [1]. Equation 69-2 expresses what is often called the “Inverse Least Squares”, or P-matrix, approach to calibration. Since we do not know what the true values of the coefficients are, we have to calculate some approximation to them. We therefore express the calculation in terms of “statistics”, quantities that we can calculate from the data (see that same chapter for further discussion of these points):

$$C = bA \tag{69-3}$$

How are we going to perform that calculation? Well, to start with, we need something to base it on, and the consensus is that the calculation will be based on the errors, since in truth, equation 69-3 is not exactly correct because C will in general NOT be exactly equal to bA. Therefore we extend equation 69-3:

$$C = bA + \text{error} \tag{69-4}$$
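Before working through the derivation, the Least Squares claim made for equation 69-1 can be checked numerically. The sketch below is a hedged illustration on simulated data: the sample count, the “true” coefficients, and the noise level are arbitrary choices made here, and it assumes the convention in which each row of A is one sample, so that the model reads C = AB + error. It computes B from equation 69-1 and then confirms that randomly perturbed coefficient vectors never give a smaller sum of squared errors.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(30, 3))                     # 30 samples, 3 variables
C = A @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=30)

# Equation 69-1, written as a linear solve rather than an explicit inverse.
B = np.linalg.solve(A.T @ A, A.T @ C)


def sum_squared_error(coefficients):
    """Sum of squared differences between C and the linear model A @ coefficients."""
    return float(np.sum((C - A @ coefficients) ** 2))


best = sum_squared_error(B)
# Nudge the coefficients in random directions; none should beat B.
worse = [sum_squared_error(B + rng.normal(scale=0.05, size=3)) for _ in range(1000)]

print(best, min(worse))
print(np.allclose(B, np.linalg.lstsq(A, C, rcond=None)[0]))
```

Solving the normal equations with `np.linalg.solve` avoids forming the inverse explicitly, which is generally the numerically safer choice; the comparison with `np.linalg.lstsq` is there only as an independent cross-check.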

Now that we have a correct equation, we want to solve this equation (or equation 69-3, which is essentially equivalent) for b. Now, if matrix A had the same number of rows and columns (a square matrix), we could form its inverse, and multiply both sides of equation 69-3 by $A^{-1}$:

$$C A^{-1} = b A A^{-1} \tag{69-5}$$

and since multip