where $N$ is the number of data vectors in the training set and $p_k$ is the property of the $k$th data vector.
SELECT (28) is a feature selection technique that generates orthogonal features based on their importance to classification. The criterion for importance for categorized data is the variance or Fisher weight and for continuous-property data, the correlation-to-property weight (see [...] selected as the first feature decorrelated from the chosen feature. The decorrelated features are reweighted and the feature whose new weight is highest becomes the second selected feature. The process continues until either a specified number of features is chosen or a given minimum weight attained. The selected (unweighted) features are output to a file for later use. The user can opt for the decorrelated features or the same features in their unchanged form. Since one set is a linear combination of the other set, the same information is retained for either option. Only the representation is changed (i.e., the sub-feature space is either rotated or not rotated to orthogonal axes).

GRAB. As a feature selection method, GRAB (12) is intermediate between weight (with no feature decorrelation) and the more expensive SELECT (with total decorrelation). A previously weighted file of $n$ data vectors is input to the routine. Each feature is assigned an initial weight

$$W(1)_i = \left[\sum_{k=1}^{n} \left(x_{i,k} - \bar{x}_i\right)^2\right]^{1/2}$$
The feature with the largest weight is selected as the first new feature. Each of the remaining features is reweighted such that if $C_{i,j}$ is the correlation between the $i$th feature just chosen and the remaining feature $j$,

$$W(2)_j = W(1)_j \left[1 - |C_{i,j}|\right]$$

For the $m$th iteration the weight of the $j$th feature remaining is

$$W(m)_j = W(1)_j \prod_{i}^{m-1} \left[1 - |C_{i,j}|\right]$$

where the product runs over the $m-1$ features already chosen.
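The GRAB reweighting loop described above can be sketched in a few lines. This is only an illustration under stated assumptions, not ARTHUR's code: the function names `correlation` and `grab_select` are hypothetical, the initial weight is taken to be the root sum of squared deviations, and features are assumed to be non-constant so that the correlation is defined.

```python
import math

def correlation(a, b):
    """Pearson correlation between two feature columns (assumed non-constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def grab_select(features, n_select):
    """Return indices of selected features, in order of selection.

    features: list of columns, one list of values per feature.
    """
    # Initial weight W(1): root sum of squared deviations (an assumption).
    weights = []
    for column in features:
        mean = sum(column) / len(column)
        weights.append(math.sqrt(sum((x - mean) ** 2 for x in column)))

    remaining = set(range(len(features)))
    selected = []
    while remaining and len(selected) < n_select:
        # Pick the remaining feature with the largest current weight.
        best = max(sorted(remaining), key=lambda j: weights[j])
        selected.append(best)
        remaining.remove(best)
        # Down-weight every remaining feature by 1 - |C| with the chosen one.
        for j in remaining:
            c = correlation(features[best], features[j])
            weights[j] *= 1.0 - abs(c)
    return selected
```

A feature that is perfectly correlated with one already chosen ends up with weight zero, so it is selected last, which is the intended intermediate behavior between plain weighting and full decorrelation.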
In Chemometrics: Theory and Application; Kowalski, B.; ACS Symposium Series; American Chemical Society: Washington, DC, 1977.
2. HARPER ET AL.   ARTHUR and Experimental Data Analysis
[...] the resulting selected features are autoscaled but neither weighted nor decorrelated.

SCALE offers several methods of scaling features (26) in the data. The scaling factors are derived from the $n$ data vectors of the training set and applied to all of the data. For data for which the uncertainties are known, several scaling schemes are available. The following conventions are available for scaling without uncertainty weighting:

Autoscaling: If $x_{i,j}$ is the $i$th feature associated with the $j$th data vector, then

$$x'_{i,j} = \frac{x_{i,j} - \bar{x}_i}{\left[\frac{1}{n}\sum_{j=1}^{n}\left(x_{i,j} - \bar{x}_i\right)^2\right]^{1/2}} \qquad (1)$$
where $n$ is the total number of data vectors in the training data. The resulting new features all have a mean of 0.0 and a variance of 1.0. This removes any inadvertent weighting that might occur due to the difference in magnitude of the features.

Range scaling: If $Xmin_i$ and $Xmax_i$ are the minimum and maximum values, respectively, of feature $i$ in the training data, then

$$x'_{i,k} = \frac{x_{i,k} - Xmin_i}{Xmax_i - Xmin_i} \qquad (2)$$

scales each feature to a range of 1 lying between 0.0 and 1.0.

Mean subtraction:

$$x'_{i,k} = x_{i,k} - \bar{x}_i \qquad (3)$$

Variance normalization:

$$x'_{i,k} = \frac{x_{i,k}}{s_i} \qquad (4)$$

where $s_i$ is the standard deviation of feature $i$ in the training data.

Mean normalization:

$$x'_{i,k} = \frac{x_{i,k}}{\bar{x}_i} \qquad (5)$$
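As a rough illustration of the scaling conventions listed above, the following sketch derives the scaling factors from a training column and applies them to any column of the same feature. The function names are hypothetical, and dividing by $n$ (rather than $n-1$) in the variance is an assumption.

```python
import math

def autoscale(train, values=None):
    """Subtract the training mean and divide by the training standard deviation."""
    values = train if values is None else values
    n = len(train)
    mean = sum(train) / n
    # Population form (divide by n); this choice is an assumption.
    std = math.sqrt(sum((x - mean) ** 2 for x in train) / n)
    return [(x - mean) / std for x in values]

def range_scale(train, values=None):
    """Map the training minimum to 0.0 and the training maximum to 1.0."""
    values = train if values is None else values
    lo, hi = min(train), max(train)
    return [(x - lo) / (hi - lo) for x in values]

def mean_normalize(train, values=None):
    """Divide each value by the training mean."""
    values = train if values is None else values
    mean = sum(train) / len(train)
    return [x / mean for x in values]
```

Test-set values scaled with training-set factors may fall outside the 0.0 to 1.0 interval under range scaling; that follows directly from deriving the factors from the training data only.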
Scaling methods which weight the measurement by its uncertainty (3) include: (a) error-weighted autoscale, (b) error-weighted mean subtraction, and (c) error-weighted mean normalization. In each of these methods the mean in equations 1, 3, and 5 is replaced by a weighted mean, $\bar{x}_i$, which is calculated as:
$$\bar{x}_i = \frac{\sum_{j=1}^{n} x_{i,j}/u_{i,j}}{\sum_{j=1}^{n} 1/u_{i,j}}$$
where the sum is over the data vectors and $u_{i,j}$ is the uncertainty associated with feature $x_{i,j}$.

Linear Discriminant Analysis

The linear classification section of ARTHUR contains routines for both category and continuous property data. For category data, multi-linear regression and hyperplane discriminant analysis are available.

LEAST performs a least-squares [...] is best suited to continuous property problems. If $D$ is a data matrix with associated property matrix $P$, then $(D^T D)^{-1} D^T P$ is the least squares solution to the set of linear equations $P = DW$, where $W$ is a vector which weights the utility of the features in fitting the data. In actual practice, determination of the weight vector is done by

$$W = \left[E\,C^{-1}E\right] X^T P$$

where $X$ is obtained by mean normalization of $D$, $C^{-1}$ is the inverted correlation matrix associated with $D$ and $E$ is a diagonal matrix whose elements are the reciprocal variances of the features. Prediction of an unknown property $p_i$ based on the weight vector obtained is therefore
$$p_i = x \cdot W$$

LEDISC is a multi-linear least squares regression designed for categorized data. Except in property definitions it is computationally equivalent to LEAST. For a data set of $n$ categories, $n$ linear regressions are performed such that for the $i$th regression the property $P$ is defined as

$$p = \begin{cases} +1 & \text{for all vectors in category } i \\ 0 & \text{for all vectors not in category } i \end{cases}$$

An unknown data vector is placed into that class whose weight vector produces the largest value.

LESLT is a variable reduction technique which seeks to optimize category pair separation in as few variables as possible (30). A
feature derived is a linear combination of the original data that describes the position of a data vector relative to a hyperplane between two categories in the data set. The input data matrix ($X$) of $n$ categories is divided into $n(n-1)/2$ submatrices. If $Y$ is the submatrix containing only those patterns in categories $i$ and $j$ plus the test data, an outcome column matrix of properties can be defined such that

$$G^{i,j} = \begin{cases} -1 & \text{for patterns in } i \\ +1 & \text{for patterns in } j \end{cases}$$

Thus defined, there exists a vector $W$ of weights such that $YW = G^{i,j}$. (Determination of $W$ is the least squares solution for this equation (see LEAST).) The weight vector obtained is used to transform and classify all the data vectors in $Y$. This process is followed for all category pairs. Once all the weight vectors are obtained, the entire data matrix ($X$) is transformed such that $X' = XW$. The new matri [...] are approximate category-pai [...]
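The one-regression-per-category idea behind LEDISC, described above, can be sketched with NumPy's least-squares solver. This is a minimal illustration, not the ARTHUR routine: the names `ledisc_fit` and `ledisc_predict` are hypothetical, and appending a constant column as an intercept is an assumption standing in for the mean handling described under LEAST.

```python
import numpy as np

def ledisc_fit(X, labels):
    """Fit one least-squares regression per category (property 1 or 0)."""
    X = np.asarray(X, dtype=float)
    classes = sorted(set(labels))
    # Augment with a constant column so each regression has an intercept
    # (an assumption made for this sketch).
    Xa = np.column_stack([X, np.ones(len(X))])
    # Column i of P is +1 for vectors in category i and 0 otherwise.
    P = np.array([[1.0 if lab == c else 0.0 for c in classes] for lab in labels])
    W, *_ = np.linalg.lstsq(Xa, P, rcond=None)
    return classes, W

def ledisc_predict(classes, W, x):
    """Assign x to the class whose weight vector gives the largest value."""
    scores = np.append(np.asarray(x, dtype=float), 1.0) @ W
    return classes[int(np.argmax(scores))]
```

For example, with three well-separated groups in two features, `ledisc_predict` returns the label of the group nearest the unknown vector; with strongly overlapping or collinear groups a linear regression on indicator properties can mask a middle class, which is a known limitation of this kind of scheme.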
LEPIECE (12) does a piece-wise least squares multiple regression for each data vector in the training and test set. The property of each data vector is predicted from the fit (see LEAST) using the k-nearest-neighbors (see KNN) to the vectors. The value of k is a user-defined multiple of the number of features. The criterion used for "nearest" is the interpattern distance (see DISTANCE). Only those features used in the determination of the distance are used in the regression.

MULTI is a hyperplane discriminant function method designed for multi-category data (31). Computationally, it is equivalent to PLANE, except in category definition. For a data matrix of $n$ categories, $n$ hyperplanes are generated such that the $i$th hyperplane describes the separation of the $i$th category from the rest of the data.
PLANE generates and classifies on the basis of a linear discriminant function (31) and is best suited to data containing two categories (see MULTI for multicategory case). By an error-correction feedback method it seeks a hyperplane in an augmented $n+1$ space (where $n$ is the number of features) that best separates a pair of categories. Each data vector in $n$ space is considered a vector in $n+1$ space where the $(n+1)$th feature is unity. Therefore, two classes can be defined as lying on either side of a hyperplane (whose equation in $n+1$ space is $W \cdot Y = 0$), through the origin with corresponding class numbers +1 and -1. The discriminant function is calculated by first loading a weight vector $W$ with random or user-defined values. During training, classification of vector $Y_k$ by this weight vector is a decision of the form
$$W \cdot Y_k = S_k = \begin{cases} \text{correct,} & \text{if the sign of the response relative to the hyperplane is the same as the sign of its class} \\ \text{incorrect,} & \text{if the sign is not the same} \end{cases}$$
If a pattern is misclassified, the weight vector is adjusted by reflection of the hyperplane about the misclassified point. The new weight vector is then used to classify the data. The process continues until all patterns in the training set are correctly classified or a maximum number of iterations is reached. For more than two categories, a hyperplane separating each pair of categories is found. An unknown data vector is then classified using a majority committee vote procedure on all the discriminant function responses. The use of PLANE for multi-category data is equivalent to a piece-wise learning machine (31).

REGRESS is a multidimensiona [...] computes a linear discriminant function. It accepts both category and continuous data. Two optimization methods are available. Either the residual variance can be minimized or the multiple correlation can be maximized.

STEP (32) is a stepwise multi-linear regression method. Features used in the regression are determined by their contribution to the overall variance. In the regression, features are added one at a time such that the feature that is added makes the greatest improvement in the "goodness of fit." When a feature that is indicated to be significant to the reduction in variance in an early stage of the regression is indicated to be insignificant after the addition of several other features, it is eliminated from the regression before addition of another feature.

The criterion for selection of a feature to add or remove from the calculation is as follows:

Removal: If the variance contribution is insignificant at a specified F-level, the feature is removed from the regression.

Addition: If the variance reduction due to addition of a feature is significant at a specified F-level, this feature is entered into the regression.
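The error-correction feedback used by PLANE, described earlier in this section, can be illustrated with a short two-class sketch. It is not the ARTHUR implementation: the names `plane_train` and `plane_classify` are hypothetical, the reflection step is written as $W' = W - 2\,(W \cdot Y)/(Y \cdot Y)\,Y$ (which reverses the sign of the response for the misclassified vector), and the small nudge applied when the response is exactly zero is an added assumption.

```python
def plane_train(vectors, classes, weights, max_iter=100):
    """Train a two-class linear discriminant by reflection.

    vectors: list of feature lists; classes: +1 or -1 per vector;
    weights: initial weight vector of length n_features + 1.
    """
    w = [float(v) for v in weights]
    # Augment each vector with a unit (n+1)th feature.
    augmented = [[float(v) for v in vec] + [1.0] for vec in vectors]
    for _ in range(max_iter):
        errors = 0
        for y, cls in zip(augmented, classes):
            s = sum(wi * yi for wi, yi in zip(w, y))
            if s * cls <= 0:  # response has the wrong sign (or is zero)
                errors += 1
                yy = sum(yi * yi for yi in y)
                if s != 0:
                    step = 2.0 * s / yy   # reflection: new response is -s
                else:
                    step = -cls / yy      # assumption: nudge off the plane
                w = [wi - step * yi for wi, yi in zip(w, y)]
        if errors == 0:
            break
    return w

def plane_classify(w, vector):
    """Return +1 or -1 according to the side of the hyperplane."""
    s = sum(wi * yi for wi, yi in zip(w, list(vector) + [1.0]))
    return 1 if s > 0 else -1
```

The loop stops either when a full pass produces no misclassifications or when `max_iter` passes have been made, mirroring the two stopping conditions given in the text.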
Utilities

The following is a series of routines that permit the user to easily control processing of the data. A major source of ARTHUR's versatility is provided through the "CHANGE" (CHXXX) and the "TUNE" (TUXXX) routines. These routines allow the user to change the various definitions of a problem without changing the form of the input data. The Calcomp and Tektronix plot routines are designed to run on the CDC. The printer/plotter routine VARVAR is machine independent. Any data file can be listed or punched at any point in a run by a call to UTILIT.

NEW initializes the program. System files are rewound and all except one are initialized by writing a one-record header at the beginning of the file. Several fixed common parameters can be redefined with a call to this routine.

ENDIT terminates the program [...] occur by a user call or by the encounter of a recognizable error condition during a run.

UTILIT provides a line printer listing of the data matrix and/or the distance matrix.

INPUT. In INPUT, a coded data matrix may be input to the program. If the data are categorized, they are reordered such that all data vectors belonging to the same class occur together. For continuous data, the user may opt for reordering by the magnitude of the property. Missing values in the data are flagged with a value equal to the largest real number allowed in the program. The data matrix is output to a binary file that is compatible with all other routines in ARTHUR.

INFILL fills in any missing data in the data matrix. For category data, a missing feature value in a data vector of the training set is filled with the mean value of the feature for the category of which the vector is a member. A missing feature value in the test set is filled with the mean value of the feature for all the training data. For continuous property data the feature value is filled with the mean of the data.

INDUMP deletes constant and redundant features in the data. For category data, the occurrence of a constant feature in a category results in an automatic call to terminate the program since many methods employ variance in the data features as a criterion.

VARVAR produces line printer plots of a data matrix. Two options are available. Either two features may be plotted against each other or one feature may be plotted against the properties of the data vectors.
VACALC produces Calcomp plots equivalent to the line printer plots available in VARVAR.

VATKIT produces plots for a Tektronix graphics terminal. These plots are equivalent to those produced in VARVAR.

CHANGE allows the user to quickly and conveniently change a data matrix definition from continuous property to category and vice versa.

CHCATEGORY. In CHCATEGORY the user can redefine the category of selected data vectors, reorder the data, or create a new data matrix with a selected number of categories retained in the training set and all others placed in the test set.

CHDATA allows manipulation of the data vectors in a data matrix. Specified vectors may b [...] and vectors may be delete [...]

CHFEATURE provides feature manipulation. Features may be deleted from the data, transformed (by addition, subtraction, multiplication, division, exponentiation, and logarithmic substitution) and/or combined by any of these operations to form new features.

CHJOIN combines the matrices of two data files. This can occur in two modes. Either the data vectors of the files are connected or the features of the two files are combined. In either case, a new data file is created from the merging.

CHSPLIT. User-defined categories may be split off onto an alternative data file in CHSPLIT.

CHSUB creates a new data file by randomly selecting a subset of the data vectors in the data matrix. By default all categories in the data retain 80% of their data vectors. This percent can be redefined by the user.

CHUNCE. Feature uncertainties can be added, changed, or deleted from the data file in CHUNCE. The uncertainty may be added as a relative or absolute error.
For a given feature ($i$) of a given vector ($j$) the uncertainty $u_{i,j}$ is defined as

$$u_{i,j} = (xabs)_{i,j} + \frac{(xrel)_{i,j}}{100}\, x_{i,j}$$

where $(xabs)_{i,j}$ is the absolute error and $(xrel)_{i,j}$ is the relative error in percent of the feature $x_{i,j}$. By default, $(xabs)_{i,j} = 0$ and $(xrel)_{i,j} = 10.0\%$.
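As a small worked illustration of the uncertainty definition above, the sketch below combines an absolute error with a relative error given in percent. The function name `chunce_uncertainty` is hypothetical, and taking the magnitude of the feature value (so that the uncertainty stays non-negative for negative measurements) is an assumption not stated in the text.

```python
def chunce_uncertainty(x, abs_err=0.0, rel_err_percent=10.0):
    """Uncertainty of one feature value: absolute part plus relative part.

    Defaults follow the text: zero absolute error, 10% relative error.
    """
    # abs(x) keeps the uncertainty non-negative (an assumption).
    return abs_err + (rel_err_percent / 100.0) * abs(x)
```

With the defaults, a measurement of 50.0 carries an uncertainty of about 5.0; setting `rel_err_percent=0.0` leaves only the absolute term.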
TUNE generates a new data file with features formed from the $n(n-1)$ ratios of the original measurements. This simple transform not only offers more features to the various selection algorithms, but also, in many cases, lends stability to the features. For example, in cases where inert methods for sample dilution have taken place before measurement, the ratios of concentrations of elements in a group of samples more readily lend themselves to establishing a common origin than the individual concentrations (18).

TURAND perturbs each measurement of each vector by a function of the error associated with the feature. If $x_{i,j}$ is the $i$th feature of the $j$th data vector, then the error-perturbed feature $x'_{i,j}$ is:

$$x'_{i,j} = x_{i,j} + a_{i,j}\, s_{i,j}$$

where $s_{i,j}$ is the standard deviation of the distribution of [...] and $a_{i,j}$ is a random numbe [...] variance and zero mean (33).
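The TURAND perturbation described above amounts to adding a zero-mean, unit-variance random number scaled by each measurement's standard deviation. A minimal sketch follows; the function name `turand` is hypothetical, and drawing the random numbers from a Gaussian distribution is an assumption, since the text here specifies only the mean and variance.

```python
import random

def turand(matrix, sigmas, seed=None):
    """Return a perturbed copy of matrix, where x' = x + a * s.

    matrix, sigmas: lists of rows with matching shapes.
    seed: optional seed so a perturbation can be reproduced.
    """
    rng = random.Random(seed)
    return [
        # a is drawn from N(0, 1); the Gaussian form is an assumption.
        [x + rng.gauss(0.0, 1.0) * s for x, s in zip(row, sigma_row)]
        for row, sigma_row in zip(matrix, sigmas)
    ]
```

Repeating the call with different seeds gives a family of perturbed data sets, which can then be passed through the same analysis to see how sensitive a result is to measurement error.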
TUMED normalizes all vectors in a data matrix by their Euclidean distances from a defined origin. This transform is often useful in data that exhibit a "time" dependence.

TUTRAN takes the transpose of a data matrix. The new data file that is formed is output to a disk file for subsequent analysis in other routines.

Acknowledgment

We would like to thank Maynarhs Da Koven for his help in creating ARTHUR. Special thanks are given to Robert W. Gerlach for help and criticisms on this manuscript.

Literature Cited
1. Duewer, D. L., Harper, A. M., Koskinen, J. R., Fasching, J. L. and Kowalski, B. R., ARTHUR, Version 3-7-77,
2. Koskinen, J. R. and Kowalski, B. R., Journal of Chemical Information and Computer Science (1975) 15, 119.
3. Fasching, J. L., Duewer, D. L. and Kowalski, B. R., submitted to Analytical Chemistry.
4. Nie, N. H. et al., "Statistical Package for the Social Sciences, 2nd Edition," McGraw Hill, Inc., New York, 1975.
5. Dixon, W. J., Ed., "BMD, Biomedical Computer Programs," University of California Press, Berkeley, 1971.
6. Kowalski, B. R. and Reilley, C. A., Analytical Chemistry (1971) 42, 1387.
7. Kowalski, B. R. and Bender, C. F., Analytical Chemistry (1973) 45, 2334.
8. Miller, R. G., Biometrika (1974) 61, 1.
9. Duewer, D. L. et al., Analytical Chemistry (1976) 48, 2002.
10. Kowalski, B. R. et al., Analytical Chemistry (1972) 44, 2176.
11. Stevenson, D. F. et al., Archaeometry (1971) 13, 17.
12. Duewer, D. L. et al., "Documentation for ARTHUR, Version 1-8-75," (1975) Chemometrics Society Report No. 2.
13. Reinsch, C. H., Numerische Mathematik (1967) 10, 177.
14. Birnbaum, Z. W., JASA (1952) 47, 425.
15. Davies, O. L. and Goldsmith, P. L., "Statistical Methods for Research and Production," p. 234, Hafner, New York, 1972.
16. Anderson, T. W., "An Introduction to Multivariate Statistical Analysis," p. 65, John Wiley & Sons, New York, 1958.
17. Mahalanobis, P. C., Proceedings of the National Institute of Science of India, p. 49, 122, 1936.
18. Anders, O. U., Analytical Chemistry (1972) 44, 1930.
19. Any numerical analysis text.
20. Horst, P., "Factor Analysis of Data Matrices," Holt, Rinehart and Winston, Inc., New York, 1965.
21. Wold, S., Journal of Pattern Recognition (1976) 8, 127.
22. Kowalski, B. R. in [...] Research, Vol. 2," Academic Press, New York, 1974.
23. Andrews, H. C., "Introduction to Mathematical Techniques in Pattern Recognition," Wiley Interscience, New York, 1972.
24. Kowalski, B. R. and Bender, C. F., Journal of the American Chemical Society (1973) 95, 686.
25. Cover, T. M. and Hart, P. E., IEEE Transactions on Information Theory, IT-13, 21 (1967).
26. Kowalski, B. R. and Bender, C. F., Journal of the American Chemical Society (1972) 94, 5632.
27. Fisher, R. A., Annals of Eugenics (1936) 179.
28. Kowalski, B. R. and Bender, C. F., Journal of Pattern Recognition (1976) 8, 1.
29. Kowalski, B. R., Analytical Chemistry (1969) 41, 695.
30. Kowalski, B. R. and Bender, C. F., Analytical Chemistry (1973) 45, 590.
31. Nilsson, N. J., "Learning Machines," McGraw-Hill, New York, 1965.
32. Ralston, A. and Wolf, H. S., "Mathematical Methods for Digital Computers," p. 191, John Wiley and Sons, New York, 1966.
33. Hamming, R. W., "Introduction to Applied Numerical Analysis," McGraw-Hill, New York, 1971.
3

Abstract Factor Analysis - A Theory of Error and Its Application to Analytical Chemistry

EDMUND R. MALINOWSKI
Department of Chemistry and Chemical Engineering, Stevens Institute of Technology, Hoboken, NJ 07030
Factor analysis (FA) [...] dimensional problems, i [...] of chemistry (1-5). The first step in FA is concerned with determining the number of factors which are responsible for the data. Unfortunately, because of experimental uncertainty this is not an easy task. Most FA methods depend upon an accurate estimate of the uncertainties present in the data. Even then, each FA method may yield a different estimate of the size of the factor space. This poses a fundamental dilemma to the factor analyst in the early stages of his analysis.

In the present paper we will develop a theory of error which helps us overcome this difficulty. Our approach will be quite different from the traditional statistical methods reported by others (3,6). Our attention will be focused on how the error mixes into the abstract factor analysis (AFA) scheme. AFA is that part of the overall FA process which is concerned with data reproduction using the abstract mathematical factors produced by the decomposition of the covariance matrix.

The theory shows that the eigenvalues can be grouped into two sets: a primary set which contains the true factors together with a mixture of error and a secondary set which is composed of pure error. By deleting the secondary set we actually remove error. Consequently the AFA reproduced data is a better representation of the real data than the raw experimental data used in the analysis. Although this is not the prime purpose of AFA, it does constitute an unexpected and useful fringe benefit.

The theory shows that three types of error (real error, RE; extracted error, XE; and imbedded error, IE) exist. These errors are mutually related in a Pythagorean sense and can be calculated from the secondary eigenvalues, the size of the data matrix and the size of the factor space. Arguments are presented to show how the IE function can be used to determine not only the dimension of the factor space but also the real error. Most importantly, this is accomplished without recourse to a knowledge of the experimental error.
An empirical function, called the factor indicator function, IND, is also described. This function, similar to the IE function, is calculated from the secondary eigenvalues. However, it is more sensitive than the IE function in determining the size of the factor space.

Mathematical models are used to illustrate and confirm the theory. The method is then applied to nuclear magnetic resonance, absorption spectrophotometry, mass spectra, gas-liquid chromatography and drug activity.

Theory of Error

Factor analysis is concerned with a matrix of data points. It is based upon expressing each raw data point $d_{ik}$ as a linear sum of product terms. If the data contained no experimental error, the value of a pure data point $d^*_{ik}$ would be expressed as follows:

$$d^*_{ik} = \sum_{j=1}^{n} r^*_{ij}\, c^*_{jk} \qquad (1)$$

where $r^*_{ij}$ is the $j$th cofactor of the $i$th row designee and $c^*_{jk}$ is the $j$th cofactor of the $k$th column designee. The sum is taken over all $n$ factors which are responsible for the data. However, because of experimental error each raw data point is best represented as a sum of pure data and an error $e_{ik}$,

$$d_{ik} = d^*_{ik} + e_{ik} \qquad (2)$$
Such errors mix into the FA process, perturbing the cofactors and producing an excessive number of factors. Instead of eq. (1) we obtain eq. (3)

$$d_{ik} = \sum_{j=1}^{c} r^{\phi}_{ij}\, c^{\phi}_{jk} \qquad (3)$$

where the sum is taken over $c$ factors rather than $n$; $c$, of course, is greater than $n$. Superscript $\phi$ is used here to distinguish these cofactors from those belonging to the pure data.

From our knowledge of matrix algebra we can readily show that the number of terms in the above sum is either $r$, the number of rows in the raw data matrix, or $c$, the number of columns in the raw data matrix, whichever is the smaller number. Throughout our discussion we will assume that $c$ is smaller than $r$. Hence the sum in eq. (3) involves $c$ terms.

The proposed theory of error is based upon the following observation. Although $c$ eigenvectors are required to span the
total raw data space, only $n$ of them are required to span the data space within experimental error. Because any orthogonal set of axes, of the proper number, can be used to describe the error space, the same basis axes used to describe the raw data space can be used to describe the error space. Accordingly, then, the error $e_{ik}$ associated with data point $d_{ik}$ can be expressed as follows:

$$e_{ik} = \sum_{j=1}^{n} \sigma^{\phi}_{ij}\, c^{\phi}_{jk} + \sum_{j=n+1}^{c} \sigma^{0}_{ij}\, c^{\phi}_{jk} \qquad (4)$$
Here $c^{\phi}_{jk}$ is a component of the $j$th basis axis used to describe the raw data, $\sigma^{\phi}_{ij}$ is the projection of the $i$th error onto the $j$th axis and $\sigma^{0}_{ij}$ is the projection of the $i$th error onto the $j$th axis of the pure error space. Notice here that we have grouped the error factors into tw [...] $j = 1$ to $j = n$, is associate [...] (called primary eigenvectors) which are required to account for the real data. The second sum, from $j = n+1$ to $j = c$, is associated with unnecessary eigenvectors (called secondary eigenvectors) which are produced by experimental error.

Equation (4) represents the foundation of the proposed theory of error. We will now explore its significance and confirm its validity by tracing its path through the factor analysis scheme. Upon placing eq. (4) and (1) into (2) we find
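$$d_{ik} = \sum_{j=1}^{n} \left( r^*_{ij}\, c^*_{jk} + \sigma^{\phi}_{ij}\, c^{\phi}_{jk} \right) + \sum_{j=n+1}^{c} \sigma^{0}_{ij}\, c^{\phi}_{jk} \qquad (5)$$

Comparing this result with eq. (3) we see that

$$r^{\phi}_{ij}\, c^{\phi}_{jk} = r^*_{ij}\, c^*_{jk} + \sigma^{\phi}_{ij}\, c^{\phi}_{jk} \qquad (6)$$

In other words eqs. (3) and (5) can be expressed as

$$d_{ik} = \sum_{j=1}^{n} r^{\phi}_{ij}\, c^{\phi}_{jk} + \sum_{j=n+1}^{c} \sigma^{0}_{ij}\, c^{\phi}_{jk} \qquad (7)$$

This can also be written as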
$$d_{ik} = d^{\phi}_{ik} + e^{0}_{ik} \qquad (8)$$

where

$$d^{\phi}_{ik} = \sum_{j=1}^{n} r^{\phi}_{ij}\, c^{\phi}_{jk} \qquad (9)$$

and

$$e^{0}_{ik} = \sum_{j=n+1}^{c} \sigma^{0}_{ij}\, c^{\phi}_{jk} \qquad (10)$$
Since $e^{0}_{ik}$ is associated with pure error the retention of an excessive number of eigenvectors simply tends to reproduce experimental error. Henc [...] should be deleted. Whe [...] difference between the raw data point and the reproduced data point. The imbedded error $e^{\phi}_{ik}$ is defined as the difference between the reproduced data point and the pure data point:

$$e^{\phi}_{ik} = d^{\phi}_{ik} - d^*_{ik} \qquad (11)$$
Because the trace of the covariance matrix, constructed by premultiplying the data matrix by its transpose, is invariant upon a similarity transformation, the following is true:

$$\sum_{i=1}^{r} \sum_{k=1}^{c} d_{ik}^{2} = \sum_{j=1}^{n} \lambda^{\phi}_{j} + \sum_{j=n+1}^{c} \lambda^{0}_{j} \qquad (12)$$

where

$$\lambda^{\phi}_{j} = \sum_{i=1}^{r} \left(r^{\phi}_{ij}\right)^{2} \quad \text{for } j = 1, \ldots, n \qquad (13)$$

and

$$\lambda^{0}_{j} = \sum_{i=1}^{r} \left(\sigma^{0}_{ij}\right)^{2} \quad \text{for } j = n+1, \ldots, c \qquad (14)$$
Here λ j i s a primary eigenvalue and Xj i s a secondary eigen v a l u e . The l a r g e s t eigenvalues c o n t a i n the pure c o f a c t o r s
In Chemometrics: Theory and Application; Kowalski, B.; ACS Symposium Series; American Chemical Society: Washington, DC, 1977.
3. MALiNOwsKi
57
Abstract Factor Analysis
and thus belong t o the primary s e t . The s m a l l e s t eigenvalues c o n t a i n nothing but pure e r r o r c o f a c t o r s and thus belong t o the secondary s e t . D e l e t i o n o f the secondary eigenvalues and t h e i r a s s o c i a t e d eigenvectors from the AFA scheme should l e a d t o data improvement. Equations (13) and (14) are important because they show how the e r r o r mixes i n t o the eigenvalues. We can a l s o c o n s t r u c t a covariance matrix from the AFA regenerated data matrix. I t s t r a c e i s a l s o i n v a r i a n t upon a s i m i l a r i t y transformation. Hence i=r k=c Σ Σ i = l k=l By s u b t r a c t i n g eq. i=r k=c Σ Σ i = l k=l
rf d
2
ik
(15) from
ik
=
j-n Σ j-l
(15)
(12) we o b t a i n
j=n+l
i=l
j=n+l
13
(16) In In the d i s c u s s i o n which f o l l o w s we w i l l use these equations t o develop f u l l y our understanding o f the r e a l e r r o r , e x t r a c t e d e r r o r , imbedded e r r o r , and t h e i r i n t e r r e l a t i o n s h i p . Residual Standard D e v i a t i o n - The Real E r r o r The r e s i d u a l standard d e v i a t i o n (RSD) i s d e f i n e d i n terms o f the p r o j e c t i o n s o f the e r r o r p o i n t s onto the secondary axes, namely i = r j=c r(c-n)(RSD) Σ Σ σ ° 2 (17) i = l j=n+l 2
J
Because the secondary axes involve pure error components which are deleted from the reproduction, we see that the RSD is, in reality, a measure of the real error (RE), the difference between the pure data and the raw data. By placing eq. (16) into (17) we find

$$ \mathrm{RE} = \mathrm{RSD} = \left[ \frac{\sum_{j=n+1}^{c} \lambda^{\circ}_{j}}{r(c-n)} \right]^{1/2} \qquad (18) $$
Root-Mean-Square Error

The root-mean-square error is defined in terms of $e^{\circ}_{ik}$, the difference between a raw data point and its value regenerated by AFA, namely

$$ rc(\mathrm{RMS})^{2} = \sum_{i=1}^{r} \sum_{k=1}^{c} \left( e^{\circ}_{ik} \right)^{2} \qquad (19) $$

Because the reproduced data matrix is orthogonal to its associated residual error matrix,

$$ \sum_{i} \sum_{k} \left( e^{\circ}_{ik} \right)^{2} = \sum_{i} \sum_{k} \left( d_{ik} - d^{\ddagger}_{ik} \right)^{2} = \sum_{i} \sum_{k} d_{ik}^{2} - \sum_{i} \sum_{k} \left( d^{\ddagger}_{ik} \right)^{2} \qquad (20) $$

Hence, we can readily see from (16), (19) and (20) that

$$ \mathrm{RMS} = \left[ \frac{\sum_{j=n+1}^{c} \lambda^{\circ}_{j}}{rc} \right]^{1/2} \qquad (21) $$
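Dividing eq. (21) by eq. (18) gives a compact relation between the RMS error and the real error. The short derivation below is a sketch added for illustration; the numerical check assumes the 10 x 2 raw data matrix of Table I with n = 1 and the secondary eigenvalue 0.28861 listed in Table II.

```latex
\begin{align*}
\frac{\mathrm{RMS}}{\mathrm{RE}}
  &= \left[\frac{\sum_{j=n+1}^{c}\lambda^{\circ}_{j}}{rc}\right]^{1/2}
     \left[\frac{r(c-n)}{\sum_{j=n+1}^{c}\lambda^{\circ}_{j}}\right]^{1/2}
   = \left[\frac{c-n}{c}\right]^{1/2}, \\
\mathrm{RE}  &= \left[\frac{0.28861}{10\,(2-1)}\right]^{1/2} \approx 0.170, \qquad
\mathrm{RMS}  = \left[\frac{0.28861}{10 \cdot 2}\right]^{1/2} \approx 0.120 .
\end{align*}
```

Since c - n is smaller than c, the RMS error obtained in this way is always smaller than the RE for the same number of factors.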
Pythagorean Relationship

Let us now theoretically form a covariance matrix from the experimental error matrix. The trace of this matrix is invariant upon a similarity transformation. Hence, recalling eq. (4), we conclude that

$$ \sum_{i=1}^{r} \sum_{k=1}^{c} e_{ik}^{2} = \sum_{i=1}^{r} \sum_{j=1}^{n} \left( \sigma^{\ddagger}_{ij} \right)^{2} + \sum_{i=1}^{r} \sum_{j=n+1}^{c} \left( \sigma^{\circ}_{ij} \right)^{2} \qquad (22) $$

Each of these three sums is intimately related to the residual standard deviation as follows:

$$ rc(\mathrm{RSD})^{2} = \sum_{i=1}^{r} \sum_{k=1}^{c} e_{ik}^{2} \qquad (23) $$

$$ rn(\mathrm{RSD})^{2} = \sum_{i=1}^{r} \sum_{j=1}^{n} \left( \sigma^{\ddagger}_{ij} \right)^{2} \qquad (24) $$

$$ r(c-n)(\mathrm{RSD})^{2} = \sum_{i=1}^{r} \sum_{j=n+1}^{c} \left( \sigma^{\circ}_{ij} \right)^{2} \qquad (25) $$

Placing these three equations into eq. (22) we find

$$ (\mathrm{RE})^{2} = (\mathrm{IE})^{2} + (\mathrm{XE})^{2} \qquad (26) $$

where

$$ \mathrm{RE} = \mathrm{RSD} \qquad (27) $$

$$ \mathrm{IE} = \mathrm{RSD} \left[ \frac{n}{c} \right]^{1/2} \qquad (28) $$

$$ \mathrm{XE} = \mathrm{RSD} \left[ \frac{c-n}{c} \right]^{1/2} \qquad (29) $$
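As a rough numerical illustration of eqs. (18), (28) and (29), the following Python sketch evaluates RE, IE and XE from a list of secondary eigenvalues. The function name `error_measures` and its argument names are assumptions introduced here, not part of the original work; the input value 0.28861 is the secondary eigenvalue listed in Table II for the 10 x 2 raw data matrix of Table I.

```python
import math

def error_measures(secondary_eigenvalues, r, c):
    """Return (RE, IE, XE) for an r x c data matrix.

    secondary_eigenvalues holds the c - n eigenvalues assigned to the
    secondary (pure error) set, so n is inferred from its length.
    """
    n = c - len(secondary_eigenvalues)
    if not 0 < n < c:
        raise ValueError("need at least one primary and one secondary eigenvalue")
    rsd = math.sqrt(sum(secondary_eigenvalues) / (r * (c - n)))  # eq. (18)
    re = rsd                                                     # eq. (27)
    ie = rsd * math.sqrt(n / c)                                  # eq. (28)
    xe = rsd * math.sqrt((c - n) / c)                            # eq. (29)
    return re, ie, xe

re, ie, xe = error_measures([0.28861], r=10, c=2)
print(f"RE = {re:.3f}, IE = {ie:.3f}, XE = {xe:.3f}")
# Pythagorean relationship, eq. (26): RE^2 = IE^2 + XE^2
print(abs(re**2 - (ie**2 + xe**2)) < 1e-12)
```

For this one-factor, two-column case IE and XE coincide because n = c - n = 1.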
The real error (RE) is the difference between the pure data and the raw data. The imbedded error (IE) is the difference between the pure data and the AFA reproduced data. The extracted error (XE) is the difference between the AFA reproduced data and the raw data. Figure 1 is presented as a mnemonic illustrating these relationships.

Several important facets of AFA now become apparent. Since n < c we see clearly that IE < RE, hence AFA should always lead to data improvement. Secondly, by comparing eq. (29) to eqs. (18) and (21) we learn that the RMS, the difference between the AFA reproduced data and the raw data, is the extracted error. This is unexpected and quite surprising. If we use too many eigenvectors to reproduce the data we will reduce the extracted error and increase the imbedded error. It is important to use the proper number of eigenvectors in the reproduction process.

Testing the Theory with Mathematical Models
In order to test the proposed theory of error a series of mathematical models of various dimensionalities, sizes and errors were constructed and factor analyzed. The simplest and easiest model to visualize consisted of the one-dimensional data set depicted in Table I. The first two columns represent a pure data matrix. It is obviously one-dimensional since a plot of the points in column one against the corresponding points in column two produces a straight line of unit slope. When this pure data matrix is factor analyzed we obtain the results shown in the first column of Table II. An artificial error matrix (third and fourth columns of Table I) was constructed and added to the pure data matrix to produce the raw data matrix shown in the fifth and sixth columns of Table I. This raw data matrix simulates real chemical data which contains experimental uncertainty. When this matrix was factor analyzed we obtained the results shown in the second and third columns of Table II. Two eigenvalues and two sets of row cofactors emerge from the analysis. These cofactors are intimately related to the raw data points.

Figure 1. Mnemonic diagram of the Pythagorean relationship of the real error (RE), the extracted error (XE) and the imbedded error (IE), and their relationships to the pure, raw, and FA reproduced data.

Table I - Artificial (One-Dimensional) Pure Data Matrix, Error Matrix, Raw Data Matrix and FA Reproduced Raw Data Matrix.

  Pure Data Matrix    Error Matrix      Raw Data Matrix        FA Reproduced
       [D*]               [E]          [D] = [D*] + [E]       Raw Data Matrix
  col. 1   col. 2    col. 1  col. 2    col. 1   col. 2      col. 1     col. 2
     1        1        0.2     0.0       1.2      1.0       1.0936     1.1052
     2        2       -0.2    -0.2       1.8      1.8       1.7904     1.8095
     3        3       -0.1     0.1       2.9      3.1       2.9846     3.0163
     4        4        0.0    -0.1       4.0      3.9       3.9288     3.9705
     5        5       -0.1     0.0       4.9      5.0       4.9240     4.9763
     6        6        0.2    -0.2       6.2      5.8       5.9671     6.0305
     7        7        0.2    -0.1       7.2      6.9       7.0118     7.0862
     8        8       -0.2     0.1       7.8      8.1       7.9086     7.9926
     9        9       -0.2     0.1       8.8      9.1       8.9033     8.9978
    10       10       -0.1     0.2       9.9     10.2       9.9974    10.1036

To understand this relationship we plot the raw data points of column 1 against those in column 2. This plot is shown in Figure 2. Notice that the points lie in a two-dimensional plane. The row cofactors in Table II are the perpendicular projections of these points onto the primary and secondary axes which emerge from the FA. Notice that the projections onto the secondary axis contain nothing but error as predicted by the theory.

When the secondary eigenvector is deleted we obtain the FA reproduced raw data matrix shown in Table I. The RMS of the difference between the FA reproduced data and the pure data is calculated to be 0.087, whereas the RMS of the error matrix is 0.148. This is in accord with our prediction based upon the proposed theory of error. The imbedded error, 0.087, is less than the real error, 0.148, by a factor of approximately $(n/c)^{1/2} = (1/2)^{1/2}$ as predicted by eq. (28). Exact agreement should not be expected because twenty data points do not constitute a good statistical sample. Instead of a direct comparison we can calculate the RE and IE from the secondary eigenvalues via eqs. (18) and (28). These results are shown in Table III together with the direct calculations described above.
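The one-dimensional test of Table I can be retraced numerically. The sketch below is an illustration only, not the author's program: it assumes the raw data are the sum of the pure data and the error matrix of Table I, forms the 2 x 2 covariance matrix, obtains its eigenvalues in closed form, reproduces the data with the primary eigenvector alone, and compares the direct RMS values with the RE and IE from eqs. (18) and (28). All variable names are introduced here for the example.

```python
import math

# Table I: pure data (both columns run 1..10) and the artificial error matrix.
pure = [(float(i), float(i)) for i in range(1, 11)]
error = [(0.2, 0.0), (-0.2, -0.2), (-0.1, 0.1), (0.0, -0.1), (-0.1, 0.0),
         (0.2, -0.2), (0.2, -0.1), (-0.2, 0.1), (-0.2, 0.1), (-0.1, 0.2)]
raw = [(p1 + e1, p2 + e2) for (p1, p2), (e1, e2) in zip(pure, error)]
r, c, n = len(raw), 2, 1

# Covariance matrix Z = D^T D (2 x 2, symmetric).
z11 = sum(a * a for a, _ in raw)
z22 = sum(b * b for _, b in raw)
z12 = sum(a * b for a, b in raw)

# Closed-form eigenvalues of a symmetric 2 x 2 matrix.
half_gap = math.hypot((z11 - z22) / 2.0, z12)
lam1 = (z11 + z22) / 2.0 + half_gap   # primary eigenvalue
lam2 = (z11 + z22) / 2.0 - half_gap   # secondary eigenvalue

# Unit primary eigenvector, then reproduction of the data with one factor.
norm = math.hypot(z12, lam1 - z11)
v1, v2 = z12 / norm, (lam1 - z11) / norm
reproduced = [((a * v1 + b * v2) * v1, (a * v1 + b * v2) * v2) for a, b in raw]

def rms(rows_a, rows_b):
    """Root-mean-square difference between two lists of (col1, col2) rows."""
    total = sum((a1 - b1) ** 2 + (a2 - b2) ** 2
                for (a1, a2), (b1, b2) in zip(rows_a, rows_b))
    return math.sqrt(total / (2 * len(rows_a)))

rms_real = rms(raw, pure)               # direct real error
rms_imbedded = rms(reproduced, pure)    # direct imbedded error
rms_extracted = rms(raw, reproduced)    # direct extracted error
re = math.sqrt(lam2 / (r * (c - n)))    # eq. (18)
ie = re * math.sqrt(n / c)              # eq. (28)
print(round(lam1, 4), round(lam2, 5))
print(round(rms_real, 3), round(rms_imbedded, 3), round(rms_extracted, 3))
print(round(re, 3), round(ie, 3))
```

The printed eigenvalues can be compared with Table II, and the direct RMS values with the 0.148 and 0.087 quoted in the text.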
Many other artificial sets of data were constructed, having different sizes, dimensionalities and errors. A summary of the factor analyses of these matrices is given in Table III. In each case the RMS of the artificial error compares favorably with the real error predicted from eq. (18). Also the RMS of the difference between the pure and reproduced data compares favorably with the imbedded error predicted from eq. (28). These tests give credence to the proposed theory of error.

Imbedded Error Function

It is possible to deduce the true number of factors in a data matrix by studying the behavior of the IE as a function of n. If the errors are distributed randomly and fairly uniformly throughout the data matrix, and are free from sporadic or systematic behavior, then we should expect their projections onto each of the secondary axes to be approximately the same. If this is true we may set $\lambda^{\circ}_{j} = \lambda^{\circ}_{j+1} = \ldots = \lambda^{\circ}_{c}$. Inserting this into eqs. (18) and (28) gives

$$ \mathrm{IE} = k\, n^{1/2} \quad \text{for } n > \text{true } n \qquad (30) $$
Figure 2. Illustration of the geometrical relationships between the raw data points of Table I and the primary and secondary axes resulting from factor analysis of the raw data matrix.
Table II - Eigenvalues and Row-Cofactors Resulting from FA

  From Pure Data Matrix      From Raw Data Matrix
  Eigenvalue 1               Eigenvalue 1      Eigenvalue 2
      770                     767.1514           0.28861
  Row cofactors              Row cofactors     Row cofactors
     1.41421                   1.55487           0.14964
     2.82843                   2.54555           0.01344
     4.24264                   4.24333          -0.11901
     5.65685                   5.58569           0.10021
     7.07107                   7.00063          -0.03374
     8.48528                   8.48367           0.32765
     9.89950                   9.96895           0.26479
    11.31371                  11.24396          -0.15275
    12.72792                  12.65816          -0.14528
    14.14214                  14.21377          -0.13707
Table III - Summary of Model Data and the Results of Factor Analysis

  Artificial Data and Artificial Error                       Results of Factor Analyzing Impure Data Using n Factors
  r x c    n    Range of        Range of         RMS of      RE       IE       RMS of Difference between
                Pure Data       Error            Error                         FA Reproduced Data and Pure Data
  10 x 2   1    1 to 10         -0.2 to 0.2      0.148       0.170    0.120    0.087
  16 x 5   2    2 to 170        -0.08 to 0.08    0.041       0.041    0.026    0.026
  16 x 5   2    2 to 170        -0.99 to 0.91    0.566       0.511    0.323    0.381
  15 x 5   2    0 to 27         -1.9 to 2.0      1.040       0.962    0.608    0.726
  10 x 6   3    0 to 32         -0.9 to 0.9      0.412       0.376    0.266    0.311
  16 x 9   4    -581 to 955     -1.0 to 1.0      0.548       0.499    0.333    0.398
  10 x 9   5    -137 to 180     -1.0 to 1.0      0.463       0.372    0.277    0.400
where k is a constant,

$$ k = \left[ \frac{\lambda^{\circ}}{rc} \right]^{1/2} \qquad (31) $$
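The step from eqs. (18) and (28) to eqs. (30) and (31) can be written out explicitly. This is a sketch under the stated assumption that all secondary eigenvalues share a common value, written here as $\lambda^{\circ}$ without a subscript.

```latex
\begin{align*}
\mathrm{RE} &= \left[\frac{\sum_{j=n+1}^{c}\lambda^{\circ}_{j}}{r(c-n)}\right]^{1/2}
             = \left[\frac{(c-n)\,\lambda^{\circ}}{r(c-n)}\right]^{1/2}
             = \left[\frac{\lambda^{\circ}}{r}\right]^{1/2}, \\
\mathrm{IE} &= \mathrm{RE}\left[\frac{n}{c}\right]^{1/2}
             = n^{1/2}\left[\frac{\lambda^{\circ}}{rc}\right]^{1/2}
             = k\,n^{1/2}.
\end{align*}
```

Under this assumption the RE no longer depends on n once the true factor space has been exceeded, so the growth of the IE with n comes entirely from the factor $n^{1/2}$.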
The imbedded error function should decrease as we employ more and more primary eigenvectors. However, when we use a secondary eigenvector in the reproduction scheme eq. (30) becomes effective. This equation shows that the IE will increase systematically as we use more secondary eigenvectors. The imbedded error function should reach a minimum when we use the exact number of eigenvectors. This minimum can be used to deduce the true factor space. Unfortunately, with real data this behavior will not always occur because the principal component feature of FA tends to exaggerate the nonuniformity of the error distribution and, in fact, places heavy emphasis on bad data points.

Factor Indicator Function

During our investigations we found an empirical function, which we call the factor indicator (IND), which is more sensitive than the IE function in determining the dimensionality of the factor space. This function involves the same variables ($\lambda^{\circ}_{j}$, r, c and n) and is defined as follows:

$$ \mathrm{IND} = \frac{\mathrm{RE}}{(c-n)^{2}} \qquad (32) $$
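Equation (32) is simple to evaluate once the RE is known for each trial number of factors. The sketch below is illustrative only; the function name `factor_indicator` is an assumption made here, and the RE values are those listed for model data A (a 16 x 9 matrix, so c = 9) in Table IV.

```python
def factor_indicator(re_values, c):
    """Return {n: IND} from eq. (32), IND = RE / (c - n)**2.

    re_values[0] is the RE obtained with n = 1 factor, re_values[1] with
    n = 2, and so on; n must stay below c so the denominator is nonzero.
    """
    if len(re_values) >= c:
        raise ValueError("RE is only defined for n < c")
    return {n: re / (c - n) ** 2 for n, re in enumerate(re_values, start=1)}

# RE values for n = 1..8 as listed for model data A (c = 9) in Table IV.
re_model_a = [140.21, 46.27, 1.95, 0.50, 0.45, 0.37, 0.32, 0.25]
ind = factor_indicator(re_model_a, c=9)
best_n = min(ind, key=ind.get)   # trial n at which IND is smallest
print(best_n, round(ind[best_n], 4))
```

Taking the n at which IND is smallest is only one way to read the function; with real data the minimum may be shallow and should be inspected rather than accepted blindly.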
Similar to the IE function, the IND function reaches a minimum when the correct number of eigenvectors is employed.

Testing IE and IND Function Using Model Data

Prior to investigating real chemical data it is important to study the behavior of the IE and IND functions using artificial data, generated mathematically, for which we know every aspect of the pure data and errors introduced. Many such models were studied. Three typical examples are depicted in Table IV. For model data A, the IE function decreases dramatically as we proceed from n = 1 to n = 4. The IE at n = 5 is larger than the value at n = 4. This will occur when we have exceeded the true factor space. Hence we conclude that four factors are present. This conclusion is substantiated by the fact that the IND function shows a minimum at n = 4. Having concluded that there are four factors we then predict that the real error, RE, is approximately 0.50. These conclusions are in excellent agreement with the known facts. The pure data was generated by using four mathematical factors and the RMS of the random error added was 0.55.

For model data B the IE function shows a rapid drop from n = 1 to n = 5 and only a slight decrease from n = 5 to n = 6. This signifies that there are probably only five factors present. More dramatically, the IND function reaches a true minimum at n = 5, substantiating this conclusion. For n = 5 the RE is 0.37, in reasonable agreement with 0.46, the RMS of the error actually introduced. It is important to recognize here that we have deduced the experimental error as well as the size of the factor space without relying upon any a priori knowledge of the error introduced.

Model data C was generated by haphazardly choosing numbers from 1 to 100. When this 10 x 8 data matrix was factor analyzed the results shown in Table IV were obtained. Notice that the IE increases as we proceed from n = 1 to n = 3 and that the IND function increases throughout. Since the data is obviously not one dimensional it must be zero dimensional. In other words no factors are present. The data is not factor analyzable. It is important to recognize here that we have discovered a criterion for deciding whether or not a data matrix is factor analyzable. This is so important that we recommend that it be used routinely as the first step in a factor analysis study.

Applications

The following sequence of studies was conducted using data which previously had been factor analyzed, the results having been reported in the scientific literature.
The purpose was to see whether or not the newly-developed RE, IE and IND functions agreed with the previous conclusions and what new insight could be gleaned.

A. NMR Shifts. Weiner, Malinowski and Levinstone (2) conducted a FA study of the proton shifts of some 14 solutes in 9 common solvents. Using their data matrix we obtained the results shown in Table V. The IE gives evidence that only three factors are involved since only a very small decrease in this function, from 0.33 to 0.32, occurs on going from three to four factors. This conclusion is confirmed by the IND function which reaches a minimum at n = 3. For three factors the real error is predicted to be 0.58 Hz, in excellent agreement with the experimental error which was estimated to be 0.5 Hz. These conclusions agree with those of the original investigators.

When the 19F shifts reported by Abraham, Wileman and Bedford (7) were subjected to factor analysis we obtained the results shown in Table V. Both the IE and the IND functions give strong evidence that not two but three factors are operative.
Table IV - Results of Factor Analyzing Three Different Sets of Artificial Model Data

           Data A                       Data B                       Data C
  n      RE      IE      IND         RE      IE      IND          RE      IE     IND
  1    140.21   46.74   2.1909      48.05   16.02   0.7507       27.00    9.54   0.55
  2     46.27   21.81   0.9444      30.76   14.50   0.6277       24.66   12.33   0.69
  3      1.95    1.13   0.0543      21.09   12.18   0.5859       21.88   13.40   0.88
  4      0.50    0.33   0.0200       8.83    5.89   0.3532       18.51   13.09   1.16
  5      0.45    0.34   0.0283       0.37    0.28   0.0233       14.42   11.40   1.60
  6      0.37    0.30   0.0409       0.29    0.24   0.0323       11.02    9.54   2.75
  7      0.32    0.28   0.0797       0.15    0.13   0.0376        6.47    6.05   6.47
  8      0.25    0.24   0.2543       0.11    0.10   0.1084         -       -      -

  A - 16x9 four factor data matrix with values ranging from -581.00 to 955.00 with an RMS error equal to 0.55.
  B - 10x9 five factor data matrix with values ranging from -137.75 to 180.14 with an RMS error equal to 0.46.
  C - 10x8 data matrix consisting of numbers chosen haphazardly from 1 to 100.

Table V - Results of Factor Analyzing 1H and 19F Nuclear Magnetic Resonance Chemical Shifts of Various Solutes in a Variety of Solvents.

           PROTON SHIFTS a)                      FLUORINE SHIFTS b)
  n    RE (Hz)   IE (Hz)   IND x 10^2       RE (ppm)   IE (ppm)   IND x 10^3
  1     2.32      0.77       3.62            0.767      0.271       15.6
  2     1.12      0.53       2.29            0.077      0.038        2.1
  3     0.58      0.33       1.60            0.035      0.021        1.4
  4     0.48      0.32       1.93            0.027      0.019        1.7
  5     0.37      0.27       2.30            0.021      0.016        2.3
  6     0.29      0.24       3.26            0.015      0.013        3.7
  7     0.27      0.24       6.79            0.014      0.013       14.1
  8     0.22      0.21      22.43              -          -           -

  a) Proton shifts (relative to internal TMS) of 14 solutes in 9 solvents. Data matrix was identical to Table I of Weiner, Malinowski and Levinstone, J. Phys. Chem., 74, 4537 (1970). The error was reported to be approximately 0.5 Hz.
  b) Fluorine shifts (corrected for bulk-susceptibility effects using an external standard) of 14 nonpolar solutes in 8 nonpolar solvents. Data taken in part from Table III of Abraham, Wileman and Bedford, J.C.S. Perkin II, 1027 (1973). The following solutes were deleted from the data matrix: C6H5CF3, CF3CHClBr and C^Fj^. The error was reported to be approximately 0.035 ppm.
Furthermore, for three factors, the RE value 0.035 is in perfect agreement with the reported error. Without recourse to FA, Abraham and co-workers assumed that only two factors, the gas-phase shift and the van der Waals interaction, were responsible for the solvent shifts. FA gives clear evidence that a third factor, although small, makes a measurable contribution. Target transformation FA should be helpful in identifying the physical significance of this elusive factor.

B. Spectrophotometric Absorbances. Bulmer and Shurvell (8) factor analyzed the infrared spectra of the carbonyl region of acetic acid and trichloroacetic acid in CCl4 solution. Their data matrix consisted of the absorbances of 9 solutions of different concentrations measured at 200 and 301 different wavenumbers, respectively. Using their reported eigenvalues we calculated the RE, IE and IND functions which are listed in Table VI. For acetic acid, only a small decrease in the IE (from 0.00048 to 0.00045) occurs on going from four to five factors, and the IND reaches a minimum at n = 4. Hence, four factors are present.

Unfortunately, for trichloroacetic acid the IE function exhibits neither a minimum nor a leveling off. No conclusions can be reached on this basis. Fortunately, however, the IND gives evidence that four factors are present since it reaches a minimum at n = 4. The real errors for n = 4, calculated to be 0.00072 and 0.00123, respectively, are within the error range which was estimated to be between 0.0005 and 0.0015 absorbance units. Furthermore, the present conclusion that four factors are responsible for the data is in complete accord with the conclusions of Bulmer and Shurvell based upon other statistical criteria dependent upon a knowledge of the error.

Table VI - Results of Factor Analyzing Digitized Infrared Spectra of the Carbonyl Region of Acetic Acids in CCl4 Solutions

           Acetic Acid a)                       Trichloroacetic Acid b)
  n      RE        IE       IND x 10^5        RE        IE       IND x 10^5
  1    0.01461   0.00487     22.83          0.02992   0.00997     46.74
  2    0.00284   0.00134      5.80          0.00520   0.00245     10.61
  3    0.00174   0.00100      4.82          0.00296   0.00171      8.21
  4    0.00072   0.00048      2.90          0.00123   0.00082      4.92
  5    0.00060   0.00045      3.75          0.00086   0.00064      5.39
  6    0.00046   0.00038      5.13          0.00060   0.00049      6.64
  7    0.00036   0.00032      9.12          0.00045   0.00039     11.15
  8    0.00024   0.00023     23.90          0.00029   0.00028     29.^0

  a) Based on a study made by Bulmer and Shurvell, J. Phys. Chem., 77, 256 (1973). Data matrix consisted of the absorbances of 9 solutions of different concentrations measured at 200 different wavenumbers. The error was estimated to be between 0.0005 and 0.0015 absorbance units.
  b) Based on a study made by Bulmer and Shurvell, Canad. J. Chem., 53, 1251 (1975). Data matrix consisted of the absorbances of 9 solutions of different concentrations measured at 301 different wavenumbers. The error was estimated to be between 0.0005 and 0.0015 absorbance units.

C. Mass Spectra. Ritter and co-workers (4) factor analyzed the mass spectral intensities of various mixtures of the same components. Their purpose was to use FA to deduce the number of components in the related mixtures. Their criteria depended upon a knowledge of the experimental error. When their data matrix, which consisted of the intensities of 4 different mixtures at 20 m/e positions, was factor analyzed, we obtained the results shown in Table VII. The IE shows only a very slight decrease on going from 2 to 3 factors. The IND is a minimum at n = 2. Our conclusion that there are two factors agrees with the known fact that only two components are present.

When the mass spectral data concerning 7 different mixtures of cyclohexane and hexane were analyzed the results given in Table VII were obtained. In this case both the IE and IND functions clearly indicated that not two but three components are present. Ritter and co-workers suggested that the unexpected third factor was due to nitrogen contamination.
When the intensities of the m/e 28 peaks were removed from the data matrix the IE and IND functions, as shown in Table VII, showed that only two factors remained. This confirmed the suspicion that nitrogen was present. This also illustrates the sensitivity of the IE and IND functions. The deletion of the m/e 28 data involved the removal of only 7 data points out of 126. The amount of nitrogen present was extremely small and only contributed partially to the m/e 28 intensities.

The RE calculated for the three data matrices were 0.154, 0.128 and 0.134, respectively. These errors are almost three times the error, 0.05, reported by Ritter and co-workers. The reported error was based solely upon the error involved in reading the intensities from the spectra. The real error from FA is a composite from all sources of error, including operational and instrumental variations as well as reading errors.

Table VII - Results of Factor Analyzing Mass Spectral Intensities of a Series of Related Mixtures a)

       cyclohexane/cyclohexene         cyclohexane/hexane c)          cyclohexane/hexane
       mixtures b)                                                    without m/e 28 d)
  n    RE      IE       IND x 10^2     RE      IE      IND x 10^3     RE      IE      IND x 10^3
  1   1.932   0.9660     21.47        1.810   0.684     50.27        1.812   0.685     50.35
  2   0.154   0.1092      3.86        0.465   0.249     18.62        0.134   0.071      5.36
  3   0.118   0.0969     11.18        0.128   0.084      8.03        0.106   0.070      6.65
  4     -       -          -          0.111   0.084     12.30        0.092   0.070     10.25
  5     -       -          -          0.098   0.073     24.56        0.072   0.061     18.08
  6     -       -          -          0.074   0.068     73.51        0.058   0.054     58.10

  a) Data taken from the work of Ritter, Isenhour and Wilkins, Anal. Chem., 48, 591 (1976).
  b) Data matrix consisted of the intensities of 4 different mixtures of cyclohexane and cyclohexene measured at 20 m/e positions.
  c) Data matrix consisted of the intensities of 7 different mixtures of cyclohexane and hexane measured at 18 m/e positions.
  d) The intensities of m/e 28 were deleted from the matrix described in c) leaving only 17 m/e positions.

D. Gas-Liquid Chromatography. Selzer and Howery (9) factor analyzed the gas-liquid chromatographic retention indices of ethers and found that six abstract factors reproduced the data satisfactorily. Using a data matrix concerning some 22 ethers on 18 chromatographic columns we obtained the results presented in Table VIII. No conclusions can be drawn from the IE function because IE decreases continuously throughout the entire range of factors, exhibiting neither a minimum nor a leveling off. The IND function, on the other hand, shows a single, true minimum at n = 6 in accord with the conclusion of Selzer and Howery. The RE for six factors is 2.76, in excellent agreement with the experimental error estimation that the uncertainty is no greater than 3 retention indices.

Table VIII - Results of Factor Analyzing the GLC Retention Indices of 22 Ethers on 18 Chromatographic Columns. a,b)

   n     RE      IE      IND           n     RE     IE      IND
   1   22.28    5.25   0.07708        10    1.40   1.04   0.02187
   2    7.25    2.42   0.02831        11    1.25   0.98   0.02553
   3    5.30    2.16   0.02354        12    1.07   0.87   0.02975
   4    4.06    1.91   0.02070        13    0.94   0.80   0.03748
   5    3.24    1.71   0.01915        14    0.73   0.65   0.04586
   6    2.76    1.59   0.01914        15    0.69   0.63   0.07618
   7    2.42    1.51   0.01997        16    0.61   0.58   0.15261
   8    2.05    1.36   0.02045        17    0.59   0.57   0.59012
   9    1.71    1.21   0.02114

  a) This problem was suggested by D. G. Howery, private communication.
  b) Data was taken from W. O. McReynolds, "Gas Chromatographic Retention Data," Preston Technical Abstracts Co., Niles, Ill., (1966). The experimental error was estimated to be no greater than 3 retention indices.

E. Drug Activity. Weiner and Weiner (5) were the first to use factor analysis to investigate biological drug activity. Using the drug data of Keasling and Moffett (10), they factor analyzed the natural logarithm of 11 different biological responses of 16 structurally related drugs. Because of the nature of the tests no error estimation was reported. Using an arbitrary error criterion, Weiner [...]. Using the same data [...] logarithmic function, we obtained the results shown in Table IX. Notice here that the IE increases as n goes from 1 to 3. Also notice that the IND function blows up as we employ more eigenvectors. This behavior is identical to that which we observed during our analysis of a perfectly random matrix of numbers (see Model Data C in Table IV). We conclude, therefore, that this drug data is not factor analyzable.

There are two possible reasons for this failure. First, the experimental error may be too large.
Secondly, it is possible that the logarithm of the drug activity does not obey the sum of product functions demanded by the mathematics involved in the FA approach. The acquiring of accurate experimental drug activity data should help eliminate one of these two possibilities. The results presented here, however, warn the analyst that something is wrong. To search for the true controlling factors via target-transformation FA is risky and could be a waste of valuable time.

Table IX - Results of Factor Analyzing the Natural Logarithm of the Biological Drug Activity of a Series of Structurally Related Compounds. a,b)

   n     RE      IE      IND           n     RE      IE      IND
   1   0.453   0.136   0.00453         6   0.191   0.141   0.00766
   2   0.370   0.158   0.00456         7   0.155   0.124   0.00970
   3   0.311   0.162   0.00486         8   0.134   0.114   0.01488
   4   0.266   0.160   0.00543         9   0.089   0.080   0.02225
   5   0.226   0.152   0.00627        10   0.085   0.081   0.08462

  a) The original FA was carried out by Weiner and Weiner, J. Med. Chem., 16, 665 (1973).
  b) The drug data was obtained by Keasling and Moffett, J. Med. Chem., 14, 1106 (1971).

Remarks

It is important for us to realize that the work described in the latter part of this paper represents a bold attempt to deduce both the dimensions of the factor space and the experimental error strictly from a knowledge of the experimental data. The methodology is so new at the present time that its limitations have not been established. Obviously there will exist many situations wherein the IE and IND functions will give misleading results or will fail completely. However, systematic accumulation and documentation of both the successes and the failures of these criteria should eventually lead us to a better understanding of the utility [...]

Acknowledgment

The author wishes to express his thanks to Harry Rozyn and John Petchul for help in carrying out many of the calculations.

Literature Cited
1. Funke, P.T., Malinowski, E.R., Martire, D.E., and Pollara, L.Z., Separation Sci., 1, 661 (1966).
2. Weiner, P.H., Malinowski, E.R., and Levinstone, A.R., J. Phys. Chem., 74, 4537 (1970).
3. Hugus, Z.Z., Jr., and El-Awady, A.A., J. Phys. Chem., 75, 2954 (1971).
4. Ritter, G.L., Lowry, S.R., Isenhour, T.L., and Wilkins, C.L., Anal. Chem., 48, 591 (1976).
5. Weiner, M.W. and Weiner, P.H., J. Med. Chem., 16, 665 (1973).
6. Duewer, D.L., Kowalski, B.R. and Fasching, J.L., Anal. Chem., 48, 2002 (1976).
7. Abraham, R.J., Wileman, D.F., and Bedford, G.R., J.C.S. Perkin II, 1027 (1973).
8. Bulmer, J.T. and Shurvell, H.F., J. Phys. Chem., 77, 256 (1973); Canad. J. Chem., 53, 1251 (1975).
9. Selzer, R.B. and Howery, D.G., J. Chromatography, 115, 665 (1973).
10. Keasling, H.H. and Moffett, R.B., J. Med. Chem., 14, 1106 (1971).
In Chemometrics: Theory and Application; Kowalski, B.; ACS Symposium Series; American Chemical Society: Washington, DC, 1977.
4
The Unique Role of Target-Transformation Factor Analysis in the Chemometric Revolution
DARRYL G. HOWERY
Department of Chemistry, City University of New York, Brooklyn College, Brooklyn, NY 11210
A mathematical-analysi[...] occurring in chemistry a logical manifestation of the revolution. By adapting a battery of mathematical/statistical techniques to high speed computers, researchers in chemometrics can extract new and heretofore unobtainable insights into large, multifactor data sets. Factor analysis, a major weapon of the revolution, is proving to be a versatile, general method for analyzing matrices of chemical data. In particular, the target-transformation method of factor analysis (1,2), which enables one to test empirical and theoretical models, offers powerful and unique potentialities for obtaining partial and even complete solutions to many kinds of chemical problems. The main objective of this presentation is to summarize the distinctive attributes of target-transformation factor analysis (TTFA).

Factor analytical solutions are of a form nicely adapted to chemistry. A data point, $d_{ij}$, in a data matrix is expressed as a linear sum of factors, each factor being the product of a row-designee cofactor and a column-designee cofactor. Mathematically, factor analytical solutions obey the equation:
$$d_{ij} = \sum_{m=1}^{n} r_{im}\, c_{mj} \qquad (1)$$

where $n$ is the minimum number of factor terms, $m$, required to adequately predict the data, and $r_{im}$ and $c_{mj}$ are the cofactors for the $i$th row designee and the $j$th column designee, respectively, associated with the $m$th factor. The central purpose of a TTFA is to derive information about the two sets of cofactors not only in an abstract (mathematical) sense but also in a real (physically significant) sense. In matrix notation, the data matrix is

$$[D] = [R][C] \qquad (2)$$

where [R] is the row matrix containing a row for each row
designee and a column for each cofactor, and [C] is the column matrix having a column for each column designee and a row for each cofactor. Solutions of the type indicated by equations (1) and (2) are especially suited to studies of entity-entity data matrices. The abstract factors are related in some manner to those real factor terms which measurably influence the data. Even more conveniently, each cofactor pair in a given factor can be transformed via target transformation to specific properties of the row designees and the column designees which are responsible for the factor. In a solute-solvent problem, for example, the factors correspond to the important energies of interaction and the cofactors pinpoint the nature of the interactions in terms of the properties of the pure solute and the pure solvent. (In usual [...] chemical particles such as molecules, ions and radicals, but also, e.g., biological species, persons, political groups and celestial bodies.)

Factor analytical [...] transformation technique and/or upon what we term the abstract factor analytical approach. In abstract (traditional) FA, abstract solutions are obtained under various mathematical constraints. The investigator then tries to gain insight by examining the coefficients in the matrices generated in the abstract solution.
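The form of equations (1) and (2) can be illustrated with a small numerical sketch. The cofactor values below are hypothetical, chosen only to show that each data point is a sum over factors of a row-designee cofactor times a column-designee cofactor, and that the product [R][C] reproduces every such point. The function names `data_point` and `data_matrix` are illustrative and do not come from any program cited in this paper.

```python
# Hypothetical cofactors for 3 row designees, 4 column designees, n = 2 factors.
R = [[1.0, 0.5],          # one row per row designee, one column per factor
     [2.0, 1.5],
     [0.0, 3.0]]
C = [[2.0, 1.0, 0.0, 4.0],  # one row per factor, one column per column designee
     [1.0, 3.0, 2.0, 0.5]]


def data_point(R, C, i, j):
    """Equation (1): d_ij as the sum over factors m of r_im * c_mj."""
    return sum(R[i][m] * C[m][j] for m in range(len(C)))


def data_matrix(R, C):
    """Equation (2): the full data matrix [D] = [R][C]."""
    return [[data_point(R, C, i, j) for j in range(len(C[0]))]
            for i in range(len(R))]


D = data_matrix(R, C)
# Row designee 2 with column designee 4: 2.0 * 4.0 + 1.5 * 0.5
print(D[1][3])
```

In an actual factor analysis the direction is reversed: [D] is measured, and [R] and [C] are what the analysis seeks.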
Abstract FA, long used in the psychological and social sciences, can be gainfully applied to certain types of chemical problems (3,4). The target-transformation method of Malinowski (1) opens new territory by enabling the researcher to test parameters of the row and column designees of the matrix. The TTFA extension, in allowing one the possibility of transforming from abstract factors to real factors of the designees, alleviates for the physical scientist a major weakness of abstract FA.

The steps in a complete TTFA (data preparation, reproduction, target transformation, combination and prediction) have been discussed elsewhere (5). The number of factors required in equation (1) can be estimated in the short-circuit reproduction procedure using both experimental-error and theoretical criteria, as was deftly explained in the previous talk (6). The model-testing capability of the target transformation step is the heart of factor analysis for the physical scientist. A best complete model, i.e., the best real solution, is generated in the combination step.

TTFA has been thoroughly tested during the past six years. Howery (5) and Weiner (7), in recent reviews which complement each other, consider the philosophy, theory, procedures and applications of TTFA. Details of the mathematical development are given in the already classic paper of Malinowski and coworkers (2).

TTFA can be utilized using a blend of theoretical and empirical insights. Important TTFA's based at least in part
on a theoretical framework include the study of solute-solvent interactions influencing proton chemical shifts (2,8), the verification of gas chromatographic retention mechanisms (9), and the elucidation of solute-solvent effects on acidity constants (10). These papers illustrate the exceptional power of TTFA if the factor analyst starts with some theoretical help. In such cases, in-depth fundamental solutions can be achieved. However, for most chemical problems, theoretical insight is minimal. Thus, the second way of using TTFA, involving a more empirical approach, has an even wider applicability in chemistry. Examples of empirical solutions include a detailed study of ether cofactors (11) (one of a series of researches on the solute cofactors influencing retention indices), and an investigation of solvent-metal effects on polarographic half-wave potentials (12). These studies show the potential for using TTFA to furnish useful empirical solutions in fields devoid of a theoretical underpinning.

Target Transformations

Unique Features. The unique attributes of the target transformation procedure center around the model-testing and model-building capabilities of TTFA. No other mathematical/statistical method shows such promise for extracting real cofactors and for developing complete solutions to multifactor problems.

1) Potential cofactors are separated and tested mathematically regardless of the complexity of the data space.
Any parameter of either the row or column designees can be investigated independently. Single terms in a theoretical or empirical model can be tested one at a time even though the other cofactors in the space are operative, an unmatched accomplishment of TTFA. Target transformation serves to curve fit vectors of parameters in a multifactor space.

2) Restrictions on the procedure are minimal. No knowledge of the other cofactors is required. One can start with complete ignorance of the nature of the real cofactors, in marked contrast with multiple regression analysis, which is applicable only if a complete model is specified. Furthermore, the real vectors to be tested need not be complete, a tremendous practical advantage since the data store for most types of chemical information is usually incomplete. Missing or uncertain points can be left blank on a test vector (a procedure termed "free floating"). Such points will be predicted as a premium in successful target transformations.

3) The separation of the factor analytical solution into two parts as shown in equation (2) enables one to build up solutions for the two kinds of designees independently. Even if the problem in terms of one kind of designee appears hopelessly complex, it is still possible to derive a solution for
the other kind of designee. Two complete solutions can be developed term by term and each possible real solution involving sets of cofactors can be tested in the combination step. If not all of the factors in the abstract solution are spanned in a given combination, the reproduction via combination will be poor, indicating the sensitivity and the non-force-fitting nature of the step.

Two Examples. Two typical examples from recent research will illustrate the scope of the target-transformation approach for testing parameters of the designees. Computations were carried out on an I.B.M. 370/168 digital computer using a computer program in FORTRAN IV which has evolved over the decade (13). Vectors to be tested can contain data of any type which the researcher thinks might be indicative of the behavior of the designees (and hence possibly responsible for a cofactor). Both physical vectors (illustrate[...] vectors (exemplified b[...] Descriptors used in pattern recognition studies have much in common with the test vectors of TTFA.

The essential question in evaluating the success of a target transformation is how well the best-fit predicted vector calculated from the least-squares method (2) agrees point-by-point with the real vector being tested. If the two vectors are reasonably similar, the test vector is taken to be a real cofactor. The examples to be cited involve, for pedagogical purposes, a difficult-to-interpret result and an unsuccessful transformation.
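The least-squares test described above, including free-floated points, can be sketched as follows. This is a minimal illustration under stated assumptions, not the FORTRAN program cited in the text: the abstract row-factor matrix `R_abs` holds hypothetical numbers (in a real TTFA it would come from the abstract reproduction step), free-floated points are marked with `None`, and the fit is obtained by solving the normal equations with a small Gaussian-elimination helper. All function names are illustrative.

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[k]] for k, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def target_transform(R_abs, test):
    """Fit the test vector to the abstract factors using only the known points,
    then return the predicted vector for every designee (known and free-floated)."""
    known = [i for i, t in enumerate(test) if t is not None]
    n = len(R_abs[0])
    AtA = [[sum(R_abs[i][a] * R_abs[i][b] for i in known) for b in range(n)]
           for a in range(n)]
    Atb = [sum(R_abs[i][a] * test[i] for i in known) for a in range(n)]
    v = solve(AtA, Atb)  # transformation vector
    return [sum(row[m] * v[m] for m in range(n)) for row in R_abs]


# Hypothetical abstract factors for 5 designees in a 2-factor space.
R_abs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
# Test vector with the fourth point free-floated.
test = [3.0, -1.0, 2.0, None, 1.0]
predicted = target_transform(R_abs, test)
print(predicted)
```

Here the known points lie exactly in the factor space, so the predicted vector matches them and supplies a value for the free-floated point. With real data the agreement is judged point-by-point, as the text describes.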
The first example is taken from a TTFA of the retention indices of organic solutes on stationary-phase solvents (14). Studies of retention indices have amply demonstrated the ability of TTFA to isolate cofactors in problems far too complicated for detailed theoretical treatments. Whereas solutes have been studied in detail, this is the first in-depth investigation of GLC solvents using TTFA. To better examine the cofactors of the solvents, only monomeric solvents were selected. (Previous TTFA's of retention indices have involved relatively complex, polymeric solvents for which test vectors are difficult to formulate.) The specific parameter tested by target transformation in this example is the molar refraction, a vector which has generally tested well as a solute cofactor. As shown in Table I, agreement between the test vector and the predicted vector is overall moderately good at best. Values for three deliberately free-floated points are predicted reasonably well. The molar refraction may be a cofactor; such borderline conclusions are common in TTFA.

The second example is selected to illustrate the manner in which structural vectors based on chemical insight can be employed to track down cofactors. Such vectors are especially useful for developing empirical solutions. The example involves bond dissociation energies for radical-radical bonds (15), a most basic type of chemical data. The test vector shown in Table II is designed to ascertain if the group of halogen radicals (assigned test values of "1") is responsible for a unique cofactor, i.e., a cofactor not exhibited by the remaining group of radicals (given test values of "0"). Such cofactors can be tested without knowing the theoretical form of the interaction term. As can be seen in Table II, the target transformation is clearly unsuccessful; a unique halogen interaction is not a cofactor.

Table I - Target transformation of a physical test vector.
Data matrix: retention indices for 39 carbonyl-containing solutes on 18 monomeric stationary-phase solvents, data taken from reference 16.
Test vector: molar refraction estimated by summing specific refractions, TT in 6-factor space, free-floated points shown in parentheses.

Solvent                           Test Vector   Predicted Vector
bis(2-ethoxyethyl) phthalate        (79.7)          101.0
dibutyltetrachloro phthalate         96.6            98.6
di-2-ethylhexyl adipate             107.7           138.2
di-2-ethylhexyl sebacate            126.3           133.8
diglycerol                           37.7            39.0
diisodecyl phthalate                134.7           110.3
dioctyl phthalate                   113.6           105.9
dioctyl sebacate                    126.3           132.0
Flexol 8N8                          139.3           125.4
Hallcomid M18                      (140.8)          129.7
Hyprose SP80                        203.7           163.0
isooctyldecyl adipate               119.6           134.1
Quadrol                              76.3           117.9
sucrose acetate isobutyrate         192.4           176.8
sucrose octaacetate                 141.3           166.2
TMP tripelargonate                  160.5           154.5
tricresyl phosphate                 103.0            84.5
Zonyl E7                           (134.8)          117.6

Table II - Target transformation of a structural test vector.
Data matrix: bond dissociation energies involving 12 radicals with the same 12 radicals, data taken from compilation of A. Zavitsas (17).
Test vector: halogen uniqueness, TT in 3-factor space.

Radical       Test Vector   Predicted Vector
methyl           0.00           0.02
ethyl            0.00           0.04
isopropyl        0.00           0.06
t-butyl          0.00           0.06
phenyl           0.00           0.08
benzyl           0.00           0.04
fluorine         1.00           0.97
chlorine         1.00           0.34
bromine          1.00           0.17
hydroxy          0.00           0.27
methoxy          0.00           0.28
amine            0.00           0.47

Ramifications of TTFA

Combination. Sets of vectors can be utilized simultaneously in the combination step to find the best empirical solution. This step of FA is similar to multiple regression analysis in that complete models are utilized, but different from regression analysi[...] model be force fitted. [...] model, the sensitive combination step will lead to a poor reproduction. Even so, solutions having errors less than twice experimental error have been developed from thorough factor analyses of retention indices using the empirical approach. For example, in the problem referred to in the first TT example above, the best solution to the solvent part of the complex problem gave an error of 7.1 r.i. units; the empirical model can predict r.i.'s with an error of about one percent.

Prediction. The predictive ability of TTFA has as yet received little attention.
To illustrate the potential of this step, consider the prediction of a new row of data based on the best empirical solution obtained in the combination step. To calculate a new data point associated with an added row designee, x, and a column designee from the original data matrix, j, a modified form of equation (1) is employed:

$$d_{xj} = \sum_{m=1}^{n} r_{\mathrm{real},xm}\, c_{\mathrm{calc},mj} \qquad (3)$$
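A prediction of this kind can be sketched numerically. The sketch assumes that the calculated column matrix is obtained from equation (2) by ordinary least squares, given a real row matrix and the data matrix; the source does not spell out the numerical method, so this choice, the hypothetical numbers, and the function names are all illustrative.

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[k]] for k, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def column_cofactors(R_real, D):
    """Least-squares [C_calc] from [D] = [R_real][C_calc], one data column at a time."""
    n = len(R_real[0])
    rows = range(len(R_real))
    RtR = [[sum(R_real[i][a] * R_real[i][b] for i in rows) for b in range(n)]
           for a in range(n)]
    C_calc = [[0.0] * len(D[0]) for _ in range(n)]
    for j in range(len(D[0])):
        Rtd = [sum(R_real[i][a] * D[i][j] for i in rows) for a in range(n)]
        column = solve(RtR, Rtd)
        for m in range(n):
            C_calc[m][j] = column[m]
    return C_calc


def predict_row(r_new, C_calc):
    """Equation (3): d_xj as the sum over m of r_real,xm * c_calc,mj for each column j."""
    return [sum(r_new[m] * C_calc[m][j] for m in range(len(r_new)))
            for j in range(len(C_calc[0]))]


# Hypothetical real cofactors for 3 original row designees (n = 2) and their data.
R_real = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
D = [[2.0, 1.0], [1.0, 3.0], [3.0, 4.0]]
C_calc = column_cofactors(R_real, D)
# Real cofactor values of a new row designee that was not in the data matrix.
new_row = predict_row([2.0, 1.0], C_calc)
print(new_row)
```

Only the cofactor values of the new designee and the calculated column matrix enter the prediction; no data for the new designee are needed.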
The row-designee cofactors in equation (3), $r_{\mathrm{real},xm}$, are those key vectors from the best solution via combination, while the column-designee cofactors, $c_{\mathrm{calc},mj}$, are coefficients in the $[C_{\mathrm{calc}}]$ matrix (which is readily calculated using equation (2), given a solution $[R_{\mathrm{real}}]$ and the data matrix). To calculate the new datum, only the values of the n real cofactors for the new designee and the n coefficients in the jth column of the calculated column matrix are required. For example, in the study of the cofactors of ethers (11), a real solution having the following six vectors: carbon number, total atom number, chain difference,
chain ratio, total atom number squared, and boiling point squared, produced an error of only 5.4 r.i. units. The retention index of t-butyl methyl ether (a solute not incorporated in the original matrix) on butyltetrachloro phthalate (a solvent in the original matrix) is predicted using equation (3) to be 613 r.i. units, nicely in agreement with the experimental value of 609 ± 3 units. The mean error for the entire new row involving the new ether with the 25 original solvents is 3.5 units. Such satisfactory predictions indicate that the empirical solution quite adequately spans all of the solute cofactors.

The examples put forth during this presentation demonstrate clearly that the TTFA approach can be utilized to thoroughly characterize a chemical data space. Target transformation methodology seems destined to play a leading and unique role in the chemometric revolution.

Literature Cited
1. Malinowski, E. R., Doctoral Dissertation, Stevens Inst. Technology, Hoboken, N. J., 1961.
2. Weiner, P. H., Malinowski, E. R., and Levinstone, A., J. Phys. Chem., (1970), 74, 4537.
3. Bulmer, J. T., and Shurvell, H. F., J. Phys. Chem., (1973), 77, 256.
4. Rozett, R. W., and Petersen, E. M., Anal. Chem., (1976), 48, 817.
5. Howery, D. G., Amer. Lab., (1976), 8(2), 14.
6. Malinowski, E. R., in "Chemometrics: Theory and Applications," B. R. Kowalski, Ed., A. C. S. Symposium Series, p. xxx, 1977.
7. Weiner, P. H., Chem. Tech., in press.
8. Weiner, P. H., and Malinowski, E. R., J. Phys. Chem., (1971), 75, 3160.
9. Weiner, P. H., Liao, H. L., and Karger, B. L., Anal. Chem., (1974), 46, 2182.
10. Weiner, P. H., J. Amer. Chem. Soc., (1973), 95, 5845.
11. Selzer, R. B., and Howery, D. G., J. Chromatogr., (1975), 115, 139.
12. Howery, D. G., Bull. Chem. Soc. Japan, (1972), 45, 2643.
13. Malinowski, E. R., Howery, D. G., Weiner, P. H., Soroka, J. M., Funke, R. T., Selzer, R. B., and Levinstone, A., "FACTANAL - Target-Transformation Factor Analysis," Program 320, Quant. Chem. Prog. Exch., Indiana Univ., Bloomington, Ind., 1976.
14. Soroka, J. M., and Howery, D. G., to be submitted.
15. Howery, D. G., to be submitted.
16. McReynolds, W. O., "Gas Chromatographic Retention Data," Preston Tech. Abstracts Co., Niles, Ill., 1966.
17. Zavitsas, A., Long Island Univ., Brooklyn, N. Y., private communication.
5
Application of Factor Analysis to the Study of Rain Chemistry in the Puget Sound Region
ERIC J. KNUDSON, DAVID L. DUEWER, and GARY D. CHRISTIAN
Department of Chemistry, University of Washington, Seattle, WA 98195
TIMOTHY V. LARSON
Department of Atmospheric Sciences, University of Washington, Seattle, WA 98195
As part of a study on "acid rain" at the University of Washington, we have investigated the applications of factor analysis to understanding the variations in chemical compositions of rain samples gathered during single storms in the Puget Sound region. The determination of the sources of such variation, and the extent of their influence, can aid not only in understanding air chemistry in general, but also in making future judgments concerning man's activities which affect atmospheric chemistry.

The collection of rain samples has several points to recommend it as an ideal means for sampling the chemical state of the atmosphere. It is relatively inexpensive and simple to do. Much of the aerosol content of air, as well as many gaseous constituents, is apparently removed by precipitation (1). Rain consists primarily of distilled water in which a relatively uniform background spectrum of chemical constituents (mainly sea salts) is dissolved. And local variations in the chemical composition of rain samples are primarily a function of local variations in atmospheric chemistry due to local sources, as mediated by the rates of rainout and washout and variations in rain volume.

The data used in this paper are from analyses of rain samples gathered over a 24 hour period during a single storm in 1975. The factor analysis techniques used are generally widely known and are available, at least in part, in a number of computer data analysis systems. The system used in this study, ARTHUR, is available from the Laboratory for Chemometrics at the University of Washington (2).

Experimental

Rain samples were collected in washed polyethylene buckets placed at pre-selected sites in the Puget Sound region (see map, Figure 1).
After a sampling period of 24 hours during a single storm, the samples were returned to the laboratory and transferred to washed polyethylene bottles, from which aliquots were taken for analysis. Species analyzed for were sodium, potassium, calcium, magnesium, zinc, copper, cadmium, manganese, lead, arsenic, and antimony, as well as hydrogen, ammonium, nitrate, sulfate, and chloride ions. Details of analytical procedures used (atomic absorption spectrometry for the metals and metalloids, and other methods for the other ions) are discussed elsewhere (3,4).

Factor analysis techniques used were the following: derivation of the correlation matrix and a hierarchical dendrogram based on it, eigenvector and varimax-rotated vector representations, and the derivation of "features" from the eigenvector and varimax-vector representations via the Karhunen-Loeve transform (and an analogous transform utilizing the varimax-rotated vectors). Analytical error perturbations on the original data set were used to evaluate the stability of the eigenvector and varimax-rotated vector representations. Creation of contour maps of single species and of multivariabl[...] plished through the use [...] developed at the Laboratory for Computer Graphics of the Harvard University Graduate School of Design.

Results and Discussion

Single Species Maps. A summary of the concentrations of the species found in the rain samples is given in Table I, and representative contour maps of several of these species are given in Figures 2 through 10, plotted over the geographic sampling area.
The predominant movement of this storm, as that of most storms through the Puget Sound region, was from generally southwest to northeast. Hence, higher concentrations of a species in a given area are likely to be indicative of a source for that species lying to the southwest of the elevated concentrations. It can be seen, for example, that there appears to be a source of both arsenic and antimony (Figures 2 and 3) in or near Tacoma. In this case, the source, well known to residents of the region, is a large copper smelter located in Tacoma. The effects of this source are also apparent in copper (Figure 4) and, to a certain extent, cadmium (Figure 5) maps. Hydrogen ion concentrations (Figure 6) are elevated downwind of both Seattle and Tacoma, a not unexpected phenomenon, but sulfate ion concentrations (Figure 7) seem indicative of no particular source. This is somewhat contrary to observations on two earlier storms sampled (and to several other reported studies on "acid rain"), where hydrogen and sulfate ion concentrations were found to be strongly correlated. In this particular storm, however, it is nitrate ion concentrations (Figure 8) which appear to be associated with increased hydrogen ion concentrations. The contour map for zinc concentrations (Figure 9) shows no identifiable anthropogenic influence, while the contour map for lead (Figure 10) illustrates the problem encountered when the
"plume" lengths are considerably shorter than the sampling distances. Lead concentrations are undoubtedly associated with traffic patterns, and the map is dominated by "hot spots," probably indicative of how close the sampling bucket was placed to areas of heavy traffic.

Correlation Matrix and Hierarchical Dendrogram. The correlation matrix is readily derived from the normalized data matrix, $D_n$, by multiplying the transpose of this matrix by the matrix $D_n$, i.e., by forming $D_n^{T} D_n$.
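The product of the transposed normalized data matrix with itself can be sketched in a few lines. The sketch assumes that "normalized" means each column (species) is mean-centered and scaled to unit length, which is the scaling under which the product yields correlation coefficients; the source does not state the normalization explicitly. The concentrations below are hypothetical and the function names are illustrative; they are not part of ARTHUR.

```python
import math


def normalize_columns(data):
    """Mean-center each column and scale it to unit length (assumed normalization).
    A column with no variation would give a zero length and is not handled here."""
    n_rows, n_cols = len(data), len(data[0])
    normalized = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        column = [row[j] for row in data]
        mean = sum(column) / n_rows
        length = math.sqrt(sum((v - mean) ** 2 for v in column))
        for i in range(n_rows):
            normalized[i][j] = (column[i] - mean) / length
    return normalized


def correlation_matrix(data):
    """Form Dn^T Dn, where Dn is the column-normalized data matrix."""
    dn = normalize_columns(data)
    n_cols = len(dn[0])
    return [[sum(row[a] * row[b] for row in dn) for b in range(n_cols)]
            for a in range(n_cols)]


# Hypothetical data: 4 samples (rows) by 4 species (columns).
samples = [[1.0, 2.0, 4.0, 1.0],
           [2.0, 4.0, 3.0, 3.0],
           [3.0, 6.0, 2.0, 2.0],
           [4.0, 8.0, 1.0, 4.0]]
corr = correlation_matrix(samples)
for row in corr:
    print([round(r, 3) for r in row])
```

The diagonal elements equal one, and each off-diagonal element is the correlation coefficient between two species across the samples.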
The correlation matrix for the analytical data presented in Table I is given in Figure 11: the upper half is the complete correlation matrix, and the lower half consists of only those correlations greater than or equal to 0.7. Examination of thi[...] highly-correlated species [...] correlated species contains sodium, potassium, calcium, magnesium, and chloride ion. Another prominent group contains arsenic, antimony, cadmium, and copper. And finally, there is a small group consisting of hydrogen and nitrate ions. Another striking feature is the non-correlation of hydrogen and sulfate ion concentrations, as mentioned earlier.

A simplified way of looking at the correlation matrix is a hierarchical "similarity" dendrogram. The technique of deriving this representation involves converting the correlations to "distance" representations by subtracting the absolute value of the correlation coefficients from 1.0, then subjecting the resulting matrix to hierarchical Q-mode clustering, as described by Kowalski and Bender (5). The resulting dendrogram is given in Figure 12. In this dendrogram (as suggested by the correlation matrix), three main groups predominate. The largest group consists of sodium, magnesium, chloride, calcium, potassium, and ammonium ion: these species are probably associated with a sea-salt background. The second group consists of those species presumably emitted by the copper smelter mentioned earlier: copper, arsenic, antimony, and cadmium.
And the third grouping consists of hydrogen and nitrate ions. There remain only single-element "groups," showing little relationship to the other species: zinc, manganese, lead, and sulfate. The singularity of lead is relatively easy to explain on the basis of many small point sources and short "plumes" mentioned earlier. Zinc may be split between association with smelter-emitted elements and with sea salts, resulting in relatively low correlations to either group. The singularity of manganese and, especially, sulfate is difficult to explain. In the case of sulfate, it may be possible that reduced emissions of sulfur dioxide from the smelter (from levels noted in earlier samplings, where there were much higher correlations between hydrogen
In Chemometrics: Theory and Application; Kowalski, B.; ACS Symposium Series; American Chemical Society: Washington, DC, 1977.
Table I. Raw Data
[Tabulated values are not legible in the source.]
Values given as ppb of species listed. Volumes in milliliters per bucket. Refer to Figure 1 for station locations.
and sulfate ions) tended to give greater importance to other sources of sulfate concentrations in the sampling area.

Eigenanalysis. Diagonalization of the correlation matrix, $C$, results in an orthonormal set of eigenvectors, $E_j$, with their associated eigenvalues, $\lambda_j$ (where j = 1 to n, the dimension of the correlation matrix). Rearrangement of these eigenvectors in order of decreasing magnitude of their eigenvalues reflects the ordering of the amount of variance in the original data set spanned by the individual eigenvectors: the quotient of the eigenvalue (for a given eigenvector) divided by the sum of all the eigenvalues reflects the relative amount of "information" about the data matrix (variance) contained in that eigenvector.

The results of diagonalization and rearrangement of the 1975 data matrix are given in Table II and are summarized graphically in Figure 13. These histograms are drawn as follows: each histogram represents one eigenvector [...] ratio of its associated [...] eigenvector. Within each eigenvector histogram, the "percent information" represents the coefficient ("loading") of each species' concentration divided by the sum of the coefficients in that vector (times 100). Prior to diagonalization, concentrations of all species are normalized to a mean of zero and a standard deviation of 1.0, and the coefficients in the vectors (and histograms) are coefficients of the normalized concentrations.
The first eigenvector consists of significant contributions from all determined species, ranging from 12.3% information from sodium to 2.6% information from nitrate. While the largest contributions are from elements typically associated with "sea salts," contributions from other species are by no means minor. In fact, this particular vector is most highly correlated to the total cationic charge concentration. This particular vector spans 42.3% of the variance in the data set (see Table II), indicating that the largest principal component of this data set consists of total dissolved ion concentrations.

The second eigenvector, spanning 15.2% of the variance, consists mainly of elements probably emitted from the smelter in Tacoma: arsenic, copper, antimony, and cadmium. This vector also contains significant contributions from other species, probably a result of the constraints that a) the second eigenvector is perpendicular to the first, and b) the second eigenvector contains the maximum remaining variance. The third eigenvector spans 11.6% of the variance and is predominantly nitrate and hydrogen ion concentrations, with, again, some contributions from other species. That these three eigenvectors are the largest principal components can be more or less predicted by the three distinct groupings of intercorrelated species in the correlation matrix, for which interpretations have already been given.
Together, these three eigenvectors account for almost 70% of the variance in the data.
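The eigenanalysis steps described above (diagonalize, sort by decreasing eigenvalue, express each eigenvalue as a share of the total) can be sketched as follows. This is an illustrative Python/NumPy sketch, not the program used in the study; the function names and the 3 x 3 correlation matrix are hypothetical. The "percent information" helper uses absolute values of the loadings, which is an assumption, since the text only says the coefficient is divided by the sum of the coefficients.

```python
import numpy as np

def eigenanalysis(corr):
    """Return eigenvalues (descending), eigenvectors as rows, and percent variance."""
    values, vectors = np.linalg.eigh(corr)   # symmetric input: real eigenpairs, ascending order
    order = np.argsort(values)[::-1]         # indices for decreasing eigenvalue
    values = values[order]
    vectors = vectors[:, order].T            # row j is the eigenvector for values[j]
    percent_variance = 100.0 * values / values.sum()
    return values, vectors, percent_variance

def percent_information(vector):
    """Share of each loading within one vector (absolute values: an assumption)."""
    magnitudes = np.abs(vector)
    return 100.0 * magnitudes / magnitudes.sum()

# Hypothetical correlation matrix, not taken from the chapter's data.
C = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
values, vectors, pct = eigenanalysis(C)
info_first = percent_information(vectors[0])
```

Because the eigenvalues of a correlation matrix sum to the number of variables, the percentages can be checked by hand: with the sixteen species of this study, the leading eigenvalue of 6.769 in Table II corresponds to 100 x 6.769 / 16, or about 42.3% of the variance, which is the figure quoted for the first eigenvector.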
Table II. Eigenvectors*
[Tabulated loadings are not legible in the source. The legible leading entries give an eigenvalue of 6.76900 (VAR. 42.3) for eigenvector 1 and 2.42600 (VAR. 15.2) for eigenvector 2.]
* Eigenvectors are arranged vertically by relative magnitudes of their eigenvalues. Columns are loadings on species listed at the top. "VAR." refers to percent of total variance in that eigenvector (eigenvalue divided by sum of the eigenvalues).
The remaining eigenvectors probably do not represent anything meaningful in terms of explaining the influences on (or factors in) the observed data. Most of them are most likely due to sampling and/or analytical variations.

Varimax Rotation. A method for "cleaning up" the eigenvalue-eigenvector representation of the data is the varimax rotation. This procedure maximizes the variance in each vector, the effect of which is to decrease the number of variables with intermediate loadings and to increase the number of those with large and small loadings in each vector. This results in "cleaner" vectors containing most of their information in only a few variables. The varimax-rotated vectors (which will be hereinafter called "varivectors" for want of a better term) are given in Table III and histograms are given in Figure 14. It can be seen that the varivectors are indeed "cleaner." It can also be seen that [...] place, in terms of not only [...] per vector, but also the ordering of vectors (in terms of their "varivalues," a term which will be used to represent the length of a unit vector on the varivector, analogous to the term "eigenvalue"). The magnitude of the first two varivectors (sea salts and smelter elements) is approximately equal, in contrast to the eigenvectors, where the first eigenvector spans approximately three times the variance of the second.
But, as mentioned earlier, the first eigenvector includes contributions from all species and is that vector spanning the maximum amount of variance over the entire data set, while the corresponding second varivector includes large contributions from three species and small contributions from nine species (and it is not constrained to include the maximum variance over the data set). Similarly, the varivector spanning the greatest variance contains major contributions from only four species: arsenic, copper, antimony, and cadmium. And the third varivector consists almost exclusively of nitric acid. Almost all of the remaining varivectors consist of single variable loadings, leading to the assumption that the data set consists of three major principal components and several single-species factors. Figure 15 illustrates this point: the first fifteen eigenvectors are given in order (columns), and the varivectors are rearranged to correspond to the best fit with the eigenvector order. In this figure, the area of each box corresponds to the square of the loading (coefficient) for that variable in that particular vector.

A series of data were obtained by perturbing the original data with assumed analytical errors. These assumed errors are listed in Table IV, and are given as one standard deviation (relative, per cent).
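For readers who want to experiment with the varimax rotation discussed above, the following Python/NumPy sketch implements the widely used iterative form of Kaiser's varimax criterion based on a singular value decomposition. It is not the authors' program: the chapter does not say which algorithm was used or how the loadings were scaled before rotation, so the function name, the tolerance, and the small loading matrix are all assumptions made for illustration.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Rotate a (variables x factors) loading matrix toward the varimax criterion.

    Returns the rotated loadings and the orthogonal rotation matrix R.
    """
    A = np.asarray(loadings, dtype=float)
    k = A.shape[1]
    R = np.eye(k)                         # start from the unrotated solution
    crit_old = 0.0
    for _ in range(max_iter):
        L = A @ R
        # Gradient-like target: cubed loadings minus each column scaled by
        # its mean squared loading.
        G = A.T @ (L**3 - L * (L**2).mean(axis=0))
        U, s, Vt = np.linalg.svd(G)
        R = U @ Vt                        # nearest orthogonal matrix to G
        crit = s.sum()
        if crit - crit_old < tol * crit:  # stop when the criterion stalls
            break
        crit_old = crit
    return A @ R, R

# Hypothetical loadings: 4 variables on 2 unrotated factors.
A = np.array([[0.7,  0.5],
              [0.6,  0.6],
              [0.5, -0.6],
              [0.6, -0.5]])
rotated, R = varimax(A)
```

Because `R` is orthogonal, the sum of squared loadings for each variable is unchanged by the rotation; only the way that sum is divided among the vectors changes, which is what produces the "cleaner" vectors.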
Table III. Varimax-rotated Vectors* (rows and columns as defined for Table II)
[Tabulated loadings are not legible in the source.]
* "Varivalue" = length of basis vector, analogous to "eigenvalue." "VAR." = percent of total variance contained in that varivector (see Table II).
Table IV. Assumed Analytical Errors

Species    Error (%)    Species    Error (%)
H+         6.7          Pb         2.6
NH4+       6.7          SO4=       20
Na         3.4          Mn         2.6
K          3.4          Cd         2.6
Ca         3.4          As         4.0
Mg         6.7          Sb         4.7
Zn         3.4          Cl-        10
Cu         2.6          NO3-       6.7
Using these error values, a random-number generating program was used to generate eighteen new sets of data from the original data. From each set of these error-perturbed data, correlation matrices and eigenvector and varivector representations were derived. After matching the corresponding vectors ([...] produced some mixing of the eigenvalues and varivalues), means and standard deviations were calculated for the loadings within each vector. These data are also plotted in Figure 15: the vertical and horizontal bars inside the boxes represent the means plus and minus one standard deviation, respectively, of the loadings.

What is demonstrated is that the loadings for at least the first three eigenvectors are more or less invariant to the expected analytical errors. After the first three eigenvectors, uncertainty in the loadings becomes significant. Hence, interpretation based on analysis of the eigenvectors is more or less limited to the first three eigenvectors as factors or representations of factors. On the other hand, the identity of the varivectors remains essentially unchanged by the introduction of analytical error (even though some reshuffling of varivalues did take place, necessitating considerable resorting to match vectors with the original set). The reason for this is that the varimax rotation maximizes the variance in each vector into as few variables as possible, and minimizes the loadings on the rest of the variables.
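The perturbation experiment described above can be imitated with a short Python/NumPy sketch. Everything specific here is an assumption made for illustration: the stand-in data are random numbers, only three species are used (with relative errors of 3.4, 6.7, and 20 per cent borrowed from Table IV), the noise is taken to be Gaussian, and the vector matching is reduced to aligning the sign of the leading eigenvector with that of the unperturbed data.

```python
import numpy as np

def perturb(data, rel_sd_percent, rng):
    """Add Gaussian noise whose standard deviation is a percent of each value."""
    data = np.asarray(data, dtype=float)
    rel_sd = np.asarray(rel_sd_percent, dtype=float) / 100.0   # one value per column
    return data * (1.0 + rel_sd * rng.standard_normal(data.shape))

def leading_eigenvector(data, reference=None):
    """Eigenvector of the correlation matrix with the largest eigenvalue."""
    corr = np.corrcoef(data, rowvar=False)
    _, vectors = np.linalg.eigh(corr)        # eigenvalues come back in ascending order
    v = vectors[:, -1]
    if reference is not None and v @ reference < 0:
        v = -v                               # resolve the arbitrary sign
    return v

rng = np.random.default_rng(0)
original = rng.lognormal(mean=3.0, sigma=0.5, size=(22, 3))   # stand-in data, 22 sites
errors = [3.4, 6.7, 20.0]                                     # percent, one per species
reference = leading_eigenvector(original)

# Eighteen error-perturbed replicates, as in the text.
replicates = np.array([
    leading_eigenvector(perturb(original, errors, rng), reference)
    for _ in range(18)
])
mean_loading = replicates.mean(axis=0)
sd_loading = replicates.std(axis=0, ddof=1)
```

The arrays `mean_loading` and `sd_loading` play the role of the bars drawn inside the boxes of Figure 15; a full reproduction would repeat this for every eigenvector and varivector and would need a more careful matching step.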
And since all of the varivectors after the first two contain high loadings for only one or two variables, the errors in the remaining variables become small by comparison and contribute little to variability in loading in the one or two major variables (in contrast to the situation observed in the set of eigenvectors). Consider, for example, the fifth eigenvector and the fifth varivector: for the non-error-perturbed data, 39% and 78%, respectively, of the variance in these vectors is sulfate concentration. For the eighteen sets of error-perturbed data, the sulfate loadings in the fifth eigenvector vary from 20% to 53% of the total loading, with a mean of 34% and a standard deviation of 13.6%. For the corresponding fifth varivector, the sulfate loadings range from 68% to 89% of the total loadings, with a
5.
KNUDSON ET AL.
Rain Chemistry
mean of 79% and a standard deviation of 4.7%. Hence, the constraints on the eigenvector representation (orthogonality, and maximum variance over the data set), and the constraints on the varimax rotation (not necessarily orthogonal, and maximizing the variance within each vector) produce considerably different results when error perturbations are introduced: the former results in considerable "noise" in the loadings of vectors having smaller eigenvalues, while the latter seems to be relatively insensitive to analytical error.

Features. Principal component maps over the sampling area can be constructed via the Karhunen-Loève transform. This transform consists of multiplying the normalized data matrix by the transpose of the eigenvector matrix. The result is a "feature" matrix of dimensions m by n, where m is the number of sites and n is the number of variables determined. Put more simply, the first feature value for the first site is the sum of the products of the normalized observed values at the first site and the corresponding loadings in the first eigenvector. The second feature value for the first site is the sum of the products of the normalized observed values at the first site and the corresponding loadings in the second eigenvector. And so on. These features can each then be mapped onto the sampling area to obtain a representation of the geographic variation of each feature, by using the same mapping program as previously described. A similar set of features, which shall be called "varimax features" (for lack of any other known terminology), may be obtained by multiplying the data matrix by the transpose of the varivector matrix.
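The feature calculation just described is a single matrix product, sketched below in Python with NumPy. The function name and the toy data are hypothetical, and the eigenvector matrix is stored with one eigenvector per row (the layout of Table II), which is why its transpose appears in the product.

```python
import numpy as np

def kl_features(data):
    """Karhunen-Loeve features: normalized data times the transposed eigenvector matrix."""
    data = np.asarray(data, dtype=float)
    dn = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)   # normalized data, m x n
    values, vectors = np.linalg.eigh(np.corrcoef(data, rowvar=False))
    order = np.argsort(values)[::-1]
    E = vectors[:, order].T            # rows are eigenvectors, largest eigenvalue first
    features = dn @ E.T                # m sites x n features
    return features, E, values[order], dn

# Hypothetical concentrations: 5 sites (rows) x 3 species (columns).
toy = np.array([[1.0, 2.1, 9.0],
                [2.0, 3.9, 7.5],
                [3.0, 6.2, 8.1],
                [4.0, 8.0, 3.0],
                [5.0, 9.8, 4.2]])
features, E, values, dn = kl_features(toy)

# First feature at the first site, written out as the sum of products.
first_value = float(np.sum(dn[0] * E[0]))
```

Each column of `features` can then be handed to a contouring routine, one map per feature. Substituting a matrix of varimax-rotated vectors for `E` would give the corresponding "varimax features" under the same layout assumption.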
The results of these operations, the Karhunen-Loève features and the varimax features, are given in Tables V and VI. Maps of the first three Karhunen-Loève features and the first three varimax features are given in Figures 16 through 21. It can be seen that there is very little qualitative difference between the maps for the first Karhunen-Loève feature and the second varimax feature. This is not surprising, considering the correlation between the first eigenvector and the second varivector (0.966). Both maps can be compared to the volume map, Figure 22, which shows roughly opposite behavior. The correlations between volume and the Karhunen-Loève and varimax features are -0.52 and -0.49, respectively. This result fits well with the observation by many investigators that the concentrations of most species are negatively correlated to rain volume (see reference 1).

A comparison of the maps for the second Karhunen-Loève feature and the first varimax feature, however, shows some striking differences. Recalling that the vectors which give these features represent primarily smelter elements, it might at first glance appear that the second Karhunen-Loève feature is a better picture of the smelter "plume." But two things should be considered: first, the wind was not blowing in the direction indicated on the
Table V. Karhunen-Loève Features*
[Tabulated values are not legible in the source.]
* KLn = Karhunen-Loève feature "n," where n signifies the number (see Table II) of the eigenvector from which that feature is derived.
Table VI. Varimax Features*
[Tabulated values are not legible in the source.]
* Varimax feature "n," where n signifies the varivector (see Table III) from which that feature is derived.
map; and second, the pleasing appearance is aided by the fact that the mapping program fits contour lines in such a manner that contour lines which are diverging at the edge of the data space continue to diverge to the edge of the map space. Of major importance is the fact that the second eigenvector is constrained to be orthogonal to the first, whereas the varivectors have no such constraint. Hence, the map of the first varimax feature is surely a better representation of the smelter "plume" than the map of the second Karhunen-Loève feature. Further lending credence to this assumption is a comparison of these maps with the maps of arsenic, antimony, and copper, Figures 2, 3, and 4.

A similar situation can be noted in comparison of the maps for the third Karhunen-Loève and varimax features, both of which consist primarily of nitric acid. The map of the third Karhunen-Loève feature is somewhat "spotty" (due, perhaps in part, to the orthogonality constraint), whereas the map of the third varimax feature, while similar [...] concentration of both hydrogen [...]. These areas of increased nitric acid concentration occur downwind of heavily urbanized and industrialized areas of the Puget Sound region.

Maps of the remaining features are not included, since, as has been noted already, the vectors from which these features are derived consist mainly of single elements, and are not readily identifiable as real factors in the data set.

Summary and Conclusions.
In summary, it has been shown that by analyzing for a variety of species in rain samples gathered over a sufficiently large geographical area, a grasp of some of the influences on atmospheric chemistry, as reflected in precipitation, is obtainable. Analysis of the data appears to show at least three major sources of dissolved species in rainwater in the Puget Sound region: a sea-salt background, urban sources, and an industrial source of major significance. Mediating these sources are such variables as time/speed/direction wind patterns and time/volume precipitation patterns, which in turn are probably influenced by geographic features and by the aforementioned sources themselves (e.g., inputs of heat, particulates, and other chemical entities into the atmosphere).

Varimax rotation of the eigenvector matrix derived from the correlation matrix has been demonstrated to more accurately represent the major factors in the chemical composition of the rain samples studied. Further, these studies have revealed the varimax-rotated vector representation to be considerably more stable to the effects of analytical errors than the eigenvector representation.

This appears to have been the first application of both performing a rotation analogous to the Karhunen-Loève transform by using the varivector matrix, and of mapping the resulting features over a geographical data space. The application of these techniques to similar problems may reasonably be expected to be a useful tool in understanding some of the complex interactions encountered in similar (e.g., environmental) research in the future. Although the data gathered in this study were limited in number, they were nonetheless sufficient to demonstrate the utility of the approach for studying sources and influences on atmospheric chemical composition.
Figure 1. Sampling sites. Φ = sampling site.
Figure 2. Map of arsenic concentrations. Contour levels: 1 —• 0.6-1.4 ppb; 2 — 1.43.4 ppb; 3 = 3.4r-8.2 ppb; 4 — 8.2-19.8 ppb; 5 — 19.8-47 ppb. 0 — sampling site.
In Chemometrics: Theory and Application; Kowalski, B.; ACS Symposium Series; American Chemical Society: Washington, DC, 1977.
5. KNUDSON ET AL. Rain Chemistry
Figure 3. Map of antimony concentrations. Contour levels: 1 = 0.0-0.07 ppb; 2 = 0.07-0.16 ppb; 3 = 0.16-0.38 ppb; 4 = 0.38-0.85 ppb; 5 = 0.85-2.1 ppb. 0 = sampling site.
CHEMOMETRICS: THEORY AND APPLICATION
Figure 4. Map of copper concentrations. Contour levels: 1 = 2-5.4 ppb; 2 = 5.4-15 ppb; 3 = 15-40 ppb; 4 = 40-108 ppb; 5 = 108-290 ppb. 0 = sampling site.
Figure 5. Map of cadmium concentrations. Contour levels: 1 = 0.0-0.04 ppb; 2 = 0.04-0.09 ppb; 3 = 0.09-0.20 ppb; 4 = 0.20-0.43 ppb; 5 = 0.43-0.92 ppb. 0 = sampling site.
Figure 6. Map of pH. Contour levels: 1 = 4.59-4.75; 2 = 4.43-4.59; 3 = 4.27-4.43; 4 = 4.11-4.27; 5 = 3.95-4.11. 0 = sampling site.
Figure 7. Map of sulfate concentrations. Contour levels: 1 = 1100-1450 ppb; 2 = 1450-1900 ppb; 3 = 1900-2500 ppb; 4 = 2500-3200 ppb; 5 = 3200-4300 ppb. 0 = sampling site.
3333 33 3333/3333333 3333 3333 3 333333333333333333333333333 3 i J*3 i _33333U|*W3M3333333313333 33 333#»33 33333333333333*333*333 33 333 33 333 333 333 33333331 1333*33?333·>333\3333333ΐ3 333 33 3 j l f 33 33 33 3333 33 3 3333 333 3 333 33 3 33 3 3 J 3 333 333 33 333 3 3 131 1333333333333313333333f3313Jèé*73333333333333333333333333333333333333333333333333) •ffe333333333#3333333/33/3333333333333333333333332333333333333333333333333333333* A3J3333^333333Λ3Ϊ|0«33 3ΐ3333333 333333 333333 33333 333 3 33 3 3 3333333333333 3 333 3 3 3 3 33! 43|33333l33 33333W3l»333/3 33 33333333333333333333333333333333J333333 3333333333 3331 33331333333333333333 ~T)3 33 3333 33 3333 333333 3 333 32333333333 333333 3333333333333 33i 3333^33333333333333 33 33 3 333 3 3 3333333 333 3333 33333 333333333 333 33 3 333 33 3333 3331 33333X33333333333333 3 3 33 3333333333333333 «33 3 3 333 31.3333 33333333 333 333 33 33 3 3? J 33333%3333?3?333333 33 33333333 3333333 3 33 3333333333 3?333 333333333333 33 33333 3 31 "3333313333333333333J u33333333333333333333333331333*333?33r33 3333333333333 3331 333333X33333333333"' 33333*3333333 333 333333333313$3333333333 3333 3333 333 333)03i 3 "3333331333333333" 3333 33333333333333 333333 3 3Ï3 Ϊ33333333333333333333333333333* 133^333333 43333333331, J3333333/333333333333333333333333333333333*33333333333333333333i33331 1333333333"' 3333 333/333333333333333333333333333333333333332 333333333333333333 3331 1333333333 33333133333)33333*3333333333333333333?33333333333333333333333333332i 23333333332. 333 3313333 3 333333 33 3333 3333 33 333333 3 3333 33333333 3333333333333333)332 13333333333) 3 333313333 333 33 3 33 3333333333333 3333 V333 3 3 3333 33 3333 3 333 3 33333333333.» 1 333333333) 333331333333333333333333333333333333333333333?3333333333333333333321 12 3A33333 . 
Figure 8. Map of nitrate concentrations. Contour levels: 1 = 170-250 ppb; 2 = 250-380 ppb; 3 = 380-560 ppb; 4 = 560-840 ppb; 5 = 840-1250 ppb. 0 = sampling site.
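The contour levels in the caption amount to a lookup from a concentration to a class number. A minimal sketch of that lookup is given here in Python. The function name `contour_level` and the handling of values that fall exactly on a boundary (assigned to the upper class, except at the top edge) are assumptions for illustration; the caption does not say how boundary values were treated. The class edges themselves are taken from the caption, and each is roughly 1.5 times the one before it.

```python
from bisect import bisect_right

# Class boundaries in ppb, copied from the Figure 8 caption.
NITRATE_EDGES = [170, 250, 380, 560, 840, 1250]


def contour_level(value, edges=NITRATE_EDGES):
    """Return the 1-based contour level for a concentration, or None if it
    lies outside the mapped range.

    Assumption: a value exactly on an interior boundary goes to the upper
    class; the caption does not specify this.
    """
    if value < edges[0] or value > edges[-1]:
        return None
    # bisect_right counts the edges <= value; the cap keeps the top edge
    # (1250 ppb) inside the last class instead of creating a sixth one.
    return min(bisect_right(edges, value), len(edges) - 1)
```

Passing a different `edges` list gives the same lookup for the other maps, for example the lead levels of Figure 10.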
In Chemometrics: Theory and Application; Kowalski, B.; ACS Symposium Series; American Chemical Society: Washington, DC, 1977.
Figure 10. Map of lead concentrations. Contour levels: 1 = 3-14 ppb; 2 = 14-25 ppb; 3 = 25-36 ppb; 4 = 36-47 ppb; 5 = 47-58 ppb. 0 = sampling site.
Figure 11. Correlation matrix. Values in lower left half are correlations > 0.7.

Figure 12. Hierarchical dendrogram. "Similarity values" are absolute values of correlations or averaged correlations.
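The caption's definition of similarity can be illustrated with a small agglomerative clustering sketch. It assumes that "averaged correlations" means the mean absolute correlation over all pairs of variables drawn from the two clusters being compared (average linkage); the source does not state the exact averaging rule, and the function name `cluster_by_correlation` is hypothetical.

```python
def cluster_by_correlation(labels, corr):
    """Agglomerative clustering with similarity = mean |r| between clusters.

    labels: list of variable names.
    corr:   square list of lists of correlation coefficients.
    Returns the merges as (members_a, members_b, similarity), in merge order.
    """
    clusters = [(i,) for i in range(len(labels))]
    merges = []

    def similarity(a, b):
        # Averaged absolute correlation over all cross-cluster pairs.
        pairs = [abs(corr[i][j]) for i in a for j in b]
        return sum(pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the most similar pair among the current clusters.
        sim, a, b = max(
            ((similarity(a, b), a, b)
             for k, a in enumerate(clusters)
             for b in clusters[k + 1:]),
            key=lambda t: t[0],
        )
        # Replace the two clusters by their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((tuple(labels[i] for i in a),
                       tuple(labels[i] for i in b),
                       sim))
    return merges
```

Each returned similarity is the height at which the corresponding branch would join in a dendrogram. Because absolute values are used, strongly negative correlations cluster together just as strongly positive ones do.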
Figure 13. Histograms of eigenvectors. EVn = Eigenvector "n," where "n" is the magnitude-rank of the eigenvalue (see Table II). "Percent information" is equivalent to the loading for a particular species divided by the sum of the loadings in that eigenvector.
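The "percent information" defined in the caption is a simple normalization, sketched below. The sketch assumes that absolute values of the loadings are used, so that negative loadings contribute positive percentages; the caption does not say whether absolute or squared loadings were intended. The function name `percent_information` is hypothetical.

```python
def percent_information(loadings):
    """Return {species: percent} for one eigenvector or varivector.

    loadings: dict mapping species name to its loading in the vector.
    Assumption: percent = 100 * |loading| / sum of |loadings|.
    """
    total = sum(abs(v) for v in loadings.values())
    if total == 0:
        raise ValueError("all loadings are zero")
    return {name: 100.0 * abs(v) / total for name, v in loadings.items()}
```

By construction the percentages for one vector sum to 100, which is what allows the bars of a single histogram to be compared directly.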
Figure 14. Histograms of varivectors. VMn = Varimax vector "n." Interpretation similar to Figure 13.
Figure 15. Comparison of eigenvectors and varivectors, including error data. En = Eigenvector "n," where n is the magnitude-rank of the eigenvector (see Table II). Vn = Varivector "n," where n is the magnitude-rank of the varivector (see Table III). Note: Varivectors are not ordered by varivalue ranking in this figure; rather, they are ordered by the closest correspondence to the eigenvectors in terms of the "identity" of the vectors, based on their relative loadings.
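One way to make the "closest correspondence" ordering concrete is sketched below. The caption does not say how the correspondence was judged, and it may well have been done by inspection. The sketch assumes the absolute cosine between loading vectors as the measure of "identity" and pairs the vectors greedily. Both choices, and the names `cosine` and `match_vectors`, are illustrative assumptions and not the authors' procedure.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two loading vectors (nonzero vectors only)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def match_vectors(eigvecs, varvecs):
    """Greedily pair eigenvectors with varivectors by |cosine| similarity.

    Returns {eigenvector_index: varivector_index}. The sign is ignored
    because a vector and its negative describe the same axis.
    """
    scored = sorted(
        ((abs(cosine(e, v)), i, j)
         for i, e in enumerate(eigvecs)
         for j, v in enumerate(varvecs)),
        reverse=True,
    )
    pairing, used = {}, set()
    for _score, i, j in scored:
        # Accept the best remaining pair whose members are both unmatched.
        if i not in pairing and j not in used:
            pairing[i] = j
            used.add(j)
    return pairing
```

Greedy pairing is not guaranteed to maximize the total similarity; an optimal assignment method could be substituted if that matters.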
Figure 15 (continued)
Figure 16. Map of first Karhunen-Loève feature.
Figure 17. Map of second varimax feature. Contour levels: 1 = -0.631 to -0.303; 2 = -0.303 to 0.025; 3 = 0.025 to 0.353; 4 = 0.353 to 0.682; 5 = 0.682 to 1.010. 0 = sampling site.
In Chemometrics: Theory and Application; Kowalski, B.; ACS Symposium Series; American Chemical Society: Washington, DC, 1977.
5. KNUDSON ET AL.
111
Rain Chemistry
222 22222ZÉ22Z?ZZZZZ222Zh?^ZZ i 2*~'Λζζζ MZ7ZZZZZZ 22222ΖΖΛΖ ΖΖΖΖΖΖΖΖ 12222-, ζζζζζζζ*\ΖΖΖ2ΖΖ2Ζζ1ζΖΖΖΖΖΖ?Ζ?2222??π u&z\zYzYzzYzXz~z~z~2ZZZ ζζζζζζζ] •^-zzzkzzizzzr ZZZZZZZJ ζζΎ^ζ^ζζ ζζζ ζζζζζζζζζζζζζ^ 222222 Will*' \Z2ZZZZ1 ΗζΖΖΖ\ζΥΖ'ΖΖΖ?Λ ïûHllïïiîzWzt&zzzfâ YZZZZ
5
izz^zzzzÀfzWMVam Ζ2Ζ\
u c «- ι
CC.CC c
tZZZZZZZZZZcl ζζζζζζζζζζζζ* 222222222221
.. . . . .
ΖΖ.Ζψ 122? Zccc22iZ22222Q222222 2222? 2222?222222222ΖΖΖΖΖΖΖ2222 i121ΎΖ222Ζ21Ζ\ l l ΛζζζΛ 2?22i2? ZZ ZZZZZZZJ? zzzzzz η 22222222 ΙΖΖΖΖΖΖΖΖΖΖ2 ZZZZZZZZjZZZk ZZZZZZZZZZZZZ? ZZZZZ 22? ??:???? ZZZZ? ZZZZZ Ζ 2 ZZZZZZZZ izzzzzzzzzt - l lll^llll^llizzzzzz?zzzzzzzzz2zz?.???z?z?z?zzzzzzzzzz?zzzzzzzzzz\ iZZZZZZZZZZll?22222φ2222Ζ2 „l J^lUUUU l 22 222222222222222222222222 2?2???2k2k2kc2cU ΙΖΖίίΖΖΪΖΖΖΖΛ ZZZ22j\?222?Z2ZZZZZZZZZZZ?ZZZZZZZZZZ?.22222222??22222222222222 uzzzzzzzzzz, IZZZkZZZZZ'\zzzz?z\zzzzzzz^ " ZZZZZZZZj izzzz l l l l l l l i 222 122ZZ \?2222Ζφ2ΖΖ??.Ζ ZZZZZZZZZZZZZ ZZZZZZZZk???ZZZZZZZZ?ZZZZ?ΖΖΖΖΖΖΖ 1111 lllllllll 1 2 < *zzzz ?ZZZZ&2222222Z2J&lZZZ22?2??2222 I J l U i l J **^**^***ρ2222222222222222?Ζ222222??222222Ζ2Ζ22ΖΖΖΖ22222ΖΖ21 Ullli 1111llifll11111 .1111 IJni l l l l l l l ΖΖΖΖΖΖΖ ZZZZ? 22 ^? 2222 22222222222222222222 ? ι i n Mi m i l m i rΖΖΖΖΖΖΖ ~2ZlZ2ZZ2222Z2U2222222Z22Z?2LZ'clZZ222222222222222\ Jil iiftfiiiiniiii \Z2ZZZZZZZZZZZZ?.22?222222222ZZZZZZZZZZZZ22222222\ 02222222 It 2222222222222222222ZZZZZZZZZZZZZZZZZZ21 ???22>0|αΐ11111111/ I? 22222.2? 2222 2222222 ~"222 222222??2222222222222222222222222ZZZ% l111111 l l i1 ^22222222?2?2222222222222222222222?21 fill ll,l ll ll111 222?2222222??2222222222222222222222\ _ 2 Ζ ? ?. Z\ Ζ ? ? 2?V Ι Γ ^ " ^ .1111111111 l l l f l ^ l VZ2222?22222?222ZZ2222222ZZZ22ZZZZZ* 22?????2Z?Z??tZktZZkiiZZZZZZ2Z?2Zz\ ΖΖ??ΖΖΥΖΖΖΖΖΖ*%1Ϊ1Ί\1 1222222222?22222222 ZZZ2222ZZ221 2???22?*4ΛΛΧ41$ l lUi ll ll lllll ll ll Cl' /, 2Z2222? ~"'222222222222221 33333 ?222i 2222ZZZ2 III 1111 11111? ΖΖΖΖΖΖΖΖΖ S i l l 22 |22 Z2ZZZZZ2ZZ?} 22?2?21?2 U? 221 "2ZZ2£&2M2Z2 _ £22222T~, N i l ! l l33333333333333333333333 _ 22|f22?Mp2222l ll!lliUiiiii Hi m i » VitZ?i<^,,^ί^ 22?!?2?22 ΖΖΖΖΖΖΖΖ __Γ???^2?*222222%22222222 ΐΓ*222ΐ222£222£222ΐ222222 333 33 m333333233333 33333333333333333333333333 33* *222Z2 - -*2 2 2 2 l ? 
Figure 18. Map of second Karhunen-Loève feature. Contour levels: 1 = -0.559 to -0.217; 2 = -0.217 to 0.125; 3 = 0.125 to 0.466; 4 = 0.466 to 0.808; 5 = 0.808 to 1.150. 0 = sampling site.
In Chemometrics: Theory and Application; Kowalski, B.; ACS Symposium Series; American Chemical Society: Washington, DC, 1977.
Figure 19. Map of first varimax feature. Contour levels: 1 = -0.453 to -0.012; 2 = -0.012 to 0.428; 3 = 0.428 to 0.869; 4 = 0.869 to 1.309; 5 = 1.309 to 1.750. 0 = sampling site.
Figure 20. Map of third Karhunen-Loève feature. Contour levels: 1 = -0.417 to -0.178; 2 = -0.178 to 0.061; 3 = 0.061 to 0.299; 4 = 0.299 to 0.538; 5 = 0.538 to 0.777. 0 = sampling site.
Figure 21. Map of third varimax feature. Contour levels: 1 = -0.452 to -0.218; 2 = -0.218 to 0.016; 3 = 0.016 to 0.249; 4 = 0.249 to 0.483; 5 = 0.483 to 0.717. 0 = sampling site.
Figure 22. Map of rain volume. Contour levels: 1 = 30-84 ml per bucket; 2 = 84-137 ml per bucket; 3 = 137-191 ml per bucket; 4 = 191-244 ml per bucket; 5 = 244-298 ml per bucket.
Literature Cited
1. C. E. Junge, "Air Chemistry and Radioactivity," Academic Press, New York, 1963.
2. "ARTHUR" is an integrated package of computer programs designed for pattern recognition and factor analysis. It contains, at present, approximately thirty interactive programs. "ARTHUR" was written by D. L. Duewer, J. R. Koskinen, and B. R. Kowalski, and is available from B. R. Kowalski, Laboratory for Chemometrics, Department of Chemistry BG-10, University of Washington, Seattle, Washington 98195.
3. T. V. Larson, R. J. Charlson, E. J. Knudson, G. D. Christian, and H. Harrison, "The Influence of a Sulfur Dioxide Point Source on the Rain Chemistry of a Single Storm in the Puget Sound Region," Water, Air, and Soil Pollution 4, 319 (1975).
4. E. J. Knudson, thesis, University of Washington, 1976.
5. B. R. Kowalski and [...], "[...] Linear and Nonlinear Methods for Displaying Chemical Data," J. Am. Chem. Soc. 95, 686 (1973).
6. D. L. Duewer, B. R. Kowalski, and J. L. Fasching, "Improving the Reliability of Factor Analysis of Chemical Data by Utilizing the Measured Analytical Uncertainty," Anal. Chem. 48, 2002 (1976).
7. P. Horst, "Factor Analysis of Data Matrices," Holt, Rinehart, and Winston, New York, 1965.
8. H. F. Kaiser, "The Varimax Criterion for Analytical Rotation in Factor Analysis," Psychometrika 23, 187 (1958).
9. Y. T. Chien and K. S. Fu, "On the Generalized Karhunen-Loève Expansion," IEEE Trans. Information Theory 13, 518 (1967).
6

Analysis of the Electron Spin Resonance of Spin Labels Using Chemometric Methods

JAMES R. KOSKINEN* and BRUCE R. KOWALSKI
Laboratory for Chemometrics, Department of Chemistry, University of Washington, Seattle, WA 98195
*Present address: Ford Motor Company Scientific Research Laboratory, Box 2053, Dearborn, Michigan 48121.

Spin labels are being used to study membrane systems and biological [...]. The spin labeling technique involves incorporating a nitroxide free radical (the spin label) into a membrane system and studying the free radical using electron spin resonance (ESR) spectrometry. Lipid spin labels that are diffused into a membrane orient themselves in a specific configuration and undergo anisotropic molecular motion. When this motion is rapid on the ESR time scale, the ESR spectra that are observed can be correlated with the structure of the membrane.

Molecules have been constructed so that the long axis of the molecule is parallel to one of the principal axes of the nitroxide. Anisotropic motion about the long axis of the molecule corresponds to rotation about one of the principal axes of the nitroxide. The ESR spectra of this type of molecule in a well-defined inclusion crystal have been studied and synthesized in order to better understand the membrane spin labeling experiments (3, 4, 5).

Studies using spin labels involve a considerable effort for the chemist in the collection and analysis of the spectra. In the past, spectra were collected as two-dimensional plots on a piece of paper and the useful information was extracted from the plots using a ruler and a pencil. With the introduction of laboratory computers this task has been made much easier (6). Spectra are now collected by computer-controlled spectrometers and are saved in computer-compatible format (i.e., on paper tape, magnetic tape, or disks). This use of computers also allowed some simple data analysis techniques to be performed on the spectra. These techniques included base-line corrections and spectral smoothing.

Computers have made it relatively easy to collect and store all the ESR spectra for a particular study. This paper will present examples of the use of computers to aid the chemist in the analysis of ESR spectra. The first application will involve the use of Chemometric methods to study two spin labels in different inclusion crystals. This application will demonstrate the general usefulness of chemometrics in analyzing ESR spectra. The second application will concern spin labels in a model membrane system.

Methodology

The data analysis methods used in this paper come under the general heading of Chemometrics (7, 8). The methods used are the ones that will extract features, determine the importance of the extracted features in modeling a property of the spin label, and finally, display the results.

All the spectra used in this study were collected under computer control and stored in digital form as 980 data points. The 980 data points can be used as features that describe each spectrum. However, such a large number of features can present difficulties for some data analysis methods. A method that reduces the number of features describing the spectra without losing chemically useful information is clearly needed. The method of choice in this study is the Fourier transform. Fourier transform methods have been used quite extensively in other forms of spectroscopy for a variety of purposes (9).
The effect of the Fourier transform is to condense the information of the total ESR spectrum into the low-frequency end of the transformed spectrum. Figure 1 shows a typical ESR spectrum, the real and imaginary parts of the Fourier transform of the spectrum, and the power spectrum. The low-frequency end of the transformed spectrum contains all the information needed to reconstruct the original spectrum via the inverse or back transform. This process is graphically presented in Figure 2. The first 64 points of the transformed spectrum are retained while the rest of the points are set to zero. The inverse transform returns the original spectrum, showing that no useful information is lost. The spectrum resulting from the inverse transform appears to be smoother than the original spectrum because the high-frequency noise has been digitally filtered by the transform. By using the Fourier transform, 64 features that completely describe the spectrum have been generated out of a spectrum of 980 data points. Once the features have been generated in this manner, the other Chemometric methods can be applied.

Figure 1. Typical ESR spectrum and its Fourier transform

Two statistical methods are used to determine the importance of the generated features in modeling a property of the spin label. The property of interest in the first application is the temperature of the spin label, and the property of interest in the second application is the amount of spin label present. The first method calculates the correlation between the generated features and the property. The second method is stepwise regression analysis, which determines which of the features does the best job of modeling the property with a linear model. Plots of the generated features vs. the property are also constructed as part of the analysis. All of the methods described are part of the ARTHUR pattern recognition system (10), which was used in this study.
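The two ranking methods can be illustrated with a minimal sketch. It is not the ARTHUR code, whose routines are not reproduced here. The function names `pearson` and `forward_stepwise` and the three made-up feature columns are assumptions for illustration only, and the stepwise routine is a plain forward selection without the significance tests that a full stepwise regression would use to admit or drop features.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def forward_stepwise(columns, y, n_select):
    """Greedy forward selection: at each step add the feature that most
    reduces the residual sum of squares of a linear model with intercept."""
    def center(v):
        m = sum(v) / len(v)
        return [a - m for a in v]

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    resid = center(y)  # centering the data accounts for the intercept
    cands = {j: center(col) for j, col in enumerate(columns)}
    chosen = []
    for _ in range(n_select):
        # Adding candidate q lowers the residual sum of squares by (q.r)^2 / (q.q).
        gains = {j: dot(q, resid) ** 2 / dot(q, q)
                 for j, q in cands.items() if dot(q, q) > 1e-12}
        if not gains:
            break
        best = max(gains, key=gains.get)
        q = cands.pop(best)
        qq = dot(q, q)
        coef = dot(q, resid) / qq
        resid = [r - coef * a for r, a in zip(resid, q)]
        # Remove the chosen direction from every remaining candidate.
        for j, v in cands.items():
            proj = dot(v, q) / qq
            cands[j] = [a - proj * b for a, b in zip(v, q)]
        chosen.append(best)
    return chosen

# Made-up data: eight "temperatures" and three hypothetical feature columns.
temps = [-80.0, -60.0, -40.0, -20.0, 0.0, 20.0, 40.0, 60.0]
f0 = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
f1 = [0.02 * t + 0.3 * s for t, s in zip(temps, f0)]
f2 = [(t / 100.0) ** 2 for t in temps]
columns = [f0, f1, f2]

ranked = sorted(range(len(columns)),
                key=lambda j: abs(pearson(columns[j], temps)), reverse=True)
selected = forward_stepwise(columns, temps, 2)
print("ranked by |correlation|:", ranked)
print("forward stepwise order:", selected)
```

With these made-up columns both methods pick column 1 first, but they disagree on the second choice: column 2 has the second-largest correlation with the property, whereas column 0 explains more of what column 1 leaves unmodeled. This is the reason for running both methods rather than the correlation ranking alone.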
Figure 2. Graphical demonstration that only the first 64 points of the Fourier transform of an ESR spectrum are needed to regenerate the spectrum from the transform

Spin Labels in Inclusion Crystals

The first system studied using the methodology described above consisted of 3-doxyl-5α-cholestane (I) (the 4',4'-dimethyloxazolidine-N-oxyl derivative of 3-keto-5α-cholestane) in an inclusion crystal of thiourea. The question to be answered in this study is: can the temperature of the inclusion crystal system be correlated with the ESR spectrum? The data set contains 16 ESR spectra of the spin label inclusion crystal system corresponding to a range of temperatures from -82.0°C to 59.2°C. Table I lists the data analysis steps taken to analyze this series of spectra. Feature number four, generated using the Fourier transform, is found to be the most important feature in modeling the temperature of the system. Figure 3 shows a plot of the temperature vs. feature four.

Structure I
Table I. Steps Taken in Data Analysis

I    Collect Spectra
II   Generate Features Using Fourier Transform
     A  Zero Fill Spectra to 1024 Points (Requirement of Fast Fourier Transform)
     B  Perform Fast Fourier Transform
     C  Select the First 64 Coefficients of the Real Part of the Transform
III  Calculate Correlation between the 64 Features and Property
IV   Perform Stepwise Regression Analysis of the 64 Features
V    Generate Plots of Features Selected in Steps III and IV vs. the Property
VI   Analyze Result
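Step II of Table I, together with the truncation test of Figure 2, can be sketched as follows. This is an illustration under stated assumptions, not the authors' program. The three-line test spectrum is synthetic, the function names are hypothetical, and the reconstruction assumes that the conjugate-symmetric partners of the 64 retained coefficients are kept so that the result is real. A direct sum is used to keep the sketch dependency-free; in practice a library fast Fourier transform (for example `numpy.fft.rfft`) would replace it.

```python
import cmath
import math

N_FFT = 1024   # transform length after zero-filling (step II.A)
N_KEEP = 64    # number of low-frequency coefficients retained (step II.C)

def low_frequency_dft(spectrum, n_fft=N_FFT, n_keep=N_KEEP):
    """Zero-fill the spectrum and return its first n_keep complex DFT coefficients."""
    if len(spectrum) > n_fft:
        raise ValueError("spectrum is longer than the transform length")
    padded = list(spectrum) + [0.0] * (n_fft - len(spectrum))
    return [sum(x * cmath.exp(-2j * math.pi * k * i / n_fft)
                for i, x in enumerate(padded))
            for k in range(n_keep)]

def fourier_features(spectrum):
    """Real parts of the retained coefficients, used as the 64 features."""
    return [c.real for c in low_frequency_dft(spectrum)]

def reconstruct(coeffs, n_points, n_fft=N_FFT):
    """Inverse transform with all coefficients beyond those given set to zero.
    Assumes a real signal, so each retained coefficient is paired with its
    complex conjugate at the matching negative frequency."""
    out = []
    for i in range(n_points):
        total = coeffs[0].real
        for k in range(1, len(coeffs)):
            total += 2.0 * (coeffs[k] * cmath.exp(2j * math.pi * k * i / n_fft)).real
        out.append(total / n_fft)
    return out

def line(i, center, width):
    """Hypothetical first-derivative line shape (synthetic, not measured ESR data)."""
    u = (i - center) / width
    return -u * math.exp(-u * u)

# Synthetic 980-point, three-line spectrum plus a point-to-point ripple
# standing in for high-frequency noise.
clean = [line(i, 300, 40.0) + line(i, 490, 40.0) + line(i, 680, 40.0)
         for i in range(980)]
noisy = [c + 0.01 * (-1) ** i for i, c in enumerate(clean)]

features = fourier_features(noisy)
smoothed = reconstruct(low_frequency_dft(noisy), len(noisy))
worst = max(abs(s - c) for s, c in zip(smoothed, clean))
print(len(features), "features; largest deviation from the noise-free spectrum:", worst)
```

For this broad-lined synthetic spectrum the back transform follows the noise-free curve more closely than the noisy input does, which mirrors the smoothing described for Figure 2. Narrower lines would need more than 64 coefficients, so the cutoff should be checked against the spectra at hand.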
Ideally, Figure 3 should show a straight line indicating that feature four is linearly related to the temperature. The scatter of points about the line can be interpreted as meaning that the anisotropic motion of the molecule is somewhat restricted. The shorter steps between feature four values at the high-temperature end indicate that the rotation about the long axis of the molecule is being optimized.

Figure 3. Plot of the Fourier transform generated feature number four (the ordinate) vs. the temperature of the system (the abscissa) in sample one (axis labels: FEATURE FOUR; -82.0°C to 59.2°C)
The second system studied consisted of the spin label lauryl nitroxide (II) (2,2,6,6-tetramethyl-4-piperidinol-1-oxyl dodecanoate) in an inclusion crystal of β-cyclodextrin. The data set for this system contains 20 ESR spectra of the spin label in the inclusion crystal corresponding to a range of temperatures from -196°C to 63°C. Table I again lists the steps taken in the analysis of these spectra.

Fourier transform feature number three is shown by the stepwise regression analysis to be the most important feature in modeling the temperature of the system. Once again a linear plot of the feature and the temperature is expected. Figure 4 shows the actual plot, which appears to be linear from the low-temperature end (-196°C) to a temperature of about 35°C. Above that temperature the value of the feature does not get any larger; it remains nearly constant from about 35°C to 63°C. In the low-temperature region, the ESR spectrum approaches the rigid glass limit. As the temperature increases, the molecule starts to rotate more freely about its long axis. At approximately 35°C the rotation about the nitroxide principal x-axis is fast enough on the ESR time scale that the y and z contributions are averaged out. A further increase in temperature has no additional effect on the anisotropic motion.

It is interesting to compare both spin labels in their rigid matrices. The lauryl nitroxide is able to rotate quite freely and reaches an optimum value. The 3-doxyl-5α-cholestane is not able to rotate as freely as the lauryl nitroxide and appears not to reach an optimum value. This difference in rotation can be accounted for by the structure of the molecules. Lauryl nitroxide is a long, cylindrical-shaped molecule, while the 3-doxyl-5α-cholestane is a rectangular-shaped molecule. It is easier for the cylindrical molecule to rotate about its long axis in a cavity in a matrix than it is for the rectangular-shaped molecule.

Now that the Chemometric methods have been shown to be useful in the study of spin labels in well-defined inclusion crystals, the methods can be used in the study of spin labels in a model membrane system. The last part of this paper will deal with the application of Chemometric methods to the study of such a model membrane system.
Figure 4. Plot of the Fourier transform generated feature number three (the ordinate) vs. the temperature of the system (the abscissa) in sample two (axis labels: FEATURE THREE; -196.0°C to 63.0°C)

Spin Labels in Model Membrane Systems
The model membrane system studied is the cytochrome oxidase protein containing spin-labeled phospholipids. The spin label used is 16-doxylstearic acid (III) (the 4',4'-dimethyloxazolidine-N-oxyl derivative of 16-ketostearic acid). Figure 5 shows the spectra of representative samples of the cytochrome oxidase protein with different concentrations of phospholipids. The amount of lipid in each sample is expressed as the ratio of mg of phospholipid per mg of protein. The sample in Figure 5a has a ratio
Structure III
Figure 6. Graphical presentation of the generation of a composite ESR spectrum by using scaled amounts of Fourier transform of two ESR spectra
of 0.10; Figure 5b corresponds to a ratio of 0.24; and Figure 5c is 0.73. The ESR spectrum of the sample with the lowest lipid content (Figure 5a) is characteristic of strong immobilization of the spin labels, while the spectrum of the sample with the highest lipid content (Figure 5c) is characteristic of a more mobile spin label (11). The question to be answered in this experiment is: is it possible to quantify the amount of each kind of spin label in a composite system such as the one shown in Figure 5b? The data set includes eight spectra of samples of varying amounts of the protein and spin-labeled phospholipid. The feature generation methodology used is the same as described in the previous examples. The property of interest in this example is the amount of immobilized lipid present in the model membrane system. By using stepwise regression analysis it is possible to arrive at an equation to calculate the amount of the lipid present. By using this equation, it is possible to look at the Fourier transform of an ESR spectrum [...] the amount of immobilized [...] the validity of this equation is to synthesize an ESR spectrum for the series studied using spectra of the immobilized spin label and the mobile spin label. Since the equation was developed using the Fourier transforms of the ESR spectra, they will be used in place of the spectra.
The Fourier transform of the immobilized spin label spectrum (Figure 5a) is multiplied by the calculated scale factor, and the result is added to the Fourier transform of the mobile spin label scaled by its calculated factor. Then the inverse transform is applied to this composite to give the spectrum. In this case Figure 5b is the spectrum that is being synthesized. The scale factor for the immobilized spin label is 0.24, and the factor for the mobile spin label is 0.76. This process is shown graphically in Figure 6. The resultant synthetic spectrum appears smoother than the experimental spectrum because the high-frequency noise has been digitally filtered. In this application Chemometric methods were used to show that certain ESR spectra of a model membrane system are a composite of spectra of an immobilized spin label and a mobile spin label.

Conclusions

Chemometric methods have been used to analyze experimental ESR spectra. The methods have provided additional insight into the processes involved in putting a spin label into an inclusion crystal. They have also been used to examine the ESR spectra resulting from spin labels dispersed in a model membrane system. Chemometrics does provide a powerful tool to aid the chemist in the analysis of ESR spectra of spin labels in model membrane systems.
Acknowledgments

We wish to acknowledge Drs. Patricia Jost and O. Hayes Griffith for kindly providing us with the ESR spectra used in this study. We also wish to acknowledge the financial support of the Office of Naval Research under Contract No. N00014-75-C-0536.

Literature Cited
1. McConnell, H. M. and McFarland, B. F., Quart. Rev. Biophys., (1970) 3, 91-136.
2. Jost, P., Waggoner, A. S., and Griffith, O. H., in "Structure and Function of Biological Membranes," Rothfield, L. (Ed.), pp. 84-144, Academic Press, New York, 1971.
3. Birrell, G. B., Van, S. P., and Griffith, O. H., J. Amer. Chem. Soc., (1973) 95, 2451.
4. Birrell, G. B., Griffith, O. H., and French, D., J. Amer. Chem. Soc., (1973
5. Griffith, O. H., J
6. Klopfenstein, C. E., Jost, P., and Griffith, O. H., in "Computers in Chemical and Biochemical Research," Klopfenstein, C. E. and Wilkins, C. L. (Eds.), pp. 175-221, Academic Press, New York, 1972.
7. Kowalski, B. R., Anal. Chem., (1975) 47, 1152A.
8. Kowalski, B. R., J. Chem. Infor. & Compt. Sci., (1975) 15, 201.
9. Marshall, A. G. and Comisarow, M. B., Anal. Chem., (1975) 47, 491A.
10. Duewer, D. L., Koskinen, J. R., and Kowalski, B. R., "ARTHUR," available from B. R. Kowalski, Laboratory for Chemometrics, Department of Chemistry, University of Washington, Seattle, Washington 98195.
11. Jost, P. C., Griffith, O. H., Capaldi, R., and Vanderkooi, G., Proc. Nat. Acad. Sci. USA, (1973) 70 (2), 480-484.
7

Automatic Elucidation of Reaction Mechanisms in Stirred-Pool Controlled-Potential Chronocoulometry

LOUIS MEITES and GEORGE A. SHIA
Department of Chemistry, Clarkson College of Technology, Potsdam, NY 13676
During a symposium [...] chemical data, it is appropriate to inquire [...] techniques are important, and what effects their adoption may eventually have on the chemist's work and thought. They are important for several different reasons. They can facilitate calculations and interpretations that may involve many steps and, with numerical data, tedious graphical or other analysis. By doing so they can save much of the chemist's time and energy. At the same time they can yield more reliable results because they substitute the dumb patience and objectivity of a machine for the human frailty and occasional unconscious prejudice of the chemist. This makes it possible to find correlations or interpretations that the chemist might miss, and to achieve greater depth and certainty in the final result. They can influence the design of experiments in several ways: by making it possible to find a data-acquisition schedule that stresses the regions of greatest importance to the desired result and enables the experimenter to ignore others of lesser importance, by making it possible to obtain the desired result from a simple experiment and thus obviating the necessity of performing a more complicated one that would yield the same information in a form more easily amenable to older techniques of data analysis, and even by employing the data so efficiently that one experiment can be made to yield certainty as great as could have been obtained from three or four with the aid of the older techniques. Examples of all of these are already in the literature, and new ones continue to appear.
It is already evident that the growing adoption of these techniques is substantially easing the tedium of experimentation and interpretation while improving the accuracy and reliability of the values deduced from the data. As it relieves chemists of unnecessary burdens, it will have longer-range effects on chemical education, which now of
necessity devotes time to the teaching and learning of techniques of data analysis that might better be spent on such things as the design of experiments and the interpretation of results. In the course of making these potentialities manifest, some bastions of impossibility have already crumbled, as may be shown by several examples from the field of potentiometry and potentiometric titration. It has always been known that analyses by direct potentiometry cannot be made as accurately and precisely as analyses by potentiometric titration, but Brand and Rechnitz (1) and Isbell, Pecsok, Davies, and Purnell (2) have shown that this is not so. It has always been known that acetic acid (pKa = 4.755) and propionic acid (pKa = 4.876) have strengths too nearly identical to permit identifying, much less determining, both from the acid-base titration curve for a mixture, but Ingman et al. (3) have shown that this is not so. It has always been known that a potentiometric acid-base titration cannot succeed if there is no point of [...] if the concentration of [...] Meites (4) and Barry, Campbell, and Meites (5) have shown that these are not so. These are only a few of the many instances in which it is now apparent that familiar experiments have always provided us with information that we have not known how to obtain in useful form. Probably there are very few chemists who would not readily concede the effectiveness of computerized procedures in evaluating numerical parameters on the basis of numerical data.
Applications and examples like the ones just cited are therefore comparatively easy to understand and accept. However, there are many fewer chemists who are prepared to accept the idea that accurate and reliable qualitative decisions can be made by machines without human intervention. In human terms it is of course understandable why this should be so. The chemist faced with having to decide whether a compound contains, say, a phenyl group on the basis of its mass spectrum is sure to be aware, at some level, of the years of training and experience that he brings to that decision, and of all the subtleties and pitfalls that it may involve. It is no easy thing to admit that one's knowledge, understanding, and insight cannot produce decisions superior to those made by a mindless machine - or, more properly, that that knowledge, understanding, and insight can be reduced to a set of completely predetermined steps that will produce as good a decision as the one that could be made by a human brain. Despite this difficulty, it is clear that progress in this area offers prospects having overwhelming importance. Though the portion of chemical research that deals with the evaluation of numerical parameters can certainly be significant and challenging, the portion that deals with qualitative interpretation is even more so. In experimental kinetics, for example, the real problem is usually to decide what the mechanism of a reaction is. After this has been solved, more or less severe experimental
difficulties may still have to be overcome in evaluating the rate and equilibrium constants associated with the reaction, but often the values of these will be implicit in the data accumulated in elucidating the mechanism.

What is involved in making such decisions? There are two possibilities. One is that the system being studied belongs to one of a limited number of classes whose behaviors are accurately known. The other is that it does not, and instead behaves in some novel way that no prior interpretation will suffice to describe. In the first case the interpretation may be said to be routine. Of course the decisions that are involved in assigning the system to the proper class may be both subtle and difficult. Calling them routine merely means that similar decisions have been made before for other systems, that the principles underlying those decisions are known, and that it should therefore be possible to effect them in ways that a computer can be programmed to execute. [...] make it possible to decide [...] any known class. Beyond [...] point interpretation [...] require imaginativeness and intuition, and these cannot be programmed in advance. Except in situations so simple that all the possible classes are already known, every program designed to make such interpretations must provide for a call for human intervention when all of the known possibilities have been tested and found to be inadequate.
The nature of research in any area would be profoundly changed by the availability of a program that would effect routine interpretation and that would call for human help when this did not suffice. Chemists working in that area would be relieved of the necessity of undertaking such interpretation themselves - of retracing on each problem the thoughts they had had while solving the one before it. In losing this burden they would gain the opportunity to spend more of their time in breaking new ground, in improving the excellence of their experimental procedures and measurements, and in selecting the systems that would best repay study. Investigations that turned out to be routine would be greatly facilitated, and those that did not could receive the benefit of the human imagination thus liberated.

This is the rationale of a program of research that began several years ago at Clarkson College of Technology. Near its start we identified the various kinds of classifications that might arise (6), and these are listed in Table 1.

Table 1. Kinds of Classifications

1. Binary
   A. Simple
   B. Multiple
2. Linear
3. Branched
It is always supposed that data are available showing how the value of some dependent variable changes as that of some independent one is altered, and that the problem is to account for the data by selecting the appropriate one of a finite set of hypotheses, each of which can be expressed by one or more equations relating the dependent and independent variables and involving numerical parameters as well.

A simple binary classification is one that requires only a single "yes-no" decision. A compound may or may not contain a particular kind or group of atoms; it may or may not have a particular kind of biological activity; if subjected to polarographic examination it may or may not be reversibly reduced. Much of the research to date on pattern recognition in chemistry has dealt with such questions. It may or may not be possible to express one or both of the alternatives by an equation relating the independent and dependent variables: our work has concentrated on cases in which this is possible, while [...] pattern recognition has concentrated [...] questions share the property [...] considered are both exhaustive and mutually exclusive: if either can be rejected the other must be right. For the kinds of data considered here, there may be two different possible equations of which the data must conform to one, or there may be only one equation and the question may be whether the data conform to that equation or not.
In either event it is relatively simple to express the reliability of the classification in the classical statistical fashion: that is, by stating the level of confidence at which it can be upheld.

Multiple binary classifications are those that require two or more independent "yes-no" decisions. A not very complicated one is shown in Fig. 1. This arose in potentiometric titrations of sodium or potassium laurate with hydrochloric acid (7). Depending on the concentrations of laurate and hydrogen ions in these two solutions, and also on the concentration of any alkali metal salt (such as sodium or potassium chloride) that was added, any of three different and independent phases might have separated during some portion of the titration. Micelles of laurate ion might or might not have been present in the initial solution, and both an "acid soap" and the free fatty acid might or might not have precipitated. Eight different kinds of titration curves can result, although not all can be obtained with laurate and sodium or potassium ions in aqueous solutions at 25°. A program was constructed to effect this triple binary classification, beginning with exactly the same information - the compositions of the reacting solutions and the coordinates of the points on the experimental titration curve - that would be available to a human chemist. When it was applied to about 30 experimental curves secured under widely varying conditions, its error rate was approximately 5%.
Two skilled chemists using the same information managed, with some difficulty, to achieve an error rate of about 35%.
[Flowchart: A MULTIPLE BINARY CLASSIFICATION. Potentiometric titration of an alkali-metal laurate ML with HCl. Successive yes/no questions: Were micelles of laurate present initially? Did the "acid soap" MHL2 precipitate during the titration? Did the free fatty acid HL precipitate? The branches end in numbered outcomes.]
Figure 1. Triple binary classification of the potentiometric titration curve obtained in a titration of laurate ion with a strong acid. The purpose of the classification is to reveal which, if any, separate phases were present at any stage of the titration.
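The structure of a triple binary classification can be illustrated by a short sketch. The following Python fragment is only an illustration and is not the program described in the text: it enumerates the eight combinations of answers to the three independent yes/no questions of Figure 1. The names `QUESTIONS` and `enumerate_classes`, and the order in which the combinations are listed, are assumptions made here; they do not reproduce the numbering used in Figure 1.

```python
from itertools import product

# The three independent yes/no questions of the titration example.
# The wording is abbreviated; the tuple name is hypothetical.
QUESTIONS = (
    "micelles of laurate present initially",
    "acid soap MHL2 precipitated",
    "free fatty acid HL precipitated",
)


def enumerate_classes():
    """Return every combination of answers as a dict mapping question -> bool.

    Three independent binary decisions give 2**3 = 8 possible classes.
    The ordering produced here is arbitrary (an assumption of this sketch).
    """
    return [dict(zip(QUESTIONS, answers))
            for answers in product((True, False), repeat=len(QUESTIONS))]


if __name__ == "__main__":
    for number, outcome in enumerate(enumerate_classes(), start=1):
        phases = [q for q, present in outcome.items() if present]
        print(number, phases if phases else "no separate phase")
```

A classifier for this problem must in effect choose one of these eight dictionaries for each experimental curve; the sketch shows only the size and shape of the decision space, not how the choice is made.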
Linear classifications may be defined as those in which the equations that correspond to the successive hypotheses can be arranged a priori in a logical order that is also the order of increasing complexity of the equations. Figure 2 shows the simplest example in the literature (6). In some cases the number of possible hypotheses may be very large, but in others, including this one, it may be very small. The literature contains no example of any tetrafunctional base whose dissociation constants are all so close together that successive steps could not be perceived on visual inspection of the titration curve, and therefore it was decided that no hypothesis beyond the tetrafunctional one would be allowed. Together with the fact that the successive decisions were based on an estimated standard deviation of measurement, this decision made it possible to close the loop and ensure the eventual acceptance of one of the permissible hypotheses. Programs that effect linear classifications are easier to design than those for multiple binary classifications, and are
[Flowchart: A LINEAR CLASSIFICATION. Evaluation of the functionality of a monomeric base from potentiometric acid-base titration data. Successive questions: Is the base monofunctional? Is it difunctional? ... Is it tetrafunctional? Terminal boxes: "Print the conclusion and the corresponding values of c and the K_j" and "Revise the estimated value of the standard error of measurement".]
Figure 2. Linear classification of the potentiometric titration curve obtained in a titration of a weak base with a strong acid. It is assumed that no phase separation occurs during the titration. The purpose of the classification is to reveal the number of protons consumed by each ion or molecule of the base during the titration.
likely to be at least as reliable, if not more so. The one represented by Fig. 2 was tested with a great many synthetic and experimental data for bases like acetate, succinate, and citrate, and could not be made to fail when it was provided with an honest estimate of the standard error of measurement. If it was given an estimate that was much too large, it wielded Occam's Razor with ruthless skill, exactly as a human chemist would, accepting the simplest hypothesis that could not be disproved. Occasionally, when it was given an estimate that was nearly a full order of magnitude too small, it did conclude that citrate ion was
tetrafunctional; human chemists are often similarly misled by data that are less precise than they are thought to be.

Very much the most complicated kind of classification is the branched classification, in which the rejection of one hypothesis must be followed by a test of another one, and in which there is no logically necessary order in which the tests must be arranged. This paper describes the structure of the first program designed to effect a branched classification.

The Chemical Problem

Controlled-potential electrolysis has been studied intensively in several laboratories since about 1955 with a view to employing it for the elucidation of the mechanisms of electrochemical processes and for the evaluation of the rate and equilibrium constants for the individual steps in these processes. Roughly 25 different mechanisms [...] invented a priori or studied [...] consequences that could be observed by other electrochemical techniques; most were devised in order to account for the phenomena observed in studying real systems, both organic and inorganic; a few were investigated briefly so that they could be ruled out as being unable to explain those phenomena. There are several thorough reviews (8-10) stressing diagnostic criteria and ways of differentiating among the various mechanisms, and only a very brief summary will be undertaken here.

Controlled-potential electrolyses are performed in cells fitted with three electrodes.
One, the "working electrode," is the electrode at which the half-reaction of interest occurs. Often it is a large pool of mercury, though platinum gauze and other materials can also be used. The solution surrounding the working electrode is efficiently stirred to facilitate mass transfer of the reactant from the bulk of the solution to the surface of the electrode and to maintain a constant value of the mass-transfer coefficient $s$. An applied potential is imposed across the working electrode and an "auxiliary electrode," which is usually isolated in a separate compartment of the cell and serves merely to permit the flow of an electric current through the cell and working electrode. The zero-current potential $E_{w.e.}$ of the working electrode is sensed by comparing it with the potential of a "reference electrode," such as a saturated calomel or silver-silver chloride electrode, through which current does not flow. Altering the applied potential causes the value of $E_{w.e.}$ to vary. By means of a potentiostat, the applied potential is continuously adjusted so as to keep $E_{w.e.}$ equal to some predetermined value, which is usually so chosen that ions or molecules of the reactant A are reduced as rapidly as they are brought to the surface of the working electrode by convection and diffusion. Of course oxidation is also possible, but only reduction will be alluded to here.
The simplest possibility is that A is reduced to a stable product P in a single step and without any side reaction or other coupled chemical process. This is described by the equations (11)

$$\mathrm{A} + n\,e \rightarrow \mathrm{P} \tag{1}$$

$$i = i^{0} e^{-st} \tag{2a}$$

$$Q = Q_{\infty}\left(1 - e^{-st}\right) \tag{2b}$$

where $n$ is the number of faradays consumed in reducing each mole of A, $i$ is the current that flows through the cell and working electrode $t$ s (seconds) after the electrolysis has begun, $Q$ is the quantity of electricity (coulombs or faradays) that has flowed up to that instant, $i^{0}$ is the initial value of $i$, and $Q_{\infty}$ is the total quantity of electricity that will flow if the electrolysis is indefinitely prolonged. The value of $Q_{\infty}$ in faradays is given in this case by

$$Q_{\infty} = n N_{A}^{0} \tag{3}$$

where $N_{A}^{0}$ is the number of moles of A present in the initial solution.

Different mechanisms yield different results. It is possible for A to undergo reductions along parallel but independent paths to yield different products:
$$\mathrm{A} + n_{1} e = \mathrm{P}; \qquad \mathrm{A} + n_{2} e = \mathrm{Q} \tag{4}$$

Then the current will decay exponentially with time, as it does in the simple case, but the apparent value of $n$ defined by the equation

$$n_{\mathrm{app}} = Q_{\infty} / N_{A}^{0} \tag{5}$$
u
e
t
o
t
n
e
c
1=
where i p i s the t o t a l c u r r e n t a t t = 0. c i t y i s given by
(6) The q u a n t i t y o f e l e c t r i
a - a . α - β-· *) + if,..t
In Chemometrics: Theory and Application; Kowalski, B.; ACS Symposium Series; American Chemical Society: Washington, DC, 1977.
(?)
7. METTES AND SHIA
135
Controlled-Potetitial Chronocotdometry
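As a rough numerical illustration of the charge-time relations for the simple mechanism and for the case with a continuous faradaic background current, the following Python sketch evaluates Q = Q_inf (1 - exp(-s t)) and the same expression with an added term i_fc t, together with the apparent n defined as Q_inf divided by the initial number of moles of A. It is a sketch under stated assumptions and not the authors' program: the function names and the sample values are hypothetical, and the two charge functions assume consistent units (Q in coulombs, i_fc in amperes, s in reciprocal seconds, t in seconds), while the apparent-n function assumes Q_inf expressed in faradays.

```python
import math


def charge_simple(t, q_inf, s):
    """Charge passed at time t for the one-step reduction A + n e -> P.

    Implements Q = Q_inf * (1 - exp(-s * t)), equation (2b).
    """
    return q_inf * (1.0 - math.exp(-s * t))


def charge_with_background(t, q_inf, s, i_fc):
    """Charge passed when a constant background current i_fc also flows.

    Implements Q = Q_inf * (1 - exp(-s * t)) + i_fc * t, equation (7).
    """
    return charge_simple(t, q_inf, s) + i_fc * t


def n_apparent(q_inf_faradays, moles_a):
    """Apparent number of faradays per mole of A, equation (5)."""
    return q_inf_faradays / moles_a


if __name__ == "__main__":
    # Hypothetical values chosen only to show the shape of the curves.
    q_inf, s, i_fc = 1.0, 0.01, 1.0e-4
    for t in (0.0, 100.0, 500.0, 1000.0):
        print(t, charge_simple(t, q_inf, s),
              charge_with_background(t, q_inf, s, i_fc))
```

With these illustrative values the simple curve levels off at Q_inf, whereas the curve with a background current keeps rising linearly at long times, which is the qualitative difference between the two mechanisms.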
Behavior very similar to this results from the mechanism

$$\mathrm{A} + n\,e = \mathrm{P}$$

$$\mathrm{P} + \mathrm{Z} = \mathrm{A} + \mathrm{Z}^{*} \tag{8}$$
where Z is some major constituent of the solution and Z* is its reduced form. Frequently Z is hydrogen ion and Z* is hydrogen gas. Again the current decays to a finite steady-state value, which depends on the rate constant for the chemical step in this mechanism and on the concentration of Z. At any instant the "regeneration current" is equal to the difference between the current that is actually observed and the one that would be observed if the chemical step did not occur. In contrast to a continuous faradaic background current, which has the same value throughout an electrolysis, the regeneration current is equal to zero at the start and increases [...] the electrolysis proceeds [...] Only by experimental measurements [...] precisely made could it be detected that equations (6) and (7) provided slightly imperfect fits, but because equation (7) overestimates the quantity of electricity consumed by regenerated A near the beginning of the electrolysis it yields a value of $Q_{\infty}$ that is slightly smaller than the one described by equation (3).

For some pairs of mechanisms the differences of behavior are prominent and unmistakable; for others they are subtle and difficult to detect. The aims of the present work are

1. to devise a program that will identify the mechanism of a process, using no more information than would be available to a research chemist studying the same process for the first time;
2. to study the nature and reliability of the classifications that such a program can make and, in doing so, to strive for additional insight into the thought processes of the human chemist attempting to make similar classifications; and
3. to establish some general principles and expectations that may facilitate the development of other branched classification programs in the future.

The Starting Point
The first necessity was that of deciding what data and information might be provided to the program. Either the current or the quantity of electricity might be measured as a function of time. Especially when the working electrode is a violently stirred mercury pool, it has long been known that the noise level of the measured current is rather high, typically of the order of ±10 per cent. This is due partly to transient fluctuations of the area of the pool, and partly to the inherent irreproducibility of the convective mass-transfer process. These momentary irregularities are smoothed out by current integration to such an
extent that even those employing equations describing the dependence of current on time have usually chosen to evaluate the current at time $t$ by employing the equation

$$i_t = \frac{Q_{t+\Delta t} - Q_{t-\Delta t}}{2\,\Delta t} \tag{9}$$

in preference to measuring that current directly. Direct-reading electromechanical current integrators with precisions as good as 0.1 per cent of full scale, or better, have been available for over 20 years, and can now be obtained with BCD outputs for direct data acquisition. Experimentation with measurements of current and current integrals convinced us that very much better fits to theoretical equations could be obtained for the latter. Consequently we decided to assume that $Q$-$t$ data would be available and that the absolute error in a measured value of $Q$ would be independent of [...]

In addition to the $Q$-$t$ curve, the chemist trying to elucidate the mechanism of a process would surely know several things about the experimental conditions, and this information is obtained in a short initial interactive dialogue. It may be said here that the program is written in the Digital Equipment Corporation's EduSystem 25 BASIC for execution on a PDP-8/I minicomputer that provided 4096 words of core memory as the user area available for this work. Partly because of the number of hypotheses that must be tested, and partly because lengthy print commands are included to provide detailed guidance in designing subsequent experiments, the entire length of the program considerably exceeds 4096 words. It is therefore constructed in several segments chained together, with data files on magnetic tape for transferring numerical values from one segment to another.
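Equation (9) is a central difference, and a minimal Python sketch of it may help make the indexing concrete. The function name `current_from_charge` and the equal-spacing check are illustrative assumptions, not part of the authors' BASIC program; the sample values are the first five pairs of Figure 3.

```python
def current_from_charge(t, q, k):
    """Estimate the current at t[k] by equation (9):
    i_t = (Q_{t+dt} - Q_{t-dt}) / (2 dt).

    t and q are equal-length sequences of times (s) and charges (mF).
    The name and the checks below are illustrative assumptions.
    """
    if k < 1 or k > len(t) - 2:
        raise IndexError("equation (9) needs one point on each side of t[k]")
    dt_before = t[k] - t[k - 1]
    dt_after = t[k + 1] - t[k]
    if abs(dt_after - dt_before) > 1e-9:
        raise ValueError("equation (9) assumes equal spacing around t[k]")
    return (q[k + 1] - q[k - 1]) / (2.0 * dt_after)


# First five (t, Q) pairs from the data statements of Figure 3.
t = [10, 20, 30, 40, 50]
q = [0.0957, 0.1805, 0.2605, 0.3291, 0.3947]
i_20 = current_from_charge(t, q, 1)  # current at t = 20 s, in mF/s
```

Because the difference spans two intervals, a random error in either value of $Q$ is divided by $2\,\Delta t$, which is one way of seeing why integrated data give smoother currents than direct readings.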
It is the first segment that contains the data statements and the initial dialogue. Typical examples of these are shown in Figs. 3 and 4. The data statements are arranged in the form $t_1, Q_1, t_2, Q_2, \ldots$. The initial dialogue obtains the number of millimoles of starting material in the original solution, the volume of that solution, and the expected value of $n$. Because we thought it important to enable the program to accept the simplest hypothesis consistent with the precision of the data, as a human chemist would naturally do, the user's estimate of the standard error in the measured values of $Q$ is obtained as well.
    700 DATA 10,.0957,20,.1805,30,.2605,40,.3291,50,.3947
    705 DATA 60,.4518,70,.5041,80,.5505,90,.5932,120,.6972
    710 DATA 150,.7755,180,.8371,210,.8791,240,.9099,270,.9320
    715 DATA 300,.9495,360,.9721,420,.9849,480,.9916,540,.9952
    720 DATA 600,.9972,660,.9980,720,1.0002,780,.9992,840,.9998
    725 DATA 900,1.0007

Figure 3. Typical data statements
    RUN
    A=STARTING MATERIAL; B=ELECTROACTIVE IMPURITY; I, J, K=INTERMEDIATES;
    P, Q, R=PRODUCTS; Y=ELECTROINACTIVE PRECURSOR OF A; Z=AN ELECTROINACTIVE
    SPECIES, OFTEN H-ION, WHOSE CONCENTRATION REMAINS CONSTANT DURING THE
    ELECTROLYSIS; Z*=THE REDUCED FORM OF Z
    MMOLES OF A TAKEN? 1.0014
    EXPECTED N-VALUE? 1
    CM^3 OF SOLN.? 52.0
    NO. OF DATA POINTS? 26
    UNITS OF Q(MEAS.): 0=COULOMBS, 1=MF? 1
    ESTD. STD. DEV. OF Q(MEAS.)? 1E-3
    IS I(F,C) KNOWN FOR THE BLANK SOLN.: 0=NO, 1=YES? 0
    DIAGNOSTIC PRINT WANTED: 0=NO, 1=YES? 1
    HOW MUCH: 1=FINAL, [...]? 10
Figure 4. Initial interactive dialogue for the data shown in Figure 2

The program reads the data, converts the values of $Q$ to millifaradays if they were provided in coulombs, and obtains a value of $i_{f,c}$ for the blank solution if this is known. There is no human intervention after this point.
The Body of the Program

Most current-time curves (on which certain types of behavior are easier to discern than they are on $Q$-$t$ curves) have the general shape shown in Fig. 5: the current decays smoothly and monotonically from its initial value to a very much smaller one which, as was mentioned above, may or may not be indistinguishable from zero. Curves of this sort belong to what is called the "main line" below. Two different kinds of curves are shown in Figs. 6 and 7. The one in Fig. 6 results from a mechanism like

$$A + 2e = P \ \text{(slow)} \qquad P + A \rightarrow 2I \qquad I + e = P \ \text{(fast)} \tag{10}$$

which has been observed in reductions of vanadium(IV) (=A) to vanadium(II) (=P): the starting material and product react in the bulk of the solution to produce vanadium(III) (=I), and one of the curiosities of the electrochemical behavior of vanadium is that the reduction of vanadium(III) is often much more nearly reversible than that of vanadium(IV),
Figure 5. Current-time curve for the controlled-potential reduction of cadmium(II) in 3M hydrochloric acid at a large stirred mercury-pool electrode maintained at a potential of -0.80 V vs. S.C.E. (Abscissa: time, s.)

Figure 6. Current-time curve for the controlled-potential reduction of vanadium(IV) in 3M hydrochloric acid at a large stirred mercury-pool electrode maintained at a potential of -0.85 V vs. S.C.E. (Abscissa: time, s.)

Figure 7. Current-time curve for the controlled-potential oxidation of vanadium(II) in 3.5M hydrochloric acid at a large stirred mercury-pool electrode, with a resistor connected in series with the cell to limit the current at the start of the electrolysis. After the potentiostat gained control the potential of the working electrode was maintained at -0.35 V vs. S.C.E. (Abscissa: time, s.)
The successive values of $Q$ and $t$ are differentiated numerically to obtain values of

$$i_j = \frac{\Delta Q}{\Delta t} = \frac{Q_{j+1} - Q_j}{t_{j+1} - t_j} \tag{11}$$
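The numerical differentiation of equation (11), together with the test for a maximum that the following paragraph describes, can be sketched in Python as below. The text says only that the later current must "considerably exceed" the first one, so the threshold `factor=1.5` is a purely hypothetical choice, as are the function names; the second data set is invented solely to show a curve that rises before it decays.

```python
def forward_currents(t, q):
    """Equation (11): i_j = (Q_{j+1} - Q_j) / (t_{j+1} - t_j)."""
    return [(q[j + 1] - q[j]) / (t[j + 1] - t[j]) for j in range(len(t) - 1)]


def has_current_maximum(t, q, factor=1.5):
    """Compare the current in the first interval with the current in the
    interval at whose end Q first exceeds half of its final value.

    'factor' is a hypothetical stand-in for "considerably exceeds".
    """
    i = forward_currents(t, q)
    half = 0.5 * q[-1]
    # Index of the first point at which Q exceeds half its final value.
    m = next(j for j, qj in enumerate(q) if qj > half)
    if m == 0:
        return False  # no interval ends at the very first point
    return i[m - 1] > factor * i[0]


# First ten (t, Q) pairs of Figure 3 plus the last one: a main-line curve.
t_main = [10, 20, 30, 40, 50, 60, 70, 80, 90, 900]
q_main = [.0957, .1805, .2605, .3291, .3947, .4518, .5041, .5505, .5932, 1.0007]

# Invented data whose current rises before it falls (illustration only).
t_peak = [10, 20, 30, 40, 50, 60, 70]
q_peak = [0.01, 0.03, 0.08, 0.20, 0.60, 0.90, 1.00]

main_has_max = has_current_maximum(t_main, q_main)
peak_has_max = has_current_maximum(t_peak, q_peak)
```

For the Figure 3 values the current in the interval ending at 70 s is smaller than the one in the first interval, so no maximum is reported and the data would pass on to the later tests.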
The value of $i_j$ for the first interval ($j = 1$) is stored, and another is computed for the interval at whose end $Q$ first exceeds one-half of its value at the very last point. If the latter current considerably exceeds the first one, it is concluded that there is a maximum on the current-time curve. Since the mechanism described by equations (10) is the only one yet devised that will account for a maximum, crude estimates of the appropriate parameters are derived from the data and stored in a data file, and a second segment of the program is called to effect a fit to the differential equation that describes this mechanism. An acceptable fit may or may not be obtained. If one is obtained, the values of the parameters are printed out, together with a message saying that only this one of the mechanisms in the program's library is capable of yielding a satisfactory fit. If an acceptable fit is not obtained, a message to that effect is printed out, along with a statement that human interpretation is essential. In either event execution stops at this point.

If the $i$-$t$ curve does not possess a maximum, the data are next searched for the characteristic features of Fig. 7. This is done by comparing the quantities of electricity accumulated during the first two intervals. If

$$Q_{t_2} > 2\,Q_{t_1} - 3\,s_Q \tag{12}$$

where $Q_{t_j}$ is the total quantity of electricity accumulated between $t = 0$ and $t = t_j$ and $s_Q$ is the estimated standard error of a single measurement of $Q$, it is concluded that the current-time curve is initially flat within the precision of the measurements. This sort of behavior has been observed in either of two circumstances: 1. when the initial current is controlled by an experimental artefact, such as a. a cell resistance so high that the maximum applied potential available from the potentiostat does not suffice to drive $E_{w.e.}$ to the desired value, or b. the presence of a separate solid or liquid phase suspended in the solution being electrolyzed, and replenishing the electroactive material dissolved in that solution as rapidly as it is removed by
reduction, 2. when the process follows an ECE mechanism

$$A + n_1 e = I \qquad I \rightarrow J \qquad J + n_2 e = P \tag{13}$$

and the rate constant for the intervening chemical step is about an order of magnitude larger than the overall rate constants for the electron-transfer steps. The first of these corresponds to the equations

$$Q = Q_{t^*}\,(t/t^*) \quad (t \le t^*) \qquad\qquad Q = Q_{t^*} + Q'\left[1 - e^{-s(t - t^*)}\right] \quad (t > t^*) \tag{14}$$
where $t^*$ is the time at which the current just ceases to be constant, $Q_{t^*}$ is the quantity of electricity accumulated up to that time and $Q'$ is the quantity of electricity that remains to be accumulated thereafter, and $s$ is the overall rate constant for the reduction. There appear to be four parameters, but since it can be shown that $Q' = Q_{t^*}/(s\,t^*)$ there are only three that are independent. Initial estimates of these three are made and control is transferred to a third segment of the program to effect a fit to these discontinuous equations. If an acceptable fit is not secured, the ECE mechanism is hypothesized and a fit is made to the differential equation that describes it. If this also does not yield an acceptable fit, the experimenter is advised to repeat the experiment under suitably modified conditions, for the failure of both these hypotheses must mean that an experimental artefact is superimposed on some other mechanism included in the main line, and this makes further analysis injudicious.

We turn now to the main line, which includes current-time curves that do not have maxima and that are concave upward from the very beginning of the electrolysis. The simplest hypothesis that can account for such a curve is the one described by equation (2b):

$$Q = Q_\infty\left(1 - e^{-st}\right) \tag{2b}$$
The basic program for effecting non-linear regression requires initial estimates of the values of $Q_\infty$ and $s$, and these are derived from the raw data. The value of $Q_\infty$ is estimated to be equal to that of $Q$ at the last experimental point. The data are then scanned to find the first point at which $Q/Q_\infty$ exceeds 0.865 ($= 1 - e^{-2}$), and the time $t'$ at this point is combined with the equation

$$s = 2/t' \tag{15}$$
to obtain an estimate of $s$. Regression onto equation (2b) is then effected, using the latest version of a general program (12) that was written locally and has been used for a large number of different purposes in our laboratories and elsewhere. When the fit is complete, the best values of the parameters are combined with equation (2b) to compute a value of $Q_{calc.}$ for comparison with the measured one, $Q_{meas.}$. If the difference $(Q_{meas.} - Q_{calc.})$ were plotted against the independent variable $t$, the result would be a "deviation plot." Earlier papers from this laboratory (12-15) have demonstrated the utility of deviation plots in testing hypotheses and guiding the selection of better ones, and since this is the basis of the decision mechanism in the present program a brief review of the principles underlying their use follows.

A deviation plot is a plot of $(y_{meas.} - y_{calc.})$ against $x$, where $x$ is the independent variable and $y$ the dependent one, $y_{meas.}$ is a measured value of $y$, and $y_{calc.}$ is the corresponding value of $y$ calculated from an assumed relation between $x$ and $y$ with the numerical values of the parameters that yield the best fit to that relation.

On the assumption that the experiment has been properly designed and is therefore free from systematic errors of measurement, the only errors that can arise if the assumed relation is correct will be random errors. The deviation plot will then consist of points randomly scattered around the x-axis.
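The initial estimates, the regression onto equation (2b), and the deviations can be sketched in Python as follows. This is not the authors' general regression program (12), whose algorithm is not described here; a plain Gauss-Newton iteration is swapped in as an assumption, and the function names are hypothetical. The data are the 26 pairs of the Figure 3 data statements.

```python
import math

# (t / s, Q / mF) values transcribed from the data statements of Figure 3.
T = [10, 20, 30, 40, 50, 60, 70, 80, 90, 120, 150, 180, 210, 240, 270,
     300, 360, 420, 480, 540, 600, 660, 720, 780, 840, 900]
Q = [.0957, .1805, .2605, .3291, .3947, .4518, .5041, .5505, .5932, .6972,
     .7755, .8371, .8791, .9099, .9320, .9495, .9721, .9849, .9916, .9952,
     .9972, .9980, 1.0002, .9992, .9998, 1.0007]


def model_2b(t, q_inf, s):
    """Equation (2b): Q = Q_inf * (1 - exp(-s t))."""
    return q_inf * (1.0 - math.exp(-s * t))


def initial_estimates(t, q):
    """Q_inf from the last point; s = 2/t' from equation (15), where t' is
    the first time at which Q/Q_inf exceeds 0.865 (= 1 - e^-2)."""
    q_inf = q[-1]
    for ti, qi in zip(t, q):
        if qi / q_inf > 0.865:
            return q_inf, 2.0 / ti
    raise ValueError("Q never exceeds 0.865 of its final value")


def fit_2b(t, q, q_inf, s, max_iter=50, tol=1e-12):
    """Unweighted least squares by Gauss-Newton (an assumed algorithm)."""
    for _ in range(max_iter):
        a11 = a12 = a22 = b1 = b2 = 0.0
        for ti, qi in zip(t, q):
            e = math.exp(-s * ti)
            r = qi - q_inf * (1.0 - e)   # residual
            j1 = 1.0 - e                 # dQ/dQ_inf
            j2 = q_inf * ti * e          # dQ/ds
            a11 += j1 * j1
            a12 += j1 * j2
            a22 += j2 * j2
            b1 += j1 * r
            b2 += j2 * r
        det = a11 * a22 - a12 * a12
        d_q = (b1 * a22 - a12 * b2) / det   # solve the 2x2 normal equations
        d_s = (a11 * b2 - a12 * b1) / det
        q_inf += d_q
        s += d_s
        if abs(d_q) < tol and abs(d_s) < tol:
            break
    return q_inf, s


q0, s0 = initial_estimates(T, Q)
q_fit, s_fit = fit_2b(T, Q, q0, s0)
deviations = [qi - model_2b(ti, q_fit, s_fit) for ti, qi in zip(T, Q)]
```

With these data the scan stops at $t' = 210$ s, so the starting value of $s$ is $2/210$ s⁻¹. The list `deviations` is what a deviation plot would display against $t$.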
If, however, the assumed relation is incorrect, its incorrectness represents a source of systematic error, and the deviations will no longer be perfectly random. They will instead scatter around a curve having a characteristic shape, which depends only on the natures of the equations that describe the assumed and true relations between $x$ and $y$. For example, Figure 8 shows the characteristic shape that results from assuming that the relation between $Q$ and $t$ is

$$Q = Q_\infty\left(1 - e^{-st}\right) \tag{2b}$$

when it is actually

$$Q = Q_{\infty,\mathrm{corr}}\left(1 - e^{-st}\right) + i_{f,c}\,t \tag{7}$$
This is a smooth curve because it was obtained with synthetic "data" computed from equation (7); we call such a smooth curve a "deviation pattern." The way in which its shape reflects the inherent natures of the equations may be deduced by studying the curves in Figure 9, from which Figure 8 may easily be derived. The shape of this curve is characteristic in the topological
Figure 8. Deviation pattern arising from the assumption that the dependence of Q on t is described by equation (2b) when it is actually described by equation (7). The ordinate of each point is the difference between the measured value of Q and the value calculated from equation (2b), using the values of $Q_\infty$ and $s$ that yield the best fit to that equation.

Figure 9. Rationale of the deviation pattern shown in Figure 8. The solid curve and the open circles represent the calculated dependence of Q on t according to equation (7) with $Q_{\infty,\mathrm{corr}} = 1$ mF, $s = 0.01$ s⁻¹, and $i_{f,c} = 5 \times 10^{-4}$ mF s⁻¹. The triangles show the best fit that can be obtained to equation (2b), which gives $Q_\infty = 1.236$ mF and $s = 7.78 \times 10^{-3}$ s⁻¹.
sense; its amplitude is altered by changing the value of $i_{f,c}$ (or $Q_{\infty,\mathrm{corr}}$), and its period is altered by changing the value of $s$, but the eye and brain are so constructed that such changes do not affect the recognizability of the pattern.

When random errors are also involved, as of course they always are in dealing with real data, the deviation plot consists of random errors superimposed on the pattern. If the random error is much smaller than the amplitude of the pattern, it is still easy to recognize the pattern on visual inspection, and it can even be distinguished from some generally similar one corresponding to another equation different from equation (7). With any given random error, decreasing the amplitude of the pattern (e.g., by decreasing the value of $i_{f,c}$ while holding the values of the other parameters constant) first makes it more difficult to distinguish the underlying pattern from another similar one. Decreasing the amplitude still further makes it difficult, and eventually impossible, to detect the existence of the pattern. This is a complicated way of saying that a systematic error may be too small to detect. In our experience the difficulty of perceiving a pattern visually becomes severe if the random error is as large as about half of the peak-to-peak amplitude of the deviation pattern on which it is superimposed.
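The deviation pattern discussed above can be tabulated directly from the two equations. The Python sketch below evaluates equation (7) minus equation (2b) with the parameter values quoted in the caption of Figure 9. The exponents of $i_{f,c}$ and of the fitted $s$ are not legible in the scanned caption, so $10^{-4}$ and $10^{-3}$ are assumptions here, as is the 10 to 500 s time range; the function names are hypothetical.

```python
import math

# Parameters of equation (7) as quoted for Figure 9 (exponent of I_FC assumed).
Q_CORR, S_TRUE, I_FC = 1.0, 0.01, 5e-4
# Best-fit parameters of equation (2b) as quoted (exponent of S_FIT assumed).
Q_FIT, S_FIT = 1.236, 7.78e-3


def q_eq7(t):
    """Equation (7): exponential term plus continuous background current."""
    return Q_CORR * (1.0 - math.exp(-S_TRUE * t)) + I_FC * t


def q_eq2b(t):
    """Equation (2b) with the quoted best-fit parameters."""
    return Q_FIT * (1.0 - math.exp(-S_FIT * t))


def deviation(t):
    """Ordinate of the deviation pattern: 'measured' minus calculated Q."""
    return q_eq7(t) - q_eq2b(t)


# Tabulate the pattern over an assumed range of 10 to 500 s.
pattern = [(t, deviation(t)) for t in range(10, 501, 10)]
signs = ["+" if d > 0 else "-" for _, d in pattern]
```

Under these assumptions the deviations are positive at short times, negative near 300 s, and positive again near 500 s, which is the two-maxima, one-minimum shape that the later discussion associates with a background current of the same sign as the main current.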
Different deviation patterns are frequently easy to distinguish visually if the random error is very small. For example, the deviation pattern in Figure 10, which is obtained on fitting data that actually correspond to the mechanism

$$A + n_1 e = P \quad (\text{mass-transfer constant} = s_1) \qquad\qquad B + n_2 e = Q \quad (\text{mass-transfer constant} = s_2) \tag{16}$$
Figure 10. Deviation pattern arising from the assumption that the mechanism of a controlled-potential electrolysis is described by equation (1) when it is actually described by equations (16). As compared with the plots in Figures 8 and 9, this is rotated through 90°, with the ordinal numbers of the data points (which increase monotonically with time) being plotted along the vertical axis and deviation being plotted along the horizontal axis. In addition, the abscissa of this plot is compressed at long times because the [...] I, J, and K are explained below. (Printed plot title: DEVIATION PLOT FOR THIS HYPOTHESIS; the printout is annotated with J = 14 and K = 25.53.)
EXAMINATION OF DEVIATION PLOTS

Are there three consecutive positive deviations whose sum exceeds $4.5\,s_Q$?
    No maximum: I = 0
    Maximum: Find those for which the sum is largest, and store the number I of the central point

Are there three consecutive negative deviations whose sum is more negative than $-4.5\,s_Q$?
    No minimum: J = 0
    Minimum: Find those for which the sum is largest, and store the number J of the central point

Excluding any feature already found, identify the three consecutive points for which the sum S is largest. Store the value and sign of S and the number N of the central point as K = (N + |S|/10)*SGN(S).
Figure 11. Scheme for identifying maxima and minima on deviation plots and storing their locations and amplitudes

(with $c_A^0 = c_B^0$, $n_1 = n_2$, and $s_1 = 1.2\,s_2$) to equation (2b), is quite different from the one in Figure 8. Near the end of the electrolysis the slope of the former approaches zero, while that of the latter increases continuously. This is so subtle a difference, however, that it would be completely masked by even a quite small random error, and the same thing is true of many other closely similar patterns. The program therefore embodies a scheme for recognizing patterns that is sensitive only to gross differences of shape that cannot easily be hidden by random errors.

This scheme is shown diagrammatically in Figure 11. The successive values of $(Q_{meas.} - Q_{calc.})$ are inspected to see
whether the sum of three consecutive ones exceeds $4.5\,s_Q$, where $s_Q$ is the estimated standard error of a single measured value of $Q$. In the absence of any systematic error, the probability that any one deviation would equal or exceed $1.5\,s_Q$ is equal to 0.0668; the probability that three consecutive ones would do so is equal to $(0.0668)^3 = 3 \times 10^{-4}$. If there are no three consecutive points that satisfy this criterion, it is concluded that there is no maximum on the deviation plot, and a quantity $I$ is set equal to zero. If there are three such points, the three giving the largest sum of the values of $(Q_{meas.} - Q_{calc.})$ are identified, and $I$ is set equal to the ordinal number $N$ of the middle one of these points.

Next a minimum is sought in a similar way. For this purpose a minimum is arbitrarily defined as a set of three consecutive points for which the sum of the values of $(Q_{meas.} - Q_{calc.})$ is more negative than $-4.5\,s_Q$. The deepest minimum is identified and a second quantity $J$ is set equal to the ordinal number of the point at its center; if there is no minimum, $J$ is set equal to zero.

If the quantities $I$ and $J$ differ from zero, their values identify the largest maximum and the largest minimum on the deviation plot. Values of $N$ close to either $I$ or $J$ are excluded from a third search, which identifies the set of three consecutive points for which the absolute value of the sum $S$ of the deviations is largest. When this has been done, a third quantity $K$ is set equal to $[N + \mathrm{ABS}(S/10)] \cdot \mathrm{SGN}(S)$, where $N$ is the ordinal number of the middle one of this set of three points.
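The search for $I$, $J$, and $K$ can be sketched in Python as below. Several details are assumptions rather than the authors' code: the test is applied to the sum of each run of three deviations as the text states (Figure 11 additionally speaks of positive or negative deviations), "close to" is taken to mean a central point within `exclude=2` positions of $I$ or $J$, and $S$ is used in whatever units the deviations carry. The sample deviations are invented for illustration.

```python
import math


def upper_tail(z):
    """P(Z >= z) for a standard normal variate."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def examine_deviations(dev, s_q, exclude=2):
    """Return (I, J, K) for a list of deviations (Q_meas - Q_calc).

    Ordinal numbers are 1-based, as in the text. 'exclude' is a
    hypothetical reading of "values of N close to either I or J".
    """
    if len(dev) < 3:
        raise ValueError("at least three deviations are needed")
    limit = 4.5 * s_q
    # Sum of each run of three, keyed by the ordinal number of its centre.
    sums = {c + 1: dev[c - 1] + dev[c] + dev[c + 1]
            for c in range(1, len(dev) - 1)}
    n_max = max(sums, key=sums.get)
    i_feat = n_max if sums[n_max] > limit else 0
    n_min = min(sums, key=sums.get)
    j_feat = n_min if sums[n_min] < -limit else 0
    # Third search, excluding runs centred near a feature already found.
    found = [n for n in (i_feat, j_feat) if n]
    rest = {n: s for n, s in sums.items()
            if all(abs(n - f) > exclude for f in found)}
    if not rest:
        return i_feat, j_feat, 0.0
    n_k = max(rest, key=lambda n: abs(rest[n]))
    s_k = rest[n_k]
    sign = (s_k > 0) - (s_k < 0)          # SGN(S)
    k_feat = (n_k + abs(s_k) / 10.0) * sign
    return i_feat, j_feat, k_feat


# Invented deviations (mF): a maximum, a minimum, and a lesser maximum.
dev = [0.0] * 26
dev[2:5] = [0.002] * 3
dev[12:15] = [-0.003] * 3
dev[23:26] = [0.0018] * 3
I, J, K = examine_deviations(dev, s_q=1e-3)

p_single = upper_tail(1.5)     # about 0.0668
p_triple = p_single ** 3       # about 3e-4
```

For this invented pattern the sketch gives $I = 4$, $J = 14$, and a positive $K$ whose integer part is 25, that is, two maxima and one minimum. The product $I \cdot J$ is then non-zero, which is the condition the text uses for rejecting the hypothesis under test.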
From the resulting values of $I$, $J$, and $K$ it is possible to tell whether the plot has two maxima and one minimum or two minima and one maximum, where these features lie with respect to its $t$-axis, and how large the lesser maximum or minimum is.

Two possibilities now arise. One is that $I \cdot J$ is equal to zero, which logically means that the plot lacks either a maximum or a minimum according to the above definitions, and usually in practice means that it lacks both. In this case the hypothesis that equation (2b) accounts for the curve is accepted, and a final segment of the program is called into core to provide an appropriate message. We shall return to this in a later paragraph.

The other possibility is that $I \cdot J$ has some finite value. In this case the hypothesis is rejected. Since the probability that either $I$ or $J$ will be assigned a finite value if the hypothesis is correct is only $3 \times 10^{-4}$, the probability that the hypothesis will be incorrectly rejected is only $10^{-7}$. The smallness of this figure reflects, but considerably exaggerates, our bias in favor of accepting the simplest tenable hypothesis, and it will probably be revised upward in further work.

Normally it would be possible to choose either of two alternative hypotheses at this stage: one if the deviation plot had two maxima and one minimum, and another if it had two minima and one maximum. However, it is an unusual feature of the present
problem that curves of both these shapes can arise from a continuous faradaic background current. If this has the same sign as the current for the reduction or oxidation of A, the plot has two maxima and one minimum and resembles Figure 9. If it has the opposite sign, the plot has two minima and one maximum. Continuous faradaic background currents are rarely positive (according to a recent IUPAC recommendation, positive currents correspond to anodic processes at the working electrode, and negative ones to cathodic processes; this recommendation is followed here, but it is directly opposite to established practice among electroanalytical chemists), but since the process being studied may be either cathodic or anodic, provision must be made for both possibilities. Hence it is impossible to effect a branch at this point, and any finite value of the product $I \cdot J$ leads to the tentative adoption of the hypothesis embodied in equation (7). A fit to this equation [...] from the coordinate [...] and applying appropriate small corrections to the values of $Q_\infty$ and $s$ obtained in the preceding fit.

Regression onto equation (7) is then effected in the ordinary way, and a new deviation plot is constructed and examined as before. The hypothesis is accepted if $I \cdot J = 0$; if $I \cdot J$ has a finite value either of two branches may be pursued, depending on what the value of $K$ reveals about the shape of the plot.
For example, it was said earlier that the hypotheses represented by equations (7) and (16) will probably be indistinguishable on a deviation plot constructed from equation (2b) unless the random errors of measurement are very small. However, the shapes of the two patterns are not actually the same, and the difference is clearly revealed on constructing a new plot from equation (7). If there is actually a continuous faradaic background current this plot will consist of points randomly scattered around the $t$-axis. If there are actually two substances undergoing simultaneous but independent reductions, the points will be scattered around the pattern shown in Fig. 12. The sensitivity of this procedure for detecting simultaneous processes may be gauged from the fact that there was a difference of only 20% between the values of the two rate constants employed in calculating the "data" on which this Figure is based.
We turn now to the segments of the program that provide the final output. For the sake of brevity we shall discuss only the one that corresponds to acceptance of the hypothesis represented by equation (2b).

If this equation yields a satisfactory fit, the mechanism may be the one described by equation (1), but there are other possibilities as well. The starting material A may undergo a pseudo-first-order reaction with the solvent or some other major constituent of the solution Z, yielding a product Q that is not electroactive:
$$A + n\,e = P \qquad\qquad A + Z \rightarrow Q \tag{17}$$
In this event the current will decay exponentially with time and equation (2b) will be obeyed, but the value of $n_{app.}$ calculated from equation (5) will be smaller than that of $n$. There may be an "induced" reduction of Z:

$$A + Z = AZ \qquad AZ + n'e = A + Z^{*} \qquad A + n\,e = P \tag{18}$$

Equation (2b) will again be obeyed, but the value of $n_{app.}$ will exceed that of $n$. The starting material may be reduced in two different ways to yield two different products:

$$A + n_1 e = P \qquad\qquad A + n_2 e = Q \tag{19}$$
Equation (2b) will be obeyed, and $n_{app.}$ may be either smaller or larger than the value expected.

The existence of these possibilities makes it necessary to compare the value of $Q_\infty$ with that of $n N_A^0$ [equation (3)]. The variance of $Q_\infty$ depends on the standard error of the individual
Figure 12. Deviation pattern arising from the assumption that the dependence of Q on t is described by equation (7) when the mechanism is actually described by equations (16)
AN ACCEPTABLE FIT IS OBTAINED ON ASSUMING THE I-T CURVE TO BE EXPONENTIAL. THE DEVIATION PLOT IS FEATURELESS AND THE STANDARD ERROR OF THE FIT IS CONSISTENT WITH THE ESTIMATE PROVIDED. A ONE-TAILED CHI^2 TEST INDICATES THAT MEANINGFUL NON-RANDOM ERRORS ARE ABSENT AT THE 98.66% LEVEL OF CONFIDENCE. THIS FACT WAS NOT USED IN MAKING THE CLASSIFICATION BUT TENDS TO SUPPORT ITS CORRECTNESS.

THE CONCLUSION IS THAT THE PROCESS OCCURS IN A SINGLE STEP AND THAT ITS RATE IS MASS-TRANSFER-CONTROLLED: A + 1 E = P WITH N<APP.> = .9985316, IN ACCEPTABLE AGREEMENT WITH THE EXPECTED VALUE, AND S = .01000633 S-1.

IF THIS CONCLUSION IS [...] VALUE OF N<APP.> WILL [...] CONCENTRATION OF A, [...] LONG AS THIS REMAINS ON THE PLATEAU OF THE WAVE OF A.

THERE MAY ALSO BE A PRIOR EQUILIBRIUM BETWEEN A AND ITS PRECURSOR Y: Y + Z = [...]
Figure 13. Typical final output obtained with synthetic data conforming to the mechanism described by equation (1)

values of Q and on the number of data points and the manner in which these are distributed along the t-axis (16). We chose to adopt a fixed data-acquisition schedule, which was being used by one of us for research in this area eighteen years before the present work was begun. This enabled us to calculate the ratio s_Q∞/s_Q and to use it in assessing the difference between the calculated and expected values of Q∞. If this difference is smaller than, say, 5 s_Q∞, it may be concluded that the mechanism is adequately represented by equation (1). There may in addition be a fast prior equilibrium between the electroactive substance A and some other species Y:

$$\mathrm{Y} + \mathrm{Z} = \mathrm{A}\ (\text{fast}); \qquad \mathrm{A} + n\,e = \mathrm{P} \tag{20}$$
but none of the many other possibilities yet envisioned will give rise both to an exponentially decaying current and to a value of Q∞ coinciding with the prediction of equation (3). Should the calculated and expected values of Q∞ disagree, the sign of their difference is used to produce a message identifying the most likely possibilities and telling how they can be tested. Typical
examples of the final output from this portion of the program are shown in Figures 13 and 14. There are many other possibilities, but a full discussion of them would be impossible in the space available.

Results and Discussion

How reliable are the classifications effected by such a program? This one has been tested in two ways: with experimental data for systems believed to be thoroughly well understood, and
AN ACCEPTABLE FIT HAS BEEN OBTAINED BY ASSUMING THAT THE I-T CURVE IS THE SUM OF AN EXPONENTIALLY DECAYING CURRENT AND A CONSTANT ONE. THE DEVIATION PLOT IS FEATURELESS AND THE STANDARD ERROR OF THE FIT IS CONSISTENT [...] A ONE-TAILED CHI^2 TEST [...] ERRORS ARE ABSENT AT THE 100% LEVEL OF CONFIDENCE. THIS FACT WAS NOT USED IN MAKING THE CLASSIFICATION BUT TENDS TO SUPPORT ITS CORRECTNESS.

THE CONSTANT CURRENT MAY BE A CONTINUOUS FARADAIC BACKGROUND CURRENT EQUAL TO 9.645899E-6 MF/S = .9305274 MA.

THIS CONCLUSION SHOULD BE CONFIRMED BY OTHER EXPERIMENTS IN WHICH THE STIRRING EFFICIENCY, INITIAL CONCENTRATION OF A, AND E<W.E.> ARE VARIED. IF IT IS CORRECT, THE CLASSIFICATION AND VALUE OF I<F,C> WILL BE INDEPENDENT OF STIRRING EFFICIENCY AND CONCENTRATION OF A, BUT I<F,C> WILL VARY EXPONENTIALLY WITH E<W.E.>. MOREOVER, A VIRTUALLY EQUAL STEADY-STATE CURRENT SHOULD BE OBTAINED IN ELECTROLYSES OF THE SUPPORTING ELECTROLYTE ALONE AT THE SAME E<W.E.>.

THE EXPONENTIALLY DECAYING CURRENT CORRESPONDS TO A SINGLE-STEP MASS-TRANSFER-CONTROLLED PROCESS, A + NE = P, WITH N<APP.> = .9987849 AND S = 9.995575E-3 S-1. THIS CONCLUSION SHOULD BE CONFIRMED BY OTHER EXPERIMENTS IN WHICH THE STIRRING EFFICIENCY, INITIAL CONCENTRATION OF A, AND E<W.E.> ARE VARIED. IF IT IS CORRECT, THE CLASSIFICATION AND THE VALUE OF N<APP.> WILL BE UNAFFECTED BY THESE CHANGES. THERE MAY ALSO BE A FAST PRIOR EQUILIBRIUM BETWEEN A AND ITS PRECURSOR Y: Y + Z = A <FAST>; A + NE = P. IN THIS CASE THE CLASSIFICATION WILL CHANGE

READY

Figure 14. Typical final output obtained with synthetic data conforming to equation (7)
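The two values printed for the constant current in Figure 14 appear to be the same quantity in two units. The sketch below checks that reading with Faraday's law. It assumes that "MF" in the printout stands for millifaradays (millimoles of electrons) and uses a modern value of the Faraday constant; the function name `mf_per_s_to_ma` is hypothetical and is not part of the authors' program.

```python
# Assumption: "MF" in the printout means millifaradays (mmol of electrons).
FARADAY = 96485.33  # coulombs per mole of electrons (modern value, assumed)


def mf_per_s_to_ma(rate_mf_per_s):
    """Convert a rate in millifaradays per second to a current in milliamperes."""
    faradays_per_s = rate_mf_per_s * 1e-3   # mF/s -> F/s
    amperes = faradays_per_s * FARADAY      # F/s -> C/s, i.e. amperes
    return amperes * 1e3                    # A -> mA


current_ma = mf_per_s_to_ma(9.645899e-6)
print(f"{current_ma:.4f} mA")
```

Under these assumptions the result agrees with the printed .9305274 MA to within about 0.02%, the small difference being attributable to the value of the Faraday constant used.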
3. AN APPARENTLY CONSTANT CURRENT HAVING THE OPPOSITE SIGN FROM THE CURRENT FOR REDUCTION OR OXIDATION OF A, AND EQUAL TO APPROXIMATELY 3.500000E-6 MF/S = .35 MA, WHICH IS TOO SMALL FOR POSITIVE IDENTIFICATION BY THIS PROGRAM. A CONSTANT CURRENT OF THE MAGNITUDE GIVEN ABOVE SHOULD BE OBSERVED IN THE BLANK SOLUTION AT THE SAME E<W.E.>, AND ITS VALUE SHOULD BE SUPPLIED IN THE INITIAL DIALOGUE WHEN THIS PROGRAM IS RE-RUN.

Figure 15. A portion of the final output obtained when $i_{f,c}/(i^0 - i_{f,c})$ is between $3 \times 10^{-4}$ and $8.5 \times 10^{-4}$. The value given for the constant current is calculated from an empirical relation between $i_{f,c}/(i^0 - i_{f,c})$ and $Q_{\infty,\mathrm{calc}} - Q_{\infty,\mathrm{expected}}$.
with synthetic data obtained by superimposing normally distributed random errors on the theoretical values calculated from exact equations. As the work is still in progress, we can provide a full report only on discriminations between the simplest mechanism [equation (1)] and [...] background current is also [...]

There are three different results that may be obtained, and it is convenient to describe them in terms of the ratio of the continuous faradaic background current to the initial current for the reduction of A. This ratio corresponds to $i_{f,c}/(i^0 - i_{f,c})$ in equation (6). The values cited in the following paragraphs reflect certain arbitrary decisions made in constructing the program, and have no absolute significance. Nevertheless they correspond surprisingly well to the levels at which an informed human chemist would draw lines between reasonable assurance and substantial doubt in interpreting the same data.

1. $i_{f,c}/(i^0 - i_{f,c}) \geq 0.001$. Thirty-seven sets of data fulfilling this condition were analyzed; every one was classified as involving a continuous faradaic background current, or some phenomenon that even a human chemist could not distinguish from this on the basis of a single experiment.

2. $i_{f,c}/(i^0 - i_{f,c}) \leq 0.00025$. Forty-four sets of data fulfilling this condition were analyzed; every one was classified as corresponding to a simple process or, again, something indistinguishable from this.

3. $0.0003 \leq i_{f,c}/(i^0 - i_{f,c}) \leq 0.00085$. Twenty-four sets of data fulfilling this condition were analyzed.
For each one equation (2b) provided a satisfactory fit, but yielded a value of Q∞ that was not in acceptable agreement with the expected one. The hypothesis represented by equation (1) is then accepted and the final printout includes the paragraph shown in Figure 15 as one of several possible explanations of the error in Q∞.

It is not surprising that the boundary between regions 2 and 3 should be as sharp as it is. There is a very direct relation between the value of $i_{f,c}/(i^0 - i_{f,c})$ and the value of Q∞ obtained from a fit to equation (2b), and the standard error of Q∞ is so small that even a small variation of $i_{f,c}/(i^0 - i_{f,c})$ produces a comparatively large variation of Q∞.
However, we were not prepared to find that the boundary between regions 1 and 3 is equally sharp. The boundary between regions 2 and 3 is governed by purely statistical considerations, but the one between regions 1 and 3 depends on the possibility of identifying the features of a deviation pattern in the face of random errors that occasionally cloak them against visual detection. We see no way of avoiding the conclusion that deviation pattern recognition in the fashion outlined here is no less reliable and useful than the familiar tests prescribed by classical statistics.

Conclusion

There appear to be two, and only two, reactions that can be manifested by the chemist watching the execution, or inspecting the final output, of a properly constructed classification program. Rather to our surprise [...] One, which was mentioned [...] a human being's devoted application of understanding and intuition can be so accurately duplicated by a machine. As such programs become more common in the years to come, no matter whether they involve pattern recognition, the sort of numerical analysis outlined here, or some other technique not yet devised, this incredulity will probably be the chief deterrent to their wide and ready adoption.
Those who write such programs, and who study the principles on which such programs can be based, should realize that they appear to much of the scientific community to be challenging the worth of the human being in science, and would do well to face the psychological and philosophical problems that are involved.

The other common reaction is to personify the computer executing the program. This is especially apt to happen if the numerical computations and other processes that produce and guide the final classification are concealed during execution, and if the final output includes possibilities that the watcher recognizes as being correct but did not think of or remember until the computer terminal printed them out. It is oddly easy to forget the role that the programmer has played in paving the path from initial input to final printout.

Whatever the extent to which these problems may hinder and delay the achievement of general popularity and widespread use by machine-classification procedures, there is a very real foundation for the enthusiasm that their proponents have evinced at this symposium. Believing that the length of his scientific career exceeds that of any other member of this group, the senior author of this paper wishes to express his conviction that he has seen no other development in scientific thought, knowledge, or methodology, with the single exception of on-line data acquisition and processing, that has had anything like the same promise of
changing the structure of science and the professional lives of those who pursue its ends. For this reason he hopes that this symposium will swell the ranks of those who are active in its field.

Literature Cited
1. Brand, M. J. D., and G. A. Rechnitz, Anal. Chem., (1970), 42, 1170.
2. Isbell, A. F., Jr., R. L. Pecsok, R. H. Davies, and J. H. Purnell, Anal. Chem., (1973), 45, 2363.
3. Ingman, F., A. Johansson, S. Johansson, and R. Karlson, Anal. Chim. Acta, (1973), 64, 113.
4. Barry, D. M., and L. Meites, Anal. Chim. Acta, (1974), 68, 435.
5. Barry, D. M., B. H. Campbell, and L. Meites, Anal. Chim. Acta, (1974), 69, 143.
6. Campbell, B. H., an [...]
7. Meites, L., and E. Matijević, [...] (1975) 423.
8. Meites, L., Pure Appl. Chem., (1969), 18, 35.
9. Meites, L., "Controlled-Potential Electrolysis," Chap. IX in "Physical Methods of Chemistry (Vol. I of 'Techniques of Chemistry,' ed. by A. Weissberger and B. W. Rossiter) Part IIA. Electrochemical Methods," Wiley-Interscience, New York, 1971.
10. Bard, A. J., and K. S. V. Santhanam, "Application of Controlled Potential Coulometry to the Study of Electrode Reactions," in "Electroanalytical Chemistry," ed. by A. J. Bard, Marcel Dekker, Inc., New York, Vol. 4, 1972.
11. Lingane, J. J., "Electroanalytical Chemistry," Interscience, New York, 2nd ed., 1958, pp. 222-9.
12. Meites, L., "The General Multiparametric Curve-Fitting Program CFT4," Computing Laboratory of the Department of Chemistry, Clarkson College of Technology, Potsdam, N.Y., 1976.
13. Meites, L., and L. Lampugnani, Anal. Chem., (1973), 45, 1317.
14. Meites, L., and D. M. Barry, Talanta, (1973), 20, 1173.
15. Campbell, B. H., L. Meites, and P. W. Carr, Anal. Chem., (1974), 46, 386.
16. Meites, L., Anal. Chim. Acta, (1975), 74, 177.
8

Examples of the Application of Nonlinear Regression Analysis to Chemical Data

Y. C. MARTIN and J. J. HACKBARTH
Abbott Laboratories, North Chicago, IL 60064
Nonlinear regression analysis is a powerful mathematical tool which has been used [...] not achieved widespread application in chemistry. It is the purpose of this communication to illustrate some of the circumstances in which we have found the method to be useful. The objective is to encourage others to use this technique for other data-fitting problems.

Introduction to Nonlinear Regression Analysis

What is nonlinear regression? How does it differ from linear regression? Linear regression analysis is the process of finding the least-squares best fit of a set of data to a uni- or multidimensional equation that is linear in the parameters (coefficients) to be fit. The simplest linear regression analysis involves fitting data to a single independent variable; as the name implies, the equation is that of a straight line:

$$Y_i = b_0 + b_1 x_{1i} + \varepsilon \tag{eq. 1}$$

In equation 1, $Y_i$ is the observed value of the dependent variable for observation i, $x_{1i}$ is the value of the independent variable for observation i, $b_0$ is the intercept of the line on the Y axis, $b_1$ is the slope of the line, and $\varepsilon$ is the error.
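For the straight line of equation 1 the least-squares intercept and slope can be written down directly. The following is a minimal sketch of that closed-form solution; the helper name `fit_line` and the data are hypothetical and serve only as illustration.

```python
def fit_line(x, y):
    """Closed-form least-squares estimates of b0 (intercept) and b1 (slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products about the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical, error-free data lying exactly on Y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
b0, b1 = fit_line(x, y)
print(b0, b1)
```

No starting guesses and no iteration are needed here, which is the property that is lost when the equation is nonlinear in its parameters.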
Nonlinear regression analysis is the process of finding the least-squares best fit of a set of data to an equation which is not linear in the parameters to be fit. A very simple example is:

$$\log(Y_i) = \log(1 + a X_i^{\,b}) + \varepsilon_i \tag{eq. 2}$$

Equation 2 is nonlinear in a and b, the parameters to be fit.
Most chemical relationships are nonlinear: one familiar example is the fraction of an acid ionized, α, as a function of pH and pKa:

$$\alpha = \frac{1}{10^{\,\mathrm{p}K_a - \mathrm{pH}} + 1} \tag{eq. 3}$$

If some physical property were linearly related to α, then the observed variables would be the physical property in question and pH. The pKa would be the parameter to be fit.

How does one determine what is the "best" fit in the nonlinear case? As with linear regression, the least-squares criterion for best fit is commonly used. It is defined as that choice of value for the adjustable parameters ($b_0$ and $b_1$ in Equation 1 or a and b in [...] difference between the [...] the basis of the $X_i$'s [...] Mathematically, [...] involves taking the partial derivative with respect to each of the parameters of the equation for the sum of the squared differences, setting this derivative equal to zero, and solving for the parameter. In the linear case this all works very well: the values of $b_0$ and $b_1$ can be explicitly determined. Nonlinear equations do not yield such an easy solution for the minimum sum of squares; hence, in order to find the best fit to a nonlinear equation an iterative procedure must be used. One starts with a set of best guesses for the values of the parameters to be fit. The sum of squared deviations between observed and calculated $Y_i$'s is then calculated. By some algorithm another set of (better) estimates is chosen and the sum of squared deviations is calculated from these values. This process continues until, by some pre-established criterion, further changes in the estimates do not decrease the sum of squared deviations from the fit.
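The iterative procedure just described can be sketched in a few lines. The example below fits equation 2, read here as log(Y) = log(1 + a·X^b), by a deliberately simple coordinate search: each parameter is nudged up and down, any change that lowers the sum of squares is kept, and the step is halved when nothing helps. This is only one possible "algorithm for choosing better estimates", not the one used by the authors; the function names, the starting guesses, and the error-free data are all hypothetical.

```python
import math


def model(x, a, b):
    """Equation 2 without the error term: log10(1 + a * x**b)."""
    return math.log10(1.0 + a * x ** b)


def sum_sq(params, xs, ys):
    """Sum of squared deviations between observed and calculated values."""
    a, b = params
    return sum((y - model(x, a, b)) ** 2 for x, y in zip(xs, ys))


def coordinate_search(xs, ys, start, step=0.5, tol=1e-10, max_iter=10000):
    """Improve the estimates until no step of size >= tol lowers the sum of squares."""
    best = list(start)
    best_ss = sum_sq(best, xs, ys)
    for _ in range(max_iter):
        improved = False
        for i in range(len(best)):
            for delta in (step, -step):
                trial = list(best)
                trial[i] += delta
                if trial[0] <= 0.0:      # keep 1 + a*x**b positive
                    continue
                ss = sum_sq(trial, xs, ys)
                if ss < best_ss:
                    best, best_ss, improved = trial, ss, True
        if not improved:
            step /= 2.0                  # refine the search
            if step < tol:
                break
    return best, best_ss


# Hypothetical error-free data generated with a = 2.0 and b = 1.5.
xs = [0.5, 1.0, 2.0, 4.0, 8.0]
ys = [model(x, 2.0, 1.5) for x in xs]
(a_fit, b_fit), ss = coordinate_search(xs, ys, start=(1.0, 1.0))
print(a_fit, b_fit, ss)
```

Every pass through the loop costs several evaluations of the sum of squares, which is why the number of such evaluations is raised below as a practical concern.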
From this brief description it can be seen that nonlinear regression analysis suffers from several apparent disadvantages compared to linear regression. An initial estimate of the parameter values must be supplied, an algorithm for finding the minimum sum of squares must be provided, and many calculations of this sum of squares are required for the solution to one problem. In view of these difficulties, the traditional method for fitting a nonlinear equation has been to transform the equation into a linear form and fit the data to this transformed equation. The disadvantages of such a linearization strategy are that it may involve hours of algebraic manipulation, that frequently assumptions must be made as to the range of the data or the importance of terms in a sum, and that the resulting equation implicitly weights the data in a manner which may not be consistent with experiment.
On the other hand, advances in computer technology have made these disadvantages of nonlinear regression analysis relatively unimportant. Initial estimates are easily determined with standard graphics techniques in which the data and calculated curve are displayed on a cathode ray tube. Algorithms to find minima and maxima are easy to implement. Finally, the computation of many sums of squares is a trivial task for a computer. The major advantages of nonlinear regression analysis are that one fits data to the equation as derived and that simplifying assumptions are not necessary.

Computer Programs

As noted above, in the analysis of nonlinear problems we have found it convenient to be able to display the data and a calculated curve on a cathode ray tube. The parameters of the calculated curve may [...] the reasonable fit to the data [...] parameter space for estimates is not difficult; in fact it often reveals unsuspected facets of the data. A companion program takes the same data file and generates instructions for a Calcomp plotter to make a hard copy of what was seen on the screen. We have written programs for two dimensions (one independent variable) and three dimensions (two independent variables). We use a simplex method to find the minimum sum of squares (9,10). The algorithm includes expansion and contraction of dimensions of the simplex. The statistical properties of the best fit are calculated by the equations used in the BMD P-series (11).
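A simplex search with expansion and contraction can be sketched as follows. This is a generic Nelder-Mead routine written for illustration, not the authors' program; the coefficients (reflection 1, expansion 2, contraction 0.5, shrink 0.5) are the conventional textbook choices and are assumed here. The test problem, a property equal to `span` times the fraction ionized of equation 3, and all names and data are hypothetical.

```python
def nelder_mead(f, start, step=0.5, tol=1e-12, max_iter=2000):
    """Minimize f by a simplex search with reflection, expansion, contraction, shrink."""
    n = len(start)
    simplex = [list(start)]
    for i in range(n):                       # n extra vertices, one per parameter
        vertex = list(start)
        vertex[i] += step
        simplex.append(vertex)
    values = [f(v) for v in simplex]
    for _ in range(max_iter):
        order = sorted(range(n + 1), key=lambda k: values[k])
        simplex = [simplex[k] for k in order]
        values = [values[k] for k in order]
        if values[-1] - values[0] < tol:     # pre-established stopping criterion
            break
        worst = simplex[-1]
        centroid = [sum(v[j] for v in simplex[:-1]) / n for j in range(n)]
        reflected = [c + (c - w) for c, w in zip(centroid, worst)]
        f_ref = f(reflected)
        if f_ref < values[0]:                # very good: try expanding further
            expanded = [c + 2.0 * (c - w) for c, w in zip(centroid, worst)]
            f_exp = f(expanded)
            if f_exp < f_ref:
                simplex[-1], values[-1] = expanded, f_exp
            else:
                simplex[-1], values[-1] = reflected, f_ref
        elif f_ref < values[-2]:             # acceptable: keep the reflection
            simplex[-1], values[-1] = reflected, f_ref
        else:                                # poor: contract toward the centroid
            contracted = [c + 0.5 * (w - c) for c, w in zip(centroid, worst)]
            f_con = f(contracted)
            if f_con < values[-1]:
                simplex[-1], values[-1] = contracted, f_con
            else:                            # last resort: shrink toward the best vertex
                best = simplex[0]
                simplex = [best] + [[b + 0.5 * (x - b) for b, x in zip(best, v)]
                                    for v in simplex[1:]]
                values = [values[0]] + [f(v) for v in simplex[1:]]
    k = min(range(n + 1), key=lambda i: values[i])
    return simplex[k], values[k]


def fraction_ionized(pH, pKa):
    """Equation 3: fraction of an acid in the ionized form."""
    return 1.0 / (10.0 ** (pKa - pH) + 1.0)


# Hypothetical error-free data: property = span * alpha, with pKa = 4.7 and span = 0.8.
pHs = [2.0, 3.0, 4.0, 4.5, 5.0, 5.5, 6.0, 7.0]
obs = [0.8 * fraction_ionized(pH, 4.7) for pH in pHs]


def sum_sq(params):
    pKa, span = params
    return sum((o - span * fraction_ionized(pH, pKa)) ** 2 for pH, o in zip(pHs, obs))


(pKa_fit, span_fit), ss = nelder_mead(sum_sq, start=[4.0, 1.0])
print(pKa_fit, span_fit, ss)
```

The method needs only values of the sum of squares, never its derivatives, which makes it easy to apply to a new equation without further algebra.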
General Description of Examples

The following discussion illustrates the two principal types of application which we have made of nonlinear regression analysis. The first type of application is the calculation of physical properties of a molecule from experimental observations. We have chosen as an example the calculation of the pKa's of a dibasic substance from measurements of absorbance vs pH. The other two examples are from our work on the quantitative relationship between physical properties and biological potency of drug analogs. Variation in antibacterial potency of nitrophenols and erythromycins as a function of structure and pH of the test is examined.

Example 1: pKa's of a Dibasic Substance from Absorbance Measurements
Recently we wanted to know the pKa's of the following compound:

[Structure of Compound I]

Since it is not soluble enough to titrate, we measured its ultraviolet spectrum with changes in degree of protonation, i.e. as a function of pH. Essentially the method described by Albert and Serjeant was followed (12). That is, the ultraviolet spectrum of the compound was recorded in buffer solution at several pH's near the suspected pKa [...] 304 nm as a function of [...] over such a wide pH interval, we concluded that the absorbance change reflected more than one pKa. The first nonlinear function we attempted to fit was that given by Albert and Serjeant (12):

$$A = \frac{c_t a_d [\mathrm{H}^+]^2 + c_t a_m [\mathrm{H}^+] K_1 + c_t a_n K_1 K_2}{[\mathrm{H}^+]^2 + [\mathrm{H}^+] K_1 + K_1 K_2} \tag{eq. 4}$$
in which $c_t$ is the total concentration of the compound; $a_d$, $a_m$, and $a_n$ are the molar absorptivities of the dication, monocation, and neutral forms respectively; and $K_1$ and $K_2$ are the first and second acid dissociation constants. This form of the equation led to problems in fitting, particularly with data from tribasic analogs. Solution of Equation 4 in terms of $[\mathrm{H}^+]$ led to an algebraically very complex relationship. So we manipulated Equation 4 further until we realized the following:

$$A\left([\mathrm{H}^+]^2 + [\mathrm{H}^+] K_1 + K_1 K_2\right) = c_t a_d [\mathrm{H}^+]^2 + c_t a_m [\mathrm{H}^+] K_1 + c_t a_n K_1 K_2$$

or

$$0 = (c_t a_d - A)[\mathrm{H}^+]^2 + (c_t a_m - A)[\mathrm{H}^+] K_1 + (c_t a_n - A) K_1 K_2$$
Since all terms add up to a sum of zero, the sum of those with a positive sign must equal the sum of those with a negative sign. The former sum is labelled "POS", and the latter, "NEG". Hence:
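$$\mathrm{POS} = -\mathrm{NEG} \quad\text{or}\quad \frac{\mathrm{POS}}{-\mathrm{NEG}} = 1 \quad\text{or}\quad 0 = \log\frac{\mathrm{POS}}{-\mathrm{NEG}}$$

Hence we chose to fit the following function:

$$\mathrm{pH}_{\mathrm{calc}} = \log\frac{\mathrm{POS}}{-\mathrm{NEG}} + \mathrm{pH}_{\mathrm{obs}} \tag{eq. 5}$$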
The computer program tests the sign of each term in the equation for each data point, sums the positive and negative terms separately, and places the sum of the positive values in the numerator and the sum of the negative values in the denominator. From this type of analysis of the data, the following values were calculated:

pK1 = 2.31 ± 0.02
pK2 = 4.45 ± 0.02
c_t a_d = 0.508 ± 0.001
c_t a_m = 0.316 ± 0.002

c_t a_n was experimentally established from measurements at high pH to be 0.181. The statistics of fit are: R² = .9997, s = .0253, with 6 degrees of freedom.
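The two functions used in this example can be written out as a short sketch. It evaluates equation 4 with the tabulated values above and then applies the POS/NEG bookkeeping of equation 5, as those equations are read here. The function names are hypothetical, and the check uses an absorbance generated from equation 4 itself rather than the authors' measurements, which are not given in the text.

```python
import math


def absorbance(pH, pK1, pK2, ct_ad, ct_am, ct_an):
    """Equation 4: absorbance of a dibasic substance as a function of pH."""
    h = 10.0 ** (-pH)
    k1 = 10.0 ** (-pK1)
    k2 = 10.0 ** (-pK2)
    numerator = ct_ad * h * h + ct_am * h * k1 + ct_an * k1 * k2
    denominator = h * h + h * k1 + k1 * k2
    return numerator / denominator


def ph_calc(pH_obs, A, pK1, pK2, ct_ad, ct_am, ct_an):
    """Equation 5: log(POS / -NEG) + pH_obs, from the three terms that sum to zero."""
    h = 10.0 ** (-pH_obs)
    k1 = 10.0 ** (-pK1)
    k2 = 10.0 ** (-pK2)
    terms = [(ct_ad - A) * h * h, (ct_am - A) * h * k1, (ct_an - A) * k1 * k2]
    pos = sum(t for t in terms if t > 0)   # sum of the positive terms
    neg = sum(t for t in terms if t < 0)   # sum of the negative terms
    return math.log10(pos / -neg) + pH_obs


# Tabulated fit values from the text (c_t*a_n fixed from high-pH measurements).
fit = dict(pK1=2.31, pK2=4.45, ct_ad=0.508, ct_am=0.316, ct_an=0.181)

A_model = absorbance(3.0, **fit)
print(A_model)                        # absorbance predicted at pH 3.0
print(ph_calc(3.0, A_model, **fit))   # reproduces the observed pH for an exact point
```

For a point lying exactly on the curve POS equals -NEG, so the logarithm vanishes and the calculated pH equals the observed one; a measured absorbance that deviates from the curve shifts the calculated pH, and it is this deviation that the fit minimizes.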
Figure 1 shows the curve calculated on the basis of this fit. From every standpoint, the use of nonlinear regression analysis allowed the maximum information to be gained from the data. The precision of fit is satisfactory considering the low number of experimental points involved and the closeness of the two pKa's. We have used this method of calculation to fit the pKa's of tribasic substances from absorbance measurements at several pH's, and also to fit the pKa's of the individual
Figure 1. The variation in absorbance of Compound I at 304 nm as a function of pH (plot title: A-44296, absorbance at 304 vs pH). The curve is calculated from Eq. 4, pK1 = 2.31, pK2 = 4.45, and the absorptivity of the diprotonated species times the concentration equal to 0.580, that of the monoprotonated species equal to 0.316, and that of the nonprotonated species equal to 0.181.
Scheme I. Compartments: aqueous no. 1, nonaqueous, aqueous no. 2, and receptor, with the equilibria between them.
amines of polybasic molecules from the pH dependence of the change in the carbon-13 NMR chemical shift.

Example 2: Antibacterial Potency of Nitrophenols
Our principal research is in the analysis of the relationships between chemical and biological properties of compounds. We have recently become aware of the importance of the form of the equation to which the data are fit. Specifically, a consideration of the general properties of the biological system in which the data were generated, coupled with a statement of the linear free energy assumptions which may be made about the structure-activity relationships of each part of this biological system, can lead to some very useful insights into the form of the equation which should be used to analyze the data.

Our first publication [...] concerned biological system [...] between several compartments; this model applies principally to in vitro tests (13). The equation derived for a four compartment model (Scheme I) is sufficient to correlate the data from the examples discussed in this paper. Compartment one is the aqueous compartment outside of the bacteria, that is, the medium; compartment two is the aqueous compartment inside of the bacteria; compartment three is a nonaqueous, nonreceptor compartment; and compartment four is the receptor. Compartments one and two become identical if their pH's are identical.

From simple linear free energy assumptions about the relationship between binding in a compartment and hydrophobicity of the various analogs as measured by the octanol-water partition coefficient (P), the following equation may be derived:

$$\log(1/C) = \log\left[\frac{1 + Z\,\dfrac{\alpha}{1-\alpha}}{[\ldots] + dP^{c}\,[\ldots]}\right] + X \tag{eq. 6}$$
The symbol α indicates the fraction of the compound in the ionized form at the pH of the particular compartment. The dependent variable is the negative logarithm of the concentration, C, of the compound required to produce the defined biological response. The independent variables are α and P. Finally, the parameters to be fit in the regression analysis are Z, a, b, c, and d. From the model it may be shown that Z is the relative affinity for the receptor of the ion
Figure 2. Two views of the variation in antibacterial activity of nitrophenols as a function of log P and log(1 − α). The curve is calculated from Eq. 7.

compared with that of the neutral form of a drug. The parameters a, b, c, and d are related to the extrathermodynamic relationships between binding constants to the nonaqueous and receptor compartments and log P.

A series of six nitrophenols had been tested vs E. coli at pH's 5.5, 6.5, 7.5, and 8.5 (14). All 24 data points fit the following equation (15):

$$\log(1/C) = \log\left[\frac{1 + Z\,\dfrac{\alpha}{1-\alpha}}{1 + \dfrac{\alpha}{1-\alpha} + dP^{c}\,\dfrac{[\ldots]}{1-\alpha}}\right] + X \tag{eq. 7}$$
8.
Nonlinear Regression Analysis
MARTIN AND HACKBARTH
161
i n which: log(Z) c log(d) X* pK a
» -3.60 » 4.95 = -6,96 = 5.41 = 3.26
± ± ± ± ±
0.17 0.81 1.61 0.10 0.73
The s t a t i s t i c s o f f i t a r e : R
2
+ 0.909,
s » 0.209, and η » 24.
F i g u r e 2 i s a p l o t o f l o g ( l / C ) as a f u n c t i o n o f logP and log(l-a). These parameters correspond t o a case i n which there i s no v a r i a t i o n i n hydrophobi s e r i e s . However, ther t o the nonaqueous compartment. Only one aqueous compartment (at the pH o f the e x t e r n a l medium) i s needed t o f i t the data. A s l i g h t l y b e t t e r f i t i s found when the p K o f p i c r i c a c i d i s allowed t o vary; t h i s p K i s i n d i c a t e d with an a s t e r i s k above. The value o f Ζ i n d i c a t e s t h a t the i o n i c form o f any n i t r o p h e n o l has 4000 X l e s s potency than the n e u t r a l form o f the same compound; t h i s value i s s i g n i f i c a n t l y d i f f e r e n t from zero. a
a
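The quantities α and log(1 − α) used as independent variables above follow from the pH of the compartment and the pKa of the compound. The sketch below assumes a monoprotic acid (such as a nitrophenol) and the usual ionization equilibrium; the function names and the example pKa of 7.0 are hypothetical and are not taken from the original paper. It also converts the fitted log(Z) into the potency ratio quoted in the text.

```python
import math


def fraction_ionized_acid(pH, pKa):
    """Fraction (alpha) of a monoprotic acid present as the anion at a given pH."""
    return 1.0 / (1.0 + 10.0 ** (pKa - pH))


def log_one_minus_alpha(pH, pKa):
    """log10 of the fraction present in the neutral form."""
    return math.log10(1.0 - fraction_ionized_acid(pH, pKa))


# Hypothetical compound with pKa 7.0, evaluated at the four test pH values.
example_pKa = 7.0
table = [(pH, fraction_ionized_acid(pH, example_pKa), log_one_minus_alpha(pH, example_pKa))
         for pH in (5.5, 6.5, 7.5, 8.5)]

# log(Z) = -3.60 from the fit corresponds to a potency ratio of about 1/4000.
potency_ratio = 10.0 ** 3.60
```

For a base the exponent changes sign (pH − pKa instead of pKa − pH), so the same helper cannot be reused unchanged for the erythromycin series.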
Example 3: Antibacterial Potency of N-benzyl Erythromycins (15)

In this series one of the hydrogen atoms of the dimethylamino group of erythromycin, structure below, was replaced with a substituted phenyl group.

[Structure: erythromycin. IA: R = OH, erythromycin A. IB: R = H, erythromycin B.]

The log relative potencies of these compounds as measured at pH 6.00, 7.00, and 7.65 are well fit by the following equation:
$$\log(1/C) = \log\left[\frac{1 + 10^{\mathrm{p}K_a - 6.0}}{1 + 10^{\mathrm{p}K_a - \mathrm{pH}_1} + a\left(1 + 10^{\mathrm{p}K_a - 6.0}\right)}\right] + eE_s + X \qquad \text{(eq. 8)}$$

in which:
log(a) = 1.30 ± 0.23
X = 2.64 ± 0.17
e = 0.303 ± 0.065

The statistics of fit are: R² = 0.844, s = 0.186, n = 38.
The substituent constan[...] the relative size of th[...] phenyl ring. Figure 3 is a plot of potency as a function of the pKa of the compound and the pH of the test. Both the numerator and denominator of Equation 6 are dominated by a single term because the amount of ionic form of the drug bound to the receptor is larger than the amount of neutral form of the drug bound to the receptor, and because most of the drug added to the system remains in the external aqueous compartment. As a consequence of this it was not possible to independently fit a[...], Z, and a[...]. Therefore, pH was arbitrarily set at 6.0, and a hybrid constant, a, was fit:

a = Z [...]

Within the limits stated previously, the values of the constants found by the nonlinear regression analysis lead to certain tentative conclusions with respect to the determinants of potency in this series. First, two aqueous compartments of different pH must be considered in the analysis. Additionally, the pH of the aqueous phase within the bacteria remains relatively constant when the pH of the external phase is varied from 6.0 to 7.65. Second, the ionic form of these compounds contributes significantly to potency. It is not possible to establish the relative potency of the ion vs that of the neutral form, however, because of the problem discussed above. Third, there is no evidence for hydrophobic bonding of the phenyl ring and its substituents to the receptor or to an inert phase. Fourth, substituents at the para position decrease potency by a steric effect.
Figure 3. Two views of the variation in antibacterial activity of N-benzyl erythromycins as a function of pH of the medium and pKa of the compound. The curve is calculated from Eq. 8.
These examples show some of the uses which we have made of nonlinear regression analysis. Once one has a little experience with a program it can easily become a routinely used and extremely helpful tool for the mathematical analysis of chemical data.
Literature Cited

1. Wentworth, W. E., J. Chem. Educ., 42 (1965) 97.
2. Jensen, R. E., R. G. Garvey, and B. A. Paulson, J. Chem. [...]
3. Dye, J. L., and V. A. Nicely, J. Chem. Educ., 48 (1971) 443.
4. Barry, D. M., and L. Meites, Anal. Chim. Acta, 68 (1974) 435.
5. Barry, D. M., L. Meites, and B. H. Campbell, Anal. Chim. Acta, 69 (1974) 143.
6. Meites, L., J. E. Steuhr, and T. N. Briggs, Anal. Chem., 47 (1975) 1485.
7. Gorenstein, D. G., A. M. Wyrwicz, and J. Bode, J. Amer. Chem. Soc., 98 (1976) 2308.
8. Asleson, G. L., and C. W. Frank, J. Amer. Chem. Soc., 98 (1976) 4745.
9. Nelder, J. A., an[...]
10. Olsson, D. M., and L. S. Nelson, Technometrics, 17 (1975) 45.
11. Dixon, W. J., ed., "BMDP, Biomedical Computer Programs," pp 556, University of California Press, Berkeley, CA, 1975.
12. Albert, A., and E. P. Serjeant, "The Determination of Ionization Constants", pp 44-60, Chapman and Hall, London, 1971.
13. Martin, Y. C., and J. J. Hackbarth, J. Med. Chem., 19 (1976) 1033.
14. Cowles, P., and I. M. Klotz, J. Bacteriol., 56 (1948) 277.
15. Martin, Y. C., J. J. Hackbarth, and L. A. Freiberg, submitted for publication.
9

A Computer System for Structure-Activity Studies Using Chemical Structure Information Handling and Pattern Recognition Techniques

A. J. STUPER, W. E. BRUGGER, and P. C. JURS
Department of Chemistry, The Pennsylvania State University, University Park, PA 16802
The study of relationship[...] their biological activit[...] attention. The term biological activity covers a range from pharmaceuticals and drugs to agricultural chemicals such as pesticides and herbicides to toxic reactions such as those of poisons, carcinogens, teratogens, and mutagens.

A variety of methods have been exploited for structure-activity studies:

(1) The semiempirical linear free energy related (LFER) or extrathermodynamic model developed by Hansch and co-workers. The LFER method is applied to homologous series of compounds that are related in that they are formed by placing substituents on a parent compound. The method depends on defining quantitative correlations between physicochemical parameters of a compound and the biological response observed. An equation of the form

$$\log(1/C) = a\pi^2 + b\pi + \rho\sigma + cE_s + d$$

is fit to the set of data using linear regression. The variables are as follows: C is the concentration of the compound necessary to produce a standard biological response; π is the difference between the logarithm of the 1-octanol/water partition coefficient of the parent compound and the substituted compound; σ is the Hammett substituent constant that provides a measure of the electronic effect on the reaction rate; and $E_s$ is a steric factor which compares sizes of substituents to that of methyl taken as a standard.

(2) The de novo or additivity model proposed by Free and Wilson. In this approach the contributions to the parameter defining biological response by each substituent group are assumed to be additive. The equation is

$$A_i = \mu + \sum_j a_{jp}$$

where μ is the overall average activity (the contribution of the constant part of the molecule, the parent structure), $a_{jp}$ is the
contribution to the activity from the jth substituent in the pth position in the parent structure, and $A_i$ is the standard biological response for drug compound i. Regression analysis is used to obtain numerical values for the substituent contributions.

(3) Quantum mechanical methods. These methods have been used to calculate parameters to be correlated with activity and for the determination of preferred conformations of biologically active molecules.

The purpose of the present project was to apply the ADAPT computer system to specific structure-activity problems. The ADAPT computer system combines techniques of chemical structure information handling and pattern recognition for the study of chemical structure-biological activity relations. This system can be used to enter and store a set of diverse chemical structures, generate structural descriptors, and analyze them using pattern recognition methods. These three steps are illustrated in Figure 1.

Several premises ar[...] of structure-activity relations:
- Structure and biological activity are related.
- Structures of compounds can be adequately represented as a set of molecular descriptors.
- A relation can be discovered between the structure and activity by applying pattern recognition methods to a set of tested compounds.
- The relation can be extrapolated to untested compounds.

Introduction to Pattern Recognition
Chemical and biological data are being produced at a prodigious rate. This has led to burgeoning interest in computer assisted methods for the accumulation, handling, and interpretation of these data. Standard approaches to the interpretation problem include statistical interpretation, curve fitting and model fitting. The development or verification of mathematical expressions relating independent variables and observable dependent variables is the goal of such studies. The intent is to create a model whose parameters represent quantities with physical significance. Then best values for the parameters are developed from the data by model fitting. In the absence of a mathematical model, curve fitting using general functions, e.g., polynomials, can be employed. Not all problems faced by the chemist, however, lend themselves to such exacting solution: frequently, equations describing processes of interest are difficult or impossible to obtain, and a host of problems have not yielded to a satisfactory or usable theoretical explanation. In the absence of theoretically-based solutions, empirically-derived methods will often suffice to yield useful and practical solutions to complex problems.

[Figure 1. Steps in [...] procedure: entry and storage of chemical structures → connection tables → descriptor generation → data matrix → pattern recognition analysis.]

Standard approaches to the extraction of information from complex data forms have included linear optimization, information theory, and a plethora of statistical analysis techniques. Since the early 1950's pattern recognition methods have also been applied to a variety of data interpretation problems and have paralleled the computer's growth in speed and sophistication with a corresponding expansion in scope and capacity. Pattern recognition techniques have found application in such varied fields as computer and information science, engineering, statistics, biology, physics, medicine, and physiology. Each of these disciplines has adapted the basic methods of pattern recognition to its own specific requirements.

Pattern recognition comprises the detection, perception, and recognition of invariant properties among sets of measurements of objects or events. The purpose of pattern recognition is generally to categorize a sample of observed data as a member of the class to which it belongs. This general approach has been applied to problems from a number of diverse fields. Several excellent reviews of the pattern recognition literature have appeared which dramatize the enormous breadth of pattern recognition applications (1-5).

There is a growing literature addressed to the applications of pattern recognition to chemical data interpretation. Pattern recognition methods are uniquely suited to a variety of studies because of several novel attributes. No mathematical model is used, but rather relationships are sought which provide definitions of similarity between diverse groups of data.
Pattern recognition techniques are able to deal with high dimensional data (data for which more than three measurements are used to represent each object). Such high dimensional data cannot be directly visualized or displayed. In addition, pattern recognition techniques can deal with multisource data or data in which the relationships are discontinuous. In multisource data each measurement can
be the result of an independent generating algorithm or experiment, and each can have a different scale, origin, distribution, etc. from all the other measurements. Therefore, there will be no direct functional relationship between the measurements in multisource data as there must be, for example, in an absorbance vs. concentration plot. In applications of pattern recognition to structure-activity relations, the data are always multisource data. For problems providing multisource data, it is difficult to know in advance whether an appropriate set of measurements has been generated to effect a satisfactory solution. The generation of sufficiently informative multisource measurements can become in itself a major part of the overall pattern recognition experiment. When a number of measurements are available, pattern recognition can be used to judge their relative quality or utility with regard to specific questions. It is this ability to define relations through use of a diverse set of measurements which affords pattern recognition techniques their utility in such a wide variety of fields. When properly used [...] chemist to develop criteri[...] presenc[...] propertie[...] to a particular sub-set of the total number of measurements. Once the important measurements are identified, they can be used to guide the development of subsequent experiments.
For example, if a chemist were to find that ten structural parameters were important indicators of a particular biological effect, then he might hypothesize several as yet unstudied structures, and use the results from the pattern recognition analysis to make an educated guess as to their effects. Alternatively, the fact that the particular ten parameters were shown to be important may lead to added insights into the problem. This ability to pick a subset of the original measurements which contains the bulk of the total information content is extremely desirable. As relations between several variables are not easily deduced through observation, this is an extremely useful capability of pattern recognition.

Basic Pattern Recognition System. A general pattern recognition system for structure-activity studies must be capable of accepting numerical descriptors from the descriptor development routines, performing prior feature selection, preprocessing the data, and classifying the compound. A schematic representation of this basic system is shown in Figure 2. It consists of four interrelated subunits: prior feature selection, preprocessing, classification, and feedback feature selection. The prior feature selection routine accepts the data to be classified and transforms them to make the classification task easier.
Then, the preprocessor attempts to pursue the following two goals simultaneously: (a) to reduce or eliminate the fraction of information contained in the raw data that is irrelevant or even confusing; and (b) to preserve sufficient information to allow discrimination among the pattern classes. The classifier operates on the transformed pattern vector to produce a classification decision. The feedback loop indicates that the pattern recognition system may use the results of its classification to develop a superior feature extraction approach. The entire pattern recognition system is generally implemented with computer software.

Classifiers. Methods of classification fall naturally into two categories: parametric and nonparametric methods. Parametric training methods consist of estimating the statistical parameters of the samples forming the training set and then using these statistical parameters for the specification of the discriminant function. Nonparametric discriminant functions are developed directly from a sample of data themselves.

Learning Machines. Data to be used in pattern recognition studies are represented as vectors, $X = (x_1, x_2, \ldots, x_n)$, where $x_j$ represents one observation. Structures of molecules can be coded in this format usin[...]es. For example, entrie[...]bers of oxygen atoms, length, volume, lipophilicity, dipole moment, number of times a particular substructure is imbedded in the structure, etc. For computational convenience an extra descriptor, whose value is set equal to a constant, is added to each pattern vector. Data represented as vectors can be thought of either as points in an n-dimensional Euclidean space or as vectors pointing from the origin to those points, hence pattern vectors.
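As a small illustration of the vector representation just described, the sketch below builds a pattern vector from a list of descriptor values and appends the extra constant descriptor. The function name, the choice of 1.0 as the constant, and the example descriptor values are hypothetical; the text only states that a constant is added.

```python
def make_pattern_vector(descriptors, constant=1.0):
    """Return the descriptors as floats with one constant component appended."""
    return [float(value) for value in descriptors] + [float(constant)]


# Hypothetical descriptors for one molecule:
# number of oxygen atoms, molecular volume, dipole moment.
pattern = make_pattern_vector([2, 118.4, 1.7])
```

The appended constant lets a decision surface that is expressed as a dot product with the pattern vector lie away from the origin of the original descriptor space.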
Thus, a set of data such as a collection of mass spectra or a set of suitably encoded chemical structures can be represented as a set of n-dimensional pattern vectors. Experience shows that points representing patterns with common characteristics cluster in limited regions of the pattern space. For example, a set of points representing the molecular structures of compounds active as tranquilizers may cluster in a different region.

There is an important relationship connecting the number of points in a data set, m, and the number of descriptors per point, n, the dimensionality of the space. As shown by Nilsson (6) and by Tou and Gonzalez (7), the ability of a binary pattern classifier to separate points is high, even for random points, if m is less than twice as large as n. The probability of finding a linear decision surface capable of separating any randomly placed 50 points in a 25-dimensional space is nearly unity. Direct tests in our laboratory are in agreement with the theory of binary pattern classifiers (BPC's) and show that one has not eliminated the possibility of meaningless training until m is 2.5 or 3 times as large as n. Thus, if one finds a separating linear decision surface for 75 points in a 25-space, then the probability is overwhelming that the separation is meaningful, and it is not a mathematical artifact.
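The dependence on the ratio of m to n can be explored numerically. The chapter cites Nilsson and Tou and Gonzalez but does not reproduce a formula, so the sketch below rests on an assumption: it uses the standard function-counting result for m points in general position, where d is the number of adjustable weights (n descriptors plus the constant component, so d = n + 1). The function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def separable_fraction(m, d):
    """Fraction of the 2**m two-class labelings of m points in general
    position that a hyperplane with d adjustable weights can realize."""
    realizable = 2 * sum(comb(m - 1, k) for k in range(d))
    return Fraction(realizable, 2 ** m)


n = 25                     # descriptors per point
d = n + 1                  # weights, counting the constant component
at_capacity = separable_fraction(2 * d, d)
at_three_n = separable_fraction(3 * n, d)
```

Under this assumption the fraction is 1 whenever m does not exceed d, is exactly 1/2 at m = 2(n + 1), and falls off quickly beyond that point, which is in line with requiring m to be 2.5 to 3 times n before a separation is taken as meaningful.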
If the clusters are dense and are far apart from each other, and if the dimensionality of the space is sufficiently low, then
display or mapping techniques can be used. This is done by performing a one-to-one mapping of pattern points from the original n-dimensional space to a 2- or 3-dimensional space with as little distortion as possible. If these techniques can be successfully employed, then one can observe the clusters directly on a 2- or 3-dimensional plot.

An alternative way to investigate the structure of the set of points is to separate the clusters from one another by decision surfaces. The simplest decision surface is a hyperplane. Two clusters of points which can be completely separated by a hyperplane are said to be linearly separable. Any hyperplane has associated with it a normal vector, called here the weight vector. The weight vector consists of an ordered sequence of components, $W = (w_1, w_2, \ldots, w_n)$, which stands in one to one correspondence with the components of the patterns to be classified. Specification of the components of the weight vector is completely equivalent to specification of the position of the hyperplane decision surface. Any pattern point in the hyperspace [can be classified with] respect to a hyperplane decision surface by taking the dot product between that pattern vector and the normal vector, or weight vector:

$$s = W \cdot X = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = |W|\,|X| \cos\theta$$
in which θ is the angle between the two vectors. Since |W| and |X| are always positive, the value of θ determines the sign of the dot product. For patterns on one side of the plane the dot product is always positive, and for patterns on the opposite side the dot product is always negative. The dot product is normally computed from the summation of pairwise products of the components of the two vectors for convenience. The correspondence between category 1 and category 2 and the two sides of the hyperplane is arbitrary.

The logical operation described above is performed by a threshold logic unit or TLU. The TLU accepts the pattern vector to be classified, calculates the dot product between the pattern vector and the weight vector, compares the dot product against zero, and classifies the pattern according to the sign of the dot product.

Discriminant Function Development. Given the system discussed above for performing classifications, the outstanding problem in the development of useful pattern classifiers becomes that of finding useful decision surfaces. This can be done, for the nonparametric systems of interest, by a method called training. A training set of patterns whose correct classifications are known is used to develop an effective decision surface. The members of the training set of objects are presented to the TLU being trained one at a time. The weight vector being trained is initialized arbitrarily. When an incorrect classification is made, the weight vector is altered. The alteration is performed in such a way as to ensure that the new weight vector will correctly classify the pattern. This process continues until all the patterns of the training set are correctly classified. If the procedure does not find a weight vector capable of correctly classifying all the members of the training set, then the routine is terminated in order to conserve computer time.

Learning Machine Attributes. The capabilities and performance of learning machines can be described in terms of three principal attributes: recognition, convergence rate, and prediction.

Recognition is the ability of the trained binary pattern classifier to correctly classify the members of its training set. Recognition is 100% for a binary pattern classifier whose decision surface is in the region between two separated clusters. That is, after training is complete for such a case, the TLU can correctly categorize any of the member[...]

Convergence rate refer[...] 100% recognition. Since computer time is an expensive commodity, it is of interest to minimize training time. The training procedures used to find useful TLU's are commonly altered so as to force rapid learning.

Prediction refers to the ability of the TLU to correctly classify unknowns which were not members of the training set.
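The TLU and the error-correction training described above can be sketched as follows. The text does not give the update formula, so the rule used here is an assumption: on each misclassification the weight vector is moved along the pattern vector by just enough that the new dot product has the correct sign and the same magnitude as before. The function names, the initialization, the pass limit, and the four example patterns are all hypothetical.

```python
def dot(w, x):
    """Sum of pairwise products of the components of two vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))


def tlu_classify(w, x):
    """Threshold logic unit: compare the dot product against zero."""
    return 1 if dot(w, x) > 0 else -1


def train_tlu(patterns, labels, max_passes=100):
    """Error-correction training. Returns (weights, converged)."""
    n = len(patterns[0])
    w = [0.0] * (n - 1) + [1.0]              # arbitrary initialization
    for _ in range(max_passes):
        errors = 0
        for x, y in zip(patterns, labels):
            s = dot(w, x)
            if tlu_classify(w, x) != y:
                errors += 1
                # Assumed update: after it, dot(w, x) equals `target`,
                # which has the sign of the correct class.
                target = y * (abs(s) if s != 0 else 1.0)
                c = (target - s) / dot(x, x)
                w = [wi + c * xi for wi, xi in zip(w, x)]
        if errors == 0:
            return w, True
    return w, False                          # stop to conserve computer time


# Hypothetical two-descriptor patterns with the constant 1.0 appended.
patterns = [[2.0, 1.0, 1.0], [3.0, 2.0, 1.0], [-2.0, -1.0, 1.0], [-1.0, -3.0, 1.0]]
labels = [1, 1, -1, -1]
weights, converged = train_tlu(patterns, labels)
recognition = 100.0 * sum(
    tlu_classify(weights, x) == y for x, y in zip(patterns, labels)
) / len(patterns)
```

Because every pattern carries the constant component, the division by dot(x, x) is safe here; a pattern consisting entirely of zeros would need separate handling.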
Prediction is the most interesting and potentially useful of the attributes because high predictive ability demonstrates that the TLU has been able to learn something about how to discriminate between the two classes being trained for, and the ability to correctly classify unknown spectra into useful chemical categories is one drive behind all automation of chemical data interpretation. Predictive ability is normally tested by splitting the available data set into two parts: a training set and a prediction set. After training is complete, and without further adjustment of the weight vector, the members of the prediction set are classified and the percentage correct is taken as the predictive ability. Another approach, known as the leave-one-out procedure, involves training a BPC using a training set containing all the data on hand except one member, and then predicting the class of the one unknown after training is complete. When averaged over a number of independent trials, the percentage of unknowns correctly classified is a measure of the predictive ability.

Feedback Feature Selection. After a series of weight vectors have been trained for the same question, they can be used to perform feedback feature selection. One method that has been used for a number of problems is weight-sign feature selection.
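The leave-one-out estimate of predictive ability described above can be sketched generically, with the trainer and classifier passed in as functions. To keep the example short and easy to check, the classifier used in the demonstration is a hypothetical stand-in (a threshold on a single descriptor, placed midway between the two class means), not a TLU; all names and data values are illustrative.

```python
def leave_one_out(patterns, labels, train, classify):
    """Percentage of held-out patterns that are classified correctly."""
    correct = 0
    for i in range(len(patterns)):
        # Train on everything except pattern i, then predict pattern i.
        train_x = patterns[:i] + patterns[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        if classify(model, patterns[i]) == labels[i]:
            correct += 1
    return 100.0 * correct / len(patterns)


def train_threshold(patterns, labels):
    """Stand-in trainer: midpoint between the class means of descriptor 0."""
    pos = [x[0] for x, y in zip(patterns, labels) if y == 1]
    neg = [x[0] for x, y in zip(patterns, labels) if y == -1]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


def classify_threshold(threshold, x):
    """Stand-in classifier: class +1 lies above the threshold."""
    return 1 if x[0] > threshold else -1


patterns = [[1.0], [2.0], [3.0], [7.0], [8.0], [9.0]]
labels = [-1, -1, -1, 1, 1, 1]
predictive_ability = leave_one_out(patterns, labels, train_threshold, classify_threshold)
```

Any trainable binary classifier with the same two-function interface can be substituted for the stand-in.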
Implementation of this method takes advantage of the fact that the exact orientation of a trained weight vector (that is, the relative magnitudes of its components) depends on the initialization used prior to training, the magnitude of the nth component of the pattern
vectors, $x_n$, the feedback training methods employed, the sequence in which the members of the training set are presented to the classifier during training, and several other factors. In other words, the exact orientation of a trained weight vector depends on the details of how it was found. In weight-sign feature selection a pair of weight vectors is developed for the same question but using slightly different approaches, e.g., different initializations. Then the algebraic signs of their components are compared pairwise. When the components of the two weight vectors that both correspond to a particular descriptor disagree in sign, that descriptor is discarded; when the signs agree, the descriptor is retained. The procedure is repeated iteratively until two weight vectors are trained that are in complete agreement for all descriptors that are most useful for a particular classification.

More recently, a new feedback feature selection procedure much superior to the weight-sign method has been developed. The variance feature selection metho[...] the orientation of a traine[...] was developed. Here, a group of weight vectors is trained for a classification problem in a manner designed to exploit these dependencies. The series of weight vectors is then used to rank the descriptors that were most useful in separating the two classes under investigation.
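One pass of the weight-sign comparison described above can be sketched as follows. The function name and the two example weight vectors are hypothetical, and treating a zero component as a disagreement is an assumption, since the text does not say how zeros are handled.

```python
def weight_sign_select(w_a, w_b):
    """Indices of descriptors whose two weight components agree in sign."""
    retained = []
    for j, (a, b) in enumerate(zip(w_a, w_b)):
        if a * b > 0:          # same nonzero sign in both weight vectors
            retained.append(j)
    return retained


# Two hypothetical weight vectors trained for the same question
# from different initializations.
w_first = [0.8, -0.3, 0.5, -1.1]
w_second = [0.6, 0.2, 0.9, -0.4]
kept = weight_sign_select(w_first, w_second)
```

In the full procedure the retained descriptors would be used to retrain a new pair of weight vectors, and the comparison repeated until no further descriptors are discarded.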
The ranking is done by developing an ordered list of the descriptors based on the relative variation of the corresponding weight vector components among the series of trained weight vectors. Then the descriptors outside the intrinsic set (the minimal set of descriptors sufficient to effect separation) can be discarded. The variance feature selection method has been applied to a wide variety of problems in our laboratory.
Chemical Applications of Pattern Recognition. Application studies of chemical problems using pattern recognition techniques have been reported in a number of areas (8-14). These are listed in subsets because each general area requires somewhat different approaches and techniques.
(1) Spectral Data Analysis. Elucidation of chemical structure information from spectroscopic data is the area that has received the most attention from those practicing pattern recognition. Studies have been done with mass spectra, infrared spectra, stationary electrode polarograms, gamma-ray spectra, and proton and 13C nuclear magnetic resonance spectra.
(2) Materials Science. The classification of materials as to origin or suitability with respect to production specifications has been reported. The data used are generally multi-source data coming from a variety of analytical techniques.
(3) Classification of Complex Mixtures. The identification of petroleum samples by applying pattern recognition techniques to analytical data has been reported.
Data used for classification in different studies have included gas chromatograms, infrared spectra, fluorescence spectra, and trace metal concentrations. A
second example of data analysis of complex mixtures is from the biological [...] mixtures, e.g., serum, are feasible and have been reported.
(4) Modeling of Chemical Experiments. Pattern recognition techniques have been used to model complex chemical systems where the details of the chemical and/or physical interactions were not completely understood, e.g., relative retention of compounds on different chromatographic liquid phases.
(5) Prediction of Properties from Molecular Structure. A number of studies of the application of pattern recognition to the problem of searching for correlations between molecular structure and biological activity have been reported.
Applications of Pattern Recognition to Structure-Activity Relations
Applications to Structure-Activity Relations. Within the last few years reports have begu[n] [...] analysis and pattern recognitio[n] [...] activity relation studies. A paper by Hansch, Unger, and Forsythe (15) discussed the application of hierarchical cluster analysis techniques to the problem of selection of substituents. The data used to represent each drug were the lipophilic π constant, electronic parameters, and the approximate steric molar refractivity and molecular weight constants, all of them physicochemical parameters. A paper by Hill et al. (16) discussed the problem of drug design as approached by using a three-layer perceptron network. Forty-six 1,3-dioxane molecules were used as the data set for training and prediction of perceptrons to determine anticonvulsant activity.
Predictive abilities in the range of 68 to 76 percent were reported. A paper by Ting et al. (17) reported correlations between the low resolution mass spectra of sixty-six drugs and their pharmacological activity as sedatives or tranquilizers. This paper was criticized with regard to the set of drugs used in the analysis (18) and with regard to the number of drugs used and their relative similarities (19).
Several papers (20-22) have recently appeared reporting studies in which molecules were represented by a list of structural features of the molecules. Adamson and Bush (20) used library searching programs to generate all structural fragments in their data set and represented the drugs by lists of the number of occurrences of each substructure in the molecules. Chu (21) used a number of pattern recognition and cluster analysis programs to analyze a set of sixty-six drugs represented by forty-six fragments. Kowalski and Bender (22) used three pattern recognition classifiers to attempt to classify 200 drugs with respect to activity for the Adenocarcinoma 755 Biological Activity Test. Their paper has been criticized for the choice of the twenty descriptors used (23). Chu et al. (24) reported on the application of pattern recognition and substructural analysis to the problem of investigating the antineoplastic activity of a set of drugs in the
experimental mouse brain tumor system. The set of molecules was represented by augmented atom fragments, "heteropath" fragments, and ring fragments. Nearest neighbor and learning machine methods of classification were employed, and it was concluded that these methods could be successfully applied to the problem. Craig and Waite (25) have reported the application of pattern recognition techniques to the prediction of toxicity of organic compounds.
Structure-Activity Studies Using Pattern Recognition
In order to apply pattern recognition techniques to studies of molecular structure-biological activity correlations the data must be taken through a number of individual steps. These are listed in order to show how interrelated the steps become.
(a) Identify data set.
(b) Enter molecular structures. A complete description of the structure [...] file.
(c) Generate usable file. A subset of compounds must be selected from the master structure file. This may involve searching of keys for the structures, and will require carrying along an identifying label for each structure.
(d) Descriptor development. The molecular structures stored in a general purpose form (e.g., connection tables) must be decomposed into sets of descriptors. The three general classes are topological, geometrical, and externally generated descriptors.
(e) Form data matrix. The subset of the available descriptors to be used is identified, and a matrix of data is generated. It may be partitioned into a training set and a prediction set.
(f) Prior feature selection. Techniques can be applied to determine which descriptors are expected to be most important.
(g) Discriminant development. The data set is used to develop a discriminant function. After development, the discriminant function can be tested on unknowns to assess predictive ability.
(h) Feedback feature selection. The results of classification can be used to identify the most useful descriptors.
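Steps (e) and (g) can be illustrated with a small sketch. It is not the ADAPT code: the function names, the two-descriptor data set, and the every-fourth-compound hold-out rule are all hypothetical, and an error-correction (learning machine) rule is assumed as the way the linear discriminant is trained.

```python
# Hypothetical illustration of steps (e) and (g); not the ADAPT routines.

def partition(rows, labels, every=4):
    """Step (e): hold out every `every`-th compound as the prediction set.

    A constant 1.0 is appended to each row so that the threshold is
    learned as one extra weight component.
    """
    train, predict = [], []
    for i, (row, label) in enumerate(zip(rows, labels)):
        pattern = (list(row) + [1.0], label)
        (predict if i % every == every - 1 else train).append(pattern)
    return train, predict


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_discriminant(train, max_passes=100):
    """Step (g): error-correction training of a linear discriminant.

    Whenever a pattern is misclassified, the weight vector is moved
    toward the correct side by adding label * pattern.
    """
    w = [0.0] * len(train[0][0])
    for _ in range(max_passes):
        errors = 0
        for x, label in train:
            if label * dot(w, x) <= 0:
                w = [wi + label * xi for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:          # every training pattern is classified
            break
    return w


def predictive_ability(w, predict):
    """Fraction of held-out compounds classified correctly."""
    correct = sum(1 for x, label in predict if label * dot(w, x) > 0)
    return correct / len(predict)


# Made-up data matrix: two descriptors per compound, classes -1 and +1.
rows = [(1, 1), (2, 1), (1, 2), (2, 2), (5, 5), (6, 5), (5, 6), (6, 6)]
labels = [-1, -1, -1, -1, 1, 1, 1, 1]

train, predict = partition(rows, labels)
w = train_discriminant(train)
ability = predictive_ability(w, predict)
```

Because the prediction set never takes part in training, `ability` estimates performance on unknowns, which is the purpose of the partition in step (e).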
One of the primary prerequisites for a useful general purpose pattern recognition system is a general, data-independent file management system. A general purpose system has been developed (26) that consists of a set of interactive computer routines known collectively as ADAPT (Automated Data Analysis using Pattern recognition Techniques). This system provides a generalized framework that takes into account the practical considerations inherent in the implementation of the pattern recognition framework shown in Figure 1.
ADAPT Architecture. Figure 1 does not make clear the inherent diversity of the data handling problem. Not only must measurements from the transducer(s) be input, but they must be stored and labelled. Each data point must be given a class designation and identification number. Class designations must be easily assigned or modified. This ease of definition and redefinition is of utmost importance in the overall data analysis. The source of the data is also important. Sources such as digitized spectra or complex molecular structures would have widely different storage requirements. Since the operations performed on one type of data may bear little similarity to the operations performed on other types of data, a system designed with a high degree of modularity is required.
To accommodate these requirements, the ADAPT system is implemented in independent segments. Each segment can execute independently, obtaining all necessary information either from a set of disc storage files or by interaction with the user. This mode of operation offers several advantages in core storage. The modularity decreases the complexity of the system and provides a means to incorporate additional algorithms into the system at any time. Thus the entire system is adapted to any user's individual requirements since only those overlays which are relevant to the particular problem at hand need be executed. In addition, these routines are relatively inexpensive to use because they do not require large scale facilities for execution.
Finally, the system is interactive in the sense that the user directs which manipulations are to be performed upon the data.
ADAPT thus consists of a framework within which an unlimited number of independent segments can be supported. Each segment performs a specific, independent operation ranging from initial input of data to final output of results. The general utility of the system arises from the fact that the user has a large number of options to choose from, and he can conveniently interact with his data set.
Interaction with ADAPT is provided via a Tektronix 4010 CRT terminal. Data are stored in a series of defined files on cartridge discs. This allows fast access and ease of manipulation. Currently, ADAPT consists of approximately 70 defined files which use 2.4 million bytes of storage (one cartridge disc). The ADAPT routine uses approximately 90,000 bytes of core storage for its largest overlay and is currently implemented using a sixteen-bit MODCOMP II/25 computer located in the Department of Chemistry at The Pennsylvania State University.
The segments of the ADAPT system can be broken down into the following list:
(1) File generator, including graphical input of structures
(2) Class maker
(3) Three-dimensional model builder
(4) Descriptor developer
(5) Collator
(6) Preprocessor
(7) Prior feature selector
(8) Discriminant developer
(9) Feedback feature selector
(1) File Generator. The library of drugs to be studied is entered through the file generator routine. Structures are input by drawing them in two dimensions on the screen of an interactive graphics terminal under the control of a general structural input routine, UDRAW, which has been fully described elsewhere (27). A molecule's structure, along with corresponding pharmacological data, is entered into a disc resident permanent file. Information saved for future use includes a compressed connection table, ring information, a list of reported activities, the two-dimensional coordinates of the atoms when entered (for possible redrawing of the structures later), and a name of the compound. [The file can be] altered by making changes to information stored for a drug, a drug can be entirely deleted from the file, or any file member can be displayed.
A selection of recallable molecular backbones can be stored for more convenient entry of series of structurally related compounds. These structures can then be made to appear upon the initial UDRAW sketch pad and a complete molecule can be built up starting from this backbone. This allows the user to input a series of structurally similar compounds without redrawing the base structure each time.
The routine that oversees structure input and file generation can maintain a file of 1000 structures and associated auxiliary information. The first structure file now stored in the system consists of approximately one thousand central nervous system agents taken from the literature (28).
Among the biological activity classes reported there are analgesics, anticonvulsants, depressants, hypnotics, relaxants, sedatives, stimulants, and tranquilizers; there are approximately forty classes altogether, many of which overlap. The second file of molecular structures currently resident on the ADAPT disc file consists of 184 5,5-disubstituted barbiturates taken from a reference volume (29). A study using this data set will be discussed in a later section. The third file contains approximately 500 compounds comprising an olfaction data set taken from Amoore (30). Molecules reported to have musk, camphor, mint, ether, floral, pungent, and putrid odors are present. This data set is being used in studies of the relation between molecular structure and odor quality. The fourth file consists of a set of molecules comprising an olfaction data set taken in a study of trigeminal detection of compounds. These compounds are being employed in a study of the similarities and differences observed in trigeminal as opposed to
olfactory detection of chemicals by humans.
(2) Class Maker. The class maker routine is used to access the library file and to create sets of library members that satisfy queries entered by the user. Thus, the set of all file members which have been reported to be sedatives can be formed into an active data set. This routine is used to generate classes of structures to be used as data sets for the development of discriminants by another section of ADAPT. When the property being sought is known quantitatively, the data set is assembled in increasing sequence. Then a series of discriminants can be trained for different threshold cutoffs between the active and inactive classes without moving any data but only by reallocating class memberships.
(3) Three-Dimensional Model Builder. The three-dimensional molecular model builder routine is used to derive information on the spatial conformation o[f] [...] collection of particles [held] together by simple elastic forces. These forces can be defined by potential energy functions expressed in terms of the atom coordinates of the molecule. This function can then be minimized to obtain a strain-free three-dimensional model of the molecule. Geometric parameters can then be extracted. A wealth of information already exists describing the procedures and results of several different molecular mechanics algorithms (31,32). Therefore, finding and implementing an algorithm to model sets of molecules is a relatively straightforward procedure.
A modified version of the molecular mechanics routine described by Wipke et al. (33-35) has been developed and interfaced to the ADAPT system so that geometric descriptors can be derived from the resulting molecular structure.
The molecular mechanics routine, MOLMEC, used in conjunction with the ADAPT system is highly interactive and relies on graphical input and output. A graphics unit is also supported and is utilized by MOLMEC for displaying the molecule being modelled.
The structure input section of MOLMEC has been designed to allow the user either to read the molecule's connection table from ADAPT's disc files or else to accept the structure from the CRT via UDRAW (27). Thus, MOLMEC can be used independently of the ADAPT system. Once the molecule has been entered, control branches to the interactive section where the user can direct the different phases of modelling as well as monitor the results.
In the strain minimization section, the atom coordinates are systematically altered until a minimum is found in the strain or potential energy function. The actual strain function used in MOLMEC is:
$$E_{\text{strain}} = E_{\text{bond}} + E_{\text{angle}} + E_{\text{torsion}} + E_{\text{non-bond}} + E_{\text{stereo}}$$
The first four terms of the function are commonly found in all
molecular mechanics strain functions and are modified Hooke's Law functions. The last term of the function has been added to assure the proper stereochemistry about an asymmetric atom.
The actual minimization of the function is best accomplished by some type of nonlinear programming method (e.g., steepest descent). In MOLMEC, an adaptive pattern search routine (36) is used because it does not require analytical derivatives. The amount of time necessary to obtain good molecular models depends upon the number of atoms in the molecule, the initial strain of the molecule, and the degrees of freedom in the structure. If a small molecule is being modelled, only one pass through the minimization section may be sufficient to obtain a good structure. However, this is seldom the case. Usually, the molecules are rather large and require several passes. The actual amount of time per pass is limited by a cutoff parameter so that the user may analyze the progress of the modelling at different intervals.
The graphics interaction section of MOLMEC contains routines capable of rotating an[d] [...] position. Since the graphic[s] [...], rotation is essential to obtain a good view of the structure. Furthermore, these routines are useful in locating atoms trapped in local minima. If such an atom is found, the user can move the trapped atom to a new position by a MOVE routine found in the graphics section. Naturally, if the structure is altered the molecule should be passed through the minimization routine at least once more.
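The idea of minimizing strain without analytical derivatives can be sketched as follows. This is a simplified compass-type pattern search, not the adaptive routine of reference 36 and not MOLMEC itself; the three-atom chain, the ideal bond length R0, the force constant K, and the restriction to the bond-stretching term alone are assumptions made only for illustration.

```python
import math

# Hypothetical three-atom chain in two dimensions: atoms 0-1 and 1-2 bonded.
BONDS = [(0, 1), (1, 2)]
R0 = 1.5   # assumed ideal bond length
K = 1.0    # assumed Hooke's Law force constant


def strain(coords):
    """Bond-stretching term only: sum of K * (r - R0)^2 over all bonds.

    `coords` is a flat list [x0, y0, x1, y1, x2, y2].
    """
    energy = 0.0
    for a, b in BONDS:
        dx = coords[2 * a] - coords[2 * b]
        dy = coords[2 * a + 1] - coords[2 * b + 1]
        energy += K * (math.hypot(dx, dy) - R0) ** 2
    return energy


def pattern_search(f, x0, step=0.5, tol=1e-6, max_iter=10000):
    """Derivative-free minimization by exploratory coordinate moves.

    Each coordinate is tried at +step and -step; a move is kept only if
    it lowers f. When a full sweep brings no improvement the step is
    halved, and the search stops once the step falls below `tol`.
    """
    x = list(x0)
    fx = f(x)
    iterations = 0
    while step > tol and iterations < max_iter:
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                trial = x[:]
                trial[i] += delta
                f_trial = f(trial)
                if f_trial < fx:
                    x, fx = trial, f_trial
                    improved = True
                    break
        if not improved:
            step *= 0.5
        iterations += 1
    return x, fx


start = [0.0, 0.0, 1.0, 0.0, 2.5, 0.5]   # deliberately strained geometry
best, energy = pattern_search(strain, start)
```

Only function values are compared, so further terms (angle, torsion, non-bonded, stereo) could be added to `strain` without touching the search itself.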
When the molecule is finally in a low strain energy conformation, the molecular parameters can be listed on an output device, or else the structure's coordinate matrix can be stored on a disc file for further processing.
An automatic version of MOLMEC has also been developed so that large molecular data sets can be modelled without continuous supervision. The program consists of an input section, which reads the molecule's connection table and present coordinate matrix from the ADAPT files, a minimization section with all output suppressed, and a section which stores the final coordinate matrix. Good models can easily be obtained in this manner. However, before the coordinate matrices can be used for calculating descriptors, the structures are reviewed to make sure that the molecules are in acceptable conformations.
Once modelling is complete, geometric descriptors can be derived. Descriptors currently being used include the absolute or relative magnitudes of the principal moments of inertia of the molecule, the presence or absence of particular spatial arrangements of atoms which have been called pharmacophores, and the molecular volume.
(4) Descriptor Developer. The next step in studies of structure-activity relations is the development of descriptors for the molecules contained in the active data set. This subject has been
discussed in a recent publication (37). Descriptors belong to two general classes: topological and geometrical. Topological descriptors are derived from the topological representation of a compound, the connection table. Geometrical descriptors are derived from the three-dimensional model of the molecule. The individual descriptors that have been used in reported studies are described in the following paragraphs.
(a) Atom and bond descriptors (fragment descriptors). Atom descriptors include the number of C, N, O, S, P, F, Cl, Br, and I atoms in the structure. Numbers of bonds of each type are also generated. Both atom and bond descriptors are developed directly from the stored connection table.
(b) Substructure Descriptors. Searching the molecule for the presence of larger fragments provides an alternative method for generating descriptors. If the substructure is found in the molecule, the descriptor can be given a value of one. Otherwise, it has a value of zero. Therefore, [...] [descrip]tors for a given molecula[r] [...] substructure searching algorithm and a library of appropriate substructures.
Algorithms for substructure searching fall into two general categories. The first, atom-by-atom searching, is the easiest to implement on a digital computer because it simply matches the structure and substructure atoms and associated bonds one at a time using all possible combinations. However, for large structures and substructures the time required for a single search becomes prohibitive because the number of possible combinations increases factorially.
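The atom-by-atom approach, and the reason its cost grows factorially, can be seen in a minimal sketch. The connection-table layout used here (a list of element symbols plus a dictionary of bond orders), the function name, and the example molecule are assumptions for illustration and do not reproduce any ADAPT routine.

```python
import math
from itertools import permutations


def has_substructure(mol_atoms, mol_bonds, sub_atoms, sub_bonds):
    """Return 1 if the substructure occurs in the molecule, else 0.

    mol_atoms / sub_atoms: lists of element symbols.
    mol_bonds: {frozenset({i, j}): bond_order} for the molecule.
    sub_bonds: {(a, b): bond_order} for the substructure.
    Every ordered assignment of molecule atoms to substructure atoms
    is tried in turn (exhaustive atom-by-atom matching).
    """
    for mapping in permutations(range(len(mol_atoms)), len(sub_atoms)):
        # Atom types must agree position by position.
        if any(mol_atoms[m] != sub_atoms[s] for s, m in enumerate(mapping)):
            continue
        # Every substructure bond must exist with the same order.
        if all(mol_bonds.get(frozenset((mapping[a], mapping[b]))) == order
               for (a, b), order in sub_bonds.items()):
            return 1
    return 0


# Hypothetical heavy-atom skeleton C-C(=O)-O-C (hydrogens omitted).
atoms = ["C", "C", "O", "O", "C"]
bonds = {
    frozenset((0, 1)): 1,
    frozenset((1, 2)): 2,
    frozenset((1, 3)): 1,
    frozenset((3, 4)): 1,
}

carbonyl = has_substructure(atoms, bonds, ["C", "O"], {(0, 1): 2})
alkene = has_substructure(atoms, bonds, ["C", "C"], {(0, 1): 2})

# Number of candidate assignments examined in the worst case: n! / (n - k)!
n_mappings = math.perm(len(atoms), 2)
```

For a molecule of n atoms and a substructure of k atoms the worst case examines n!/(n-k)! assignments, which is what makes the method prohibitive for large structures.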
The second category utilizes set reduction techniques to accomplish the substructure search, and factorial calculations are not involved. Although they are more complex than atom-by-atom searching techniques, algorithms implementing set reduction are very attractive because of their searching speed. Several different algorithms have been described which use set reduction (38-40).
In the ADAPT system, a variation of the techniques described by Sussenguth (38) is used for generating substructure descriptors. The modifications allow for greater substructure specificity, a wider variety of substructure types, and numeric instead of binary searches. A discussion of the changes made in Sussenguth's algorithm has been reported (41) and will not be detailed here.
The problem of creating a substructure library is not as easy to solve as obtaining a good substructure searching algorithm. One approach to this problem involves the systematic combining of the basic atom and bond fragments into substructures. However, the final number of substructures generated in this manner would be totally unmanageable. The discrimination between usable and useless substructures would require some type of pattern recognition system, and this approach is not feasible. A more workable approach to the problem is to study the data set of molecules under investigation and allow the chemist to decide on a collection of
substructures to be applied to the data set. The ADAPT system utilizes this second method to generate a substructure library.
A set of substructure descriptors can now be generated. Two types of searches are possible. For a general search, a match is made if the indicated substructure is located anywhere in the molecule; all ring information is ignored. However, during a specific search, ring information is taken into consideration. Therefore, if the substructure is not specified to be in a ring, it cannot possibly be matched to a molecular fragment that is contained in a ring system.
The actual information contained in any one substructural descriptor depends highly upon the judgement of the person selecting the substructure library. In some applications, good descriptors can be obtained immediately because sufficient a priori knowledge exists. However, in other cases, a trial-and-error procedure may be warranted where a large number of possible substructures are generated and poor descriptors are eliminated by some prescreening criterion. In general, [...] [im]portant purpose in that [...] information lost in the atom and bond fragmentation. Nevertheless, considerable structural information is still missing.
(c) Environment Descriptors. The description of structures using fragment and substructure descriptors indicates the components of a molecule. However, the manner in which these individual parts are connected is not described.
Environment descriptors take into account how different areas of a molecule fit together and provide a measure of the "environment" in which a single atom fragment finds itself. The environment descriptor describes the fragment's surroundings by including its first and second nearest neighbors and their bonds in a single parameter which reflects the atom and bond types connected to it. There may be more than one identical fragment in a molecule, but they do not necessarily belong to the same functional group. For example, the fragment -C(=O)- is found once in both structures A and B below, but twice in structure C:
[Structural formulas (A), (B), and (C): diagrams not recoverable from the source.]
Obviously, the environment seen by this fragment would be different in each of the three cases. Of course, this difference is dependent upon the definition incorporated to calculate the environment descriptor. In the ADAPT system, the three forms most often used are: bond environment descriptors (BED), weighted environment descriptors (WED), and augmented environment descriptors (AED).
The procedure used to calculate these three parameters for a particular environment fragment is as follows:
(1) Assign arbitrary values to each type of atom and bond. The values already employed in the connection table will suffice.
(2) For "BED", sum the number of bonds connected to the fragment's first and second nearest neighbors.
(3) For "WED", sum the values assigned to each bond type instead of merely counting the bonds.
(4) For "AED", sum the product of the bond's assigned value and the assigned values for the two atoms which form the bond.
The BED, WED, and AED values for the fragment and structures given above are as follows: for structure A, BED = 5, WED = 6, AED = 11; for structure B, BED = 5, WED = 6, AED = 6; for structure C, BED = 12, WED = 15, AED = 17. Since there may be [...] [en]vironment descriptor[s] indicate [...] a given fragment. This feature makes them useful when used in conjunction with substructure descriptors. The substructure descriptors indicate the number of times a particular fragment is found in the molecule and the environment descriptors indicate the context in which the fragment is found.
The routine that generates the environment descriptors must have access to the file of molecular structures and to the atom centered fragment library which is constructed by the user. The actual calculation of the environment descriptors proceeds extremely rapidly since both the fragment location and necessary calculations are easily done by a computer.
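Steps (1) through (4) can be sketched in a few lines. The sketch rests on assumptions that the text does not settle: the environment is taken to be every bond that touches the central atom or one of its first nearest neighbors (so that bonds reaching the second nearest neighbors are included), hydrogens are omitted, and the atom and bond values are arbitrary. The function name and data layout are hypothetical, and the numbers produced are illustrative only; they are not the published values for structures A, B, and C.

```python
def environment_descriptors(center, atoms, bonds, atom_value, bond_value):
    """Return (BED, WED, AED) for the atom-centered fragment at `center`.

    atoms: list of element symbols.
    bonds: {frozenset({i, j}): bond_order}.
    atom_value / bond_value: the arbitrary values of step (1).
    """
    # First nearest neighbors of the central atom.
    first = {j for pair in bonds if center in pair for j in pair if j != center}
    shell = first | {center}
    # Assumed environment: bonds with at least one end in the shell.
    env = [pair for pair in bonds if pair & shell]

    bed = len(env)                                    # step (2): count bonds
    wed = sum(bond_value[bonds[p]] for p in env)      # step (3): sum bond values
    aed = 0                                           # step (4): bond * atom * atom
    for p in env:
        i, j = tuple(p)
        aed += bond_value[bonds[p]] * atom_value[atoms[i]] * atom_value[atoms[j]]
    return bed, wed, aed


# Hypothetical heavy-atom skeleton C-C(=O)-O-C with made-up values.
atoms = ["C", "C", "O", "O", "C"]
bonds = {
    frozenset((0, 1)): 1,
    frozenset((1, 2)): 2,
    frozenset((1, 3)): 1,
    frozenset((3, 4)): 1,
}
atom_value = {"C": 1, "O": 2}
bond_value = {1: 1, 2: 2}

bed, wed, aed = environment_descriptors(1, atoms, bonds, atom_value, bond_value)
```

Replacing `atom_value` and `bond_value` with physical quantities changes the descriptor without changing the procedure, since only the assigned values of step (1) differ.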
The concept of the environment is not limited to connectivities, but could take into account electron densities, bond distances, electronegativities, or other physical parameters. This can be done by replacing the values assigned in step one by the desired parameters. In this manner, more informative descriptors may be obtained. Use of the environment descriptors may reveal relations which are not particularly obvious. Note that both structures A and B have the same BED and WED values. These structures, which at first glance appear quite different, do indeed have these parameters in common. However, when one takes into account the type of atoms connected to these bonds the difference becomes apparent. Such relationships may or may not prove significant. Their ultimate utility depends on the type of environment measure, the molecule being coded, and the problem being attacked.
(d) Geometric Descriptors. Geometric descriptors are derived from the three-dimensional configuration as generated by MOLMEC. Presently, two basic types of geometric descriptors are calculated from the molecular structures. The three principal axes of the molecule form the basis for the first type of geometric descriptor. Since the orientation of the original molecule in space is
essentially random, the radii must be sorted in some manner. This is done by arbitrarily assigning X to the longest radius, Y to the second longest radius, and Z to the shortest radius. Once sorted, the three ratios, X/Y, X/Z and Y/Z, are also calculated. Due to their small values, all of the radii are multiplied by some constant scaling factor to prevent loss of information during truncation. These six geometric parameters are then used as new descriptors and constitute the first set of geometric descriptors.
The van der Waals volume of a molecule is the other type of geometric descriptor generated in the ADAPT system. Before this calculation can be done, the bond distances and the van der Waals radii of the atoms must be known. The bond distances are easily obtained from the molecular modelling results. For the van der Waals radii, an article published by A. Bondi (42) was consulted. The volume occupied by an atom is taken as that of a sphere with radius equal to the van der Waals radius of the atom minus the volume of overlap with adjacent [...] calculated from standard [...] volume is not found for two reasons: the assumption of sphere and spherical segments is not totally correct, and the radii used were selected as being the "best" values from a large collection of data using an empirical selection method. The total molecular volume for the molecule is taken as the sum of the contributions for each atom found as described above. The volume contributions of attached hydrogens are also included in the calculation of the total volume.
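Both kinds of geometric descriptor can be illustrated with a short Python sketch. It is a simplified reading of the text, not the ADAPT routine: the function names, the default scaling factor, and the treatment of each bonded neighbor as removing one independent spherical cap (ignoring regions shared by three or more spheres) are assumptions made for illustration.

```python
import math


def shape_descriptors(radii, scale=100.0):
    """Six shape descriptors: scaled X, Y, Z radii and the ratios X/Y, X/Z, Y/Z.

    `scale` stands in for the unspecified constant scaling factor.
    """
    x, y, z = sorted(radii, reverse=True)  # X longest, Z shortest
    return [scale * x, scale * y, scale * z, x / y, x / z, y / z]


def cap_volume(radius, height):
    """Volume of a spherical cap of the given height cut from a sphere."""
    return math.pi * height ** 2 * (3.0 * radius - height) / 3.0


def atom_volume(radius, bonded):
    """Sphere volume of one atom minus the caps cut off by bonded neighbors.

    bonded : list of (neighbor_radius, bond_length) pairs.
    """
    volume = 4.0 / 3.0 * math.pi * radius ** 3
    for neighbor_radius, bond_length in bonded:
        # Distance from this atom's center to the plane where the spheres meet.
        plane = (bond_length ** 2 + radius ** 2 - neighbor_radius ** 2) / (2.0 * bond_length)
        height = radius - plane
        if 0.0 < height < 2.0 * radius:  # skip non-overlapping or engulfed spheres
            volume -= cap_volume(radius, height)
    return volume


print(shape_descriptors((1.0, 4.0, 2.0), scale=10.0))
print(atom_volume(1.0, [(1.0, 1.0)]))
```

Summing `atom_volume` over all atoms, hydrogens included, gives a total molecular volume in the cubed units of the radii.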
In order to make the routine more versatile, the option of either using standard bond distances or modelled bond distances is included. Since MOLMEC uses the standard bond distances to determine a low strain geometry, it is not surprising that for a well modelled data set, the molecular volumes calculated using the two different bond distances are very similar. However, discrepancies can arise when the molecule contains rings of five or fewer atoms which cause a large amount of bond strain. The volumes are initially calculated in units of cubic Angstroms per atom but are then converted to units of cc per mole. The molecular volume can then be used as another geometric descriptor.
Each geometric descriptor contains some information about the molecule. The radii and ratios describe the general shape of the molecule which may be very important in systems where receptor sites are involved. However, this is only a relative shape since the model obtained is for the molecule in a vacuum: in some environments, the molecule's shape will change, especially if long chains are present. On the other hand, the molecular volume is essentially constant regardless of how the molecule is bent. However, like any other descriptor, the actual value of any geometric descriptor depends upon the specific application in which it is used.
(5) Collator. The collator routine is used to select which of the
available descriptors will be included in the data set to be passed to other parts of ADAPT. The experimenter has complete flexibility in deciding which data set or subset to use and how to structure problems when they are to be passed to the prior feature selection algorithms or the discriminant development algorithms. This routine is used to select first one subset of the available descriptors to be used for discriminant development, and then on subsequent trials other subsets of descriptors. Thus, overall performance of the system can be evaluated with respect to which descriptors are being included in the analysis.
(6) Preprocessor. The preprocessor routine accepts the raw descriptors developed by the descriptor development routines and performs the desired preprocessing necessary for further processing. One example of such preprocessing is autoscaling, where each descriptor over a data set is altered so that the mean is zero and the standard deviation [...] this procedure standardizing [...]
(7) Prior Feature Selection. After a set of drugs has been formed into a labelled data set ready for presentation to the discriminant developer, it is desirable to submit it to feature selection if possible. One method for selecting the descriptors expected to be most useful has been the use of the well-known Fisher ratio (e.g., 21).
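Autoscaling and the Fisher ratio are both one-line calculations, sketched below in Python. The sketch assumes population (divide-by-N) statistics and a commonly quoted two-class form of the Fisher ratio, (m1 - m2)^2 / (v1 + v2); the text does not spell out either choice, and the function names and the `target_sd` parameter are hypothetical.

```python
import statistics


def autoscale(values, target_sd=1.0):
    """Shift one descriptor to zero mean and rescale it to `target_sd`.

    Assumes the descriptor is not constant (its standard deviation is nonzero).
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation (assumption)
    return [target_sd * (v - mean) / sd for v in values]


def fisher_ratio(class_a, class_b):
    """Squared difference of class means over the sum of class variances."""
    mean_a, mean_b = statistics.fmean(class_a), statistics.fmean(class_b)
    var_a, var_b = statistics.pvariance(class_a), statistics.pvariance(class_b)
    return (mean_a - mean_b) ** 2 / (var_a + var_b)


scaled = autoscale([1.0, 2.0, 3.0, 4.0, 5.0])
print(scaled)
print(fisher_ratio([1.0, 2.0, 3.0], [7.0, 8.0, 9.0]))
```

Ranking descriptors by `fisher_ratio` scores each descriptor in isolation, so it carries exactly the one-at-a-time limitation discussed in this section.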
A number of other statistically based methods suggest themselves, but they mostly require making the assumption that the best, i.e., most separating, descriptors identified one at a time will also be the best set of descriptors. This assumption is rarely valid. In the studies performed to date, we have usually tried to select subsets of descriptors in as wise a manner as we could devise; we have relied on being able to investigate a large enough number of subsets of descriptors to feel reasonably confident that we have found good descriptor sets.
Feature selection is performed as an integral part of stepwise discriminant analysis such as that implemented in the BMD (43) package as BMD07M. This will be discussed later in the section on discriminant development and feedback feature selection.
(8) Discriminant Developer. The discriminant developer accepts the set of data generated by the previous sections of ADAPT and attempts to develop discriminant functions capable of correctly classifying the data. The development of such discriminants can proceed through the use of (a) error correction feedback learning machines, (b) iterative least squares development of linear discriminant functions, (c) other parametric and nonparametric routines. The error correction feedback training method has been used in the studies on barbiturates to be described in the following section of this article.
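For readers unfamiliar with error correction feedback training, a minimal Python sketch of a linear learning machine follows. It uses the simple fixed-increment correction rule as an assumption; the correction rule actually used in ADAPT is not specified here, and the function names, the `extra` augmenting component, and the toy data are hypothetical.

```python
def train_learning_machine(patterns, labels, extra=1.0, max_passes=1000):
    """Train a linear discriminant by error correction feedback.

    patterns : list of descriptor vectors
    labels   : list of +1 / -1 class labels
    extra    : value of the additional component appended to every pattern
    """
    augmented = [list(p) + [extra] for p in patterns]
    weights = [0.0] * len(augmented[0])
    for _ in range(max_passes):
        errors = 0
        for x, y in zip(augmented, labels):
            response = sum(w * v for w, v in zip(weights, x))
            if y * response <= 0:  # misclassified (or on the decision surface)
                # Feedback: move the weight vector toward the correct side.
                weights = [w + y * v for w, v in zip(weights, x)]
                errors += 1
        if errors == 0:  # a full pass with no corrections: training is complete
            return weights
    raise RuntimeError("no separating weight vector found within max_passes")


def classify(weights, pattern, extra=1.0):
    """Return +1 or -1 according to the sign of the discriminant."""
    response = sum(w * v for w, v in zip(weights, list(pattern) + [extra]))
    return 1 if response > 0 else -1


# Hypothetical, linearly separable toy data with two descriptors.
patterns = [(2, 2), (3, 1), (2, 3), (-2, -1), (-1, -3), (-3, -2)]
labels = [1, 1, 1, -1, -1, -1]
weights = train_learning_machine(patterns, labels)
print(weights, [classify(weights, p) for p in patterns])
```

Training terminates only when the training set is linearly separable, which is why the sketch raises an error after `max_passes` instead of returning a partially trained vector.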
The iterative least squares development method was developed several years ago in this laboratory (44) and has been
interfaced into ADAPT.
(9) Feedback Feature Selection. In many chemical applications of pattern recognition a set of data is coded using more descriptors than are necessary to correctly classify the members. However, the necessary and unnecessary descriptors cannot usually be identified a priori. (When they can, this is obviously the method of choice.) Therefore, feature selection must often be approached from a systems viewpoint, whereby the results of classification are used to try to identify the minimal set of necessary descriptors. This approach is shown by the feedback loop in Figure 2.
An early approach to feedback feature selection was weight-sign feature selection. Here, two weight vectors, initialized with each component equal to +1 or -1, respectively, were developed using error correction feedback training with identical training sets. A component by component comparison was made between the two trained weight vectors and those descriptors corresponding to weight vector component [...] discarded. This method was [...] data in several studies.
The variance feature selection method, described earlier, has been incorporated into ADAPT and has been used effectively on several types of data. The variance method allows rapid extraction of features responsible for linear separability. It is much superior to the weight-sign method in terms of speed and reliability.

Barbiturate Study

The set of compounds used in the present study consists of 160 5,5'-substituted barbiturates selected from a standard reference (29).
These compounds range in molecular weight from 172 to 276 and have duration times ranging from 10 minutes to 600 minutes. The method of administration was either intraperitoneal or subcutaneous, using mice, rats, or rabbits as test animals.
The compounds were grouped into classes according to the duration of depressant effect. These classes were formed by dividing the duration time expressed in minutes by ten. The resulting class designation was rounded up if the remainder was five or greater, and down otherwise. Thus a compound whose duration time was 227 minutes would be placed in class 23, whereas a compound having a duration time of 223 minutes would be placed into class 22. Compounds with a duration greater than 650 minutes were placed into class 65. This resulted in a total of 65 different classes which are distributed as shown in Fig. 3.
Three types of descriptors were employed for these studies: numeric fragment descriptors, substructural descriptors, and environmental descriptors. The descriptors were generated using the automated descriptor packages described previously. A list of the initial set of descriptors used is given in Table I. Each descriptor is contained in a minimum of 20% of the structures. In no case
Figure 2. Basic pattern recognition system for studies of structure-activity relationships. (Blocks: Numerical Descriptors, Preprocessing, Prior Feature Selection, Discriminant Function Development, Analysis of Results, with a Feedback Feature Selection loop.)

Figure 3. Histogram of barbiturate duration times for the drugs in the data set. (Horizontal axis: duration time, min.)
TABLE I. Molecular Structure Descriptors for the Barbiturate Data Set

ATOM AND BOND DESCRIPTORS
1   Number of atoms            2   Number of bonds
3   Number of Carbon atoms     4   Number of Nitrogen atoms
5   Number of Oxygen atoms     6   Number of single bonds
7   Number of double bonds     8   Length (a)

ENVIRONMENT DESCRIPTORS (b)
Descriptors   Atom Centered Fragment   Parameters
9 - 11        CH3-                     1, 2, 3
12 - 14       -CH2-                    1, 2, 3
15 - 17       -CH-                     1, 2, 3
18 - 23       [...]                    1, 2, 3; 1, 2, 3
24 - 26       -C- [...]                1, 2, 3
27 - 29       -HC=                     1, 2, 3
30 - 35       >C-                      1, 2, 3; 1, 2, 3

SUBSTRUCTURAL DESCRIPTORS
36   [...]
37   -CH(CH3)CH2-
38   CH3-
39   [...]
40   -CH2CH2-
41   CH3CH2CH2-
42   [...]
43   -HC=

(a) Length = 4*(Number of single bonds) + 2*(Number of double bonds)
(b) 1 = BED, 2 = WED, 3 = AED
does any one descriptor, or any binary combination of descriptors, contain sufficient information to successfully classify the data. Thus the active data set consists of 160 compounds each coded with 43 descriptors.
Preprocessing of the raw data prior to classification consisted of autoscaling so that each descriptor had an average of zero and a standard deviation of 127. This allowed the data to be truncated to integer values with a negligible loss of precision. (Loss of precision is known to be negligible as recalculation after truncation yielded a standard deviation of 127 and a mean of 0 ± 0.17.) Net retention of information was assured by testing the predictive ability for each descriptor before and after preprocessing. A value of 250 was used for X_{n+1} because it provided fast training and high predictive ability.
Since the data were collected from a series of studies on different animals, at different laboratories, it is not unreasonable to expect the classifications to differ. It was therefore felt that an error range [...] due to different classification [...] will develop a rule which answers the question, "Is the duration time less than x minutes?", where there is a deadzone of several minutes around this level.
Thus, to test for discrimination ability at a threshold level of 100 minutes using a thirty minute deadzone, all members from classes 1 through 10 would constitute one category, and all members from 14 through 65 would constitute the other category.
The linear learning machine was used to develop discriminant functions which bisect the data with as many different thresholds as possible, obtaining 100% recognition ability for each range. Attempts at such discrimination were accomplished using first a fifty, and later a thirty, minute error range.
To generate a preliminary estimate of the clustering and self consistency of the data the following experiment was done. Five training set/prediction sets were chosen with seven compounds in each prediction set and the remaining compounds in each training set. The overall data set is divisible into halves by 59 thresholds using 50 minute error ranges. All five training sets were used to develop independent discriminants at each of the 59 thresholds. These discriminants were then used to predict the seven unknowns in the respective prediction set. The class assignments were made by examining the sequence of responses produced by the 59 predictions; if only one change from answers of "greater than" to "less than" occurred, this point was taken as the predicted duration time. If there were several changes in predicted response, then the predicted duration time was taken as 30 minutes greater than the shortest duration time indicated by the first change in response.
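The class assignment rule and the deadzone split described for the barbiturate data can be written compactly, as in the hedged Python sketch below. The function names are hypothetical, and expressing the thirty minute deadzone as three excluded classes above the threshold class is an inference from the worked example (classes 1 through 10 versus 14 through 65), not a stated definition.

```python
def duration_class(minutes):
    """Class = duration / 10, rounded up when the remainder is five or more.

    Durations that would exceed class 65 are placed in class 65.
    """
    quotient, remainder = divmod(int(minutes), 10)
    assigned = quotient + (1 if remainder >= 5 else 0)
    return min(assigned, 65)


def category(assigned_class, threshold_class, deadzone_classes=3):
    """Two-category label for one threshold, or None inside the deadzone.

    -1 : class at or below the threshold (shorter duration)
    +1 : class above the deadzone (longer duration)
    """
    if assigned_class <= threshold_class:
        return -1
    if assigned_class > threshold_class + deadzone_classes:
        return 1
    return None  # compound is left out of training at this threshold


print(duration_class(227), duration_class(223))
print([category(c, threshold_class=10) for c in (10, 12, 14)])
```

Under this reading a threshold of 100 minutes corresponds to `threshold_class=10`, so classes 11 through 13 are withheld at that threshold.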
When this procedure was used, 19 of the 35 unknowns were classified as having duration times within 20 minutes of the actual value and 31 were classified as having duration times within 50 minutes of the actual value. The duration times
TABLE II. Final Sets of Molecular Structure Descriptors Supporting Linear Discriminant Functions at Thresholds I and II.

THRESHOLD I
ATOM AND BOND DESCRIPTORS: Number of Oxygen atoms; Number of double bonds; Number of single bonds
ENVIRONMENT DESCRIPTORS (a): CH3-(G,2); -CH-(G,1); -HC=(G,2); >C=(G,3), (C,1)
SUBSTRUCTURAL DESCRIPTORS: CH3-; -CH2; -CH2CH2-; CH3CH2-
Average Predictive Ability (b): 93.8%

THRESHOLD II
ATOM AND BOND DESCRIPTORS: Number of Oxygen atoms
ENVIRONMENT DESCRIPTORS (a): -HC-(G,1); -HC=(G,1); >C=(G,3); -C-(C,3)
SUBSTRUCTURAL DESCRIPTORS: CH3; -CH2CH2-; -CH2CH(CH3)2
Average Predictive Ability (b): 94.9%

(a) G = General search, C = Cyclic search, 1 = BED, 2 = WED, 3 = AED
(b) Predictive ability measured using leave one out procedure
of only four compounds were in error by more than the 50 minute error range used for each threshold. Thus this preliminary experiment showed that a set of linear classifiers working in concert could predict the duration times of the compounds in the data set reasonably accurately. Similar results were obtained for the 61 possible thresholds developed using a 30 minute error range.
In order to gain a better insight into these relationships two thresholds were subjected to exhaustive feature selection. The threshold I data include classes 1 through 10 and 14 through 65. The threshold II data include classes 1 through 24 and 28 through
65. The results of the selection process and a predictive ability estimation are reported in Table II.
Through application of the variance feature selection method, a set of features responsible for the separability of the data was found. Removing any of these descriptors results in the loss of linear separability. Therefore, the descriptors selected constitute a minimum set capable of supporting the relationship within the data. The predictive ability, estimated by the leave one out procedure (45), indicated that these features were capable of providing accurate information concerning the duration of barbiturate activity. Thus, it is clear that a relationship is present which is readily identified using the ADAPT system.
Further investigations using this data set have uncovered several interesting correlations. Details of the experimental results are reported elsewhere (46). What has been sought here is a clear demonstration of the utility of ADAPT in elucidating relations within [...] selection of the two specific [...] as was initial development of discriminants for 61 different classes. Clearly such studies would be inconvenient without the degree of organization provided by automation of the descriptive, storage, and pattern recognition techniques. The ADAPT system has consistently shown high utility in several areas and promises to continue to aid in the application of pattern recognition to problems in chemistry.
Literature Cited

1. Minsky, Marvin, Proc. IEEE, 49, 8 (1961).
2. Solomonoff, R.J., Proc. IEEE, 54, 1687 (1966).
3. Rosen, C.A., Science, 156, 38 (1967).
4. Nagy, George, Proc. IEEE, 56, 836 (1968).
5. Levine, M.D., Proc. IEEE, 57, 1391 (1969).
6. Nilsson, N.J., Learning Machines, McGraw-Hill Book Co., New York, 1965.
7. Tou, J.T. and Gonzalez, R.C., Pattern Recognition Principles, Addison-Wesley Publishing Co., Reading, Mass., 1974.
8. Kowalski, B.R. and Bender, C.F., Jour. Amer. Chem. Soc., 94, 5632 (1972); 95, 686 (1973).
9. Isenhour, T.L., Kowalski, B.R., Jurs, P.C., Crit. Rev. Anal. Chem., 4, 1 (1974).
10. Kowalski, B.R., "Pattern Recognition in Chemical Research," in Computers in Chemical and Biochemical Research, Vol. 2, C.E. Klopfenstein and C.L. Wilkins, Eds., Academic Press, New York, 1974.
11. Jurs, P.C. and Isenhour, T.L., Chemical Applications of Pattern Recognition, Wiley-Interscience, New York, 1975.
12. Kowalski, B.R. and Bender, C.F., Naturwissenschaften, 62, 10 (1975).
13. Kowalski, B.R., Anal. Chem., 47, 1152A (1975).
14. Jurs, P.C., Proceedings of the Workshop on Chemical Applications of Pattern Recognition, Washington, D.C., May 1975.
15. Hansch, C., Unger, S.H., Forsythe, A.B., Jour. Med. Chem., 16, 1217 (1973).
16. Hiller, S.A., et al., Comp. Biomed. Res., 6, 411 (1973).
17. Ting, K.-L.H., et al., Science, 180, 417 (1973).
18. Perrin, C.L., Science, 183, 551 (1974).
19. Clerc, J.T., Naegeli, P., Seibl, J., Chimia, 27, 639 (1973).
20. Adamson, G.W. and Bush, J.A., Nature, 248, 406 (1974).
21. Chu, K.C., Anal. Chem., 46, 1181 (1974).
22. Kowalski, B.R. and Bender, C.F., Jour. Amer. Chem. Soc., 96, 916 (1974).
23. Unger, S.H., Cancer Chem. Rpts., Part 2, 4(4), 45 (1974).
24. Chu, K.C., et al., Jour. Med. Chem., 18, 639 (1975).
25. Craig, P.N. and Waite, J.H., Analysis and Trial Application of Correlation Methodologies for Predicting Toxicity of Organic Chemicals, EPA Offic[...]
26. Stuper, A.J. and Jurs, P.C., Jour. Chem. Infor. Comp. Sci., 16, 99 (1976).
27. Brugger, W.E. and Jurs, P.C., Anal. Chem., 47, 781 (1975).
28. Usdin, E. and Efron, D.H., Psychotropic Drugs and Related Compounds, 2nd ed., DHEW Publication No. (HSM) 72-9074, 1972.
29. Doran, W.J., Medicinal Chemistry, Vol. IV, John Wiley and Sons, New York, 1959.
30. Amoore, J.E., Molecular Basis of Odor, Thomas, Springfield, Ill., 1970.
31. Engler, E.M., Andose, J.D., Schleyer, P. von R., Jour. Amer. Chem. Soc., 95, 8005 (1973).
32. Williams, J.E., Stang, P.J., Schleyer, P. von R., Ann. Rev. Phys. Chem., 19, 531 (1968).
33. Wipke, W.T., Dyott, T.M., Verbalis, J.G., Abstract, 161st American Chemical Society National Meeting, Los Angeles, CA, March 1971.
34. Wipke, W.T., Gund, P., Verbalis, J.G., Dyott, T.M., Abstract, 162nd American Chemical Society National Meeting, Washington, DC, September 1971.
35. Wipke, W.T., Gund, P., Dyott, T.M., Verbalis, J.G., unpublished manuscript.
36. Buffa, E.S. and Taubert, W.H., "Production-Inventory Systems, Planning and Control," Rev. Ed., R.D. Irwin, Inc., Homewood, Ill., 1972.
37. Brugger, W.E., Stuper, A.J., Jurs, P.C., Jour. Chem. Infor. Comp. Sci., 16, 105 (1976).
38. Sussenguth, E.H., Jr., Jour. Chem. Doc., 5, 36 (1965).
39. Ming, T.-K. and Tauber, S.J., Jour. Chem. Doc., 11, 47 (1971).
40. Figueras, J., Jour. Chem. Doc., 12, 237 (1972).
41. Zander, G.S. and Jurs, P.C., Anal. Chem., 47, 1562 (1975).
42. Bondi, A., Jour. Phys. Chem., 68, 441 (1964).
43. Dixon, W.J., Ed., BMD-Biomedical Computer Programs, 3rd Ed., Univ. of Calif. Press, Berkeley, CA, 1973.
44. Pietrantonio, L. and Jurs, P.C., Pattern Recog., 4, 391 (1972).
45. Lachenbruch, P.A. and Mickey, M.R., Technometrics, 10, 1 (1968).
46. Stuper, A.J. and Jurs, P.C., submitted for publication.
10
Enthalpy-Entropy Compensation: An Example of the Misuse of Least Squares and Correlation Analysis
R. R. KRUG, Chevron Research Co., Richmond, CA
W. G. HUNTER, Statistics Department and Engineering Experiment Station, University of Wisconsin, Madison, WI 53706
R. A. GRIEGER-BLOCK, Chemical Engineering Department, University of Wisconsin, Madison, WI 53706
Whether or not [...] exists between reaction or equilibrium enthalpies and entropies has been the subject of chemical investigations for many years. Hinshelwood collected a large body of data during the early years of modern kinetic theory to probe for possible functional dependencies between the Arrhenius parameters (1-3). Many of these and subsequent experimental investigations have led to findings that estimated enthalpies varied linearly with estimated entropies. Many chemical theories have been proposed to explain, in chemical terms, why such linear correlations should occur. Linear enthalpy-entropy compensation is now widely accepted as occurring because of chemical factors and is mentioned in many standard chemistry texts (4-8).
In the past few decades, first chemists (9-16) and later statisticians (17-24) have begun to doubt that all enthalpy-entropy compensations arise as a result of chemical factors alone. In particular, as the compensation temperature, the slope of a compensation line in ΔH-ΔS coordinates, approached the range of experimental temperatures, the chemical causality of such correlations was questioned. The debate over which observed correlations were caused by chemical factors and which were caused by nonchemical factors (i.e., data handling artifacts that result from the propagation of errors) apparently has not been adequately resolved to date because enthalpy-entropy compensations are still reported and justified merely by the significance of the estimated correlation coefficient.
In this article we summarize and generalize our earlier results (25-27) that indicate that the significance of an estimated correlation coefficient in the enthalpy-entropy plane is not justification for
the detection of a chemically caused compensation, but the significance of an estimated correlation coefficient in the enthalpy-free energy plane with estimates evaluated at the harmonic mean of the experimental temperatures is strong justification for the detection of a chemical effect. This conclusion results from the fact that there is a linear statistical compensation effect that is confounded with whatever chemical compensation might be detected in the enthalpy-entropy plane. We also present the regression algorithm for the estimation of the chemical compensation temperature from an observed correlation in ΔH-ΔG_{T_hm} coordinates.
Chemical Theory

The rigorous thermodynamic [...] arguments of Laidler (23), Hammett (5), Leffler (7,16), and Ritchie and Sager (29) all suggest a generally nonlinear functional relationship between enthalpies and entropies. To illustrate this result, we call upon the statistical mechanical definitions used by Ritchie and Sager. The entropy of a system and the enthalpy of a system can be written in terms of the sums of energy states that the system occupies.

\[
S = R \ln \sum_i g_i \exp(-\varepsilon_i/kT) + R\,\frac{\sum_i g_i\,(\varepsilon_i/kT)\exp(-\varepsilon_i/kT)}{\sum_i g_i \exp(-\varepsilon_i/kT)}
\]

\[
H = RT\,\frac{\sum_j g_j\,(\varepsilon_j/kT)\exp(-\varepsilon_j/kT)}{\sum_j g_j \exp(-\varepsilon_j/kT)}
\]

If we take as the system a chemical plus its solvent undergoing reaction or equilibrium, two systematic variations that will cause coincidental variations in enthalpies and entropies are homologous variations of either solvent composition (e.g., from polar to nonpolar) or substituents (e.g., from electron releasing to electron withdrawing). Passing through the homologous series the energy states occupied by the system will vary in a systematic manner. Since the same energy states define all thermodynamic functions of the system, the thermodynamic parameters (including enthalpy and entropy) will also vary in a systematic manner such that a plot of enthalpy versus entropy, say, would reveal a systematic variation.
That a systematic variation should be linear is not obvious from the definitions, however. We may assume that if a resultant variation
is over a special region or is sufficiently short, the plotted variation may appear to be linear. It is important to note that this would be a linear segment of an otherwise nonlinear function. Such a linear variation of enthalpy-entropy pairs (ΔH, ΔS) is generally summarized as

\[
\Delta H = \beta\,\Delta S + \Delta G_\beta
\]

where the slope, β, has the dimension of temperature and is alternately called the compensation temperature, isokinetic temperature or isoequilibrium temperature depending on whether the thermodynamic parameters were estimated from kinetic or equilibrium data. The physical significance of the compensation temperature is that at this temperature a variation in enthalpy is entirely compensated for by a corresponding variation in entropy such that [...] constant. To be consistent [...] equation, the intercept of such a linear relationship is the free energy at the compensation temperature, ΔG_β (9.fJL3).

Statistical Theory

Historically, compensation temperatures have been determined by least squares (or best graphical fit, which is essentially least squares without the computational rigor) and the goodness of fit has been justified by the high significance of the estimated correlation coefficients between the enthalpy and entropy estimates. Both of these procedures are incorrect and, particularly in this case, often lead to grossly incorrect results.
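A small simulation makes the danger concrete. The Python sketch below is an illustration under stated assumptions, not the authors' regression algorithm: it assumes the van't Hoff form ln K = -ΔH/(RT) + ΔS/R, generates many data sets that all share the same true ΔH and ΔS (so no chemical compensation exists), adds normally distributed error to ln K, and then regresses the resulting ΔH estimates on the ΔS estimates. The temperatures, parameter values, error level, and function names are hypothetical.

```python
import math
import random

R = 8.314  # gas constant, J/(mol K)


def fit_line(x, y):
    """Ordinary least squares; returns (slope, intercept)."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((a - x_mean) ** 2 for a in x)
    sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def correlation(x, y):
    """Sample correlation coefficient."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((a - x_mean) ** 2 for a in x)
    syy = sum((b - y_mean) ** 2 for b in y)
    sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def simulate_estimates(temps, dH, dS, sigma, n_sets, seed=0):
    """Estimate (dH, dS) from n_sets noisy data sets with identical true values."""
    rng = random.Random(seed)
    inv_t = [1.0 / t for t in temps]
    estimates = []
    for _ in range(n_sets):
        ln_k = [-dH / (R * t) + dS / R + rng.gauss(0.0, sigma) for t in temps]
        slope, intercept = fit_line(inv_t, ln_k)
        estimates.append((-R * slope, R * intercept))  # (dH estimate, dS estimate)
    return estimates


temps = [290.0, 300.0, 310.0, 320.0, 330.0]
estimates = simulate_estimates(temps, dH=50000.0, dS=20.0, sigma=0.05, n_sets=500)
dH_hat = [e[0] for e in estimates]
dS_hat = [e[1] for e in estimates]

beta, _ = fit_line(dS_hat, dH_hat)                 # apparent compensation temperature
r = correlation(dS_hat, dH_hat)                    # apparent correlation
t_hm = len(temps) / sum(1.0 / t for t in temps)    # harmonic mean temperature
print(f"slope = {beta:.1f} K, harmonic mean T = {t_hm:.1f} K, r = {r:.4f}")
```

Although every simulated data set has the same true enthalpy and entropy, the estimates fall on a nearly perfect line whose slope lies close to the harmonic mean of the experimental temperatures, which is the statistical compensation effect this chapter warns about.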
It is important to remember that the enthalpy-entropy data pairs are actually estimates $(\Delta\hat H, \Delta\hat S)$, not original data that can be treated as either independent or as being relatively free from error, as might be rationalized for original laboratory data, for example, kinetic rate constant-temperature data $(k, T)$ or chemical equilibrium constant-temperature data $(K, T)$. The enthalpy and entropy estimates, $\Delta\hat H$ and $\Delta\hat S$, both contain uncertainty, and hence least squares is an improper technique for regression of a functional dependence of one on the other. What is worse is that these estimates are highly correlated with one another due to their functional relationships with the kinetic or equilibrium constants and the experimental ranges over which the data were sampled. Hence a correlation analysis might detect a significant
correlation that results from these computations as data handling artifacts, even in the absence of any chemical effect. The statistical and chemical compensations need not be hopelessly confounded, however, because the scientist has knowledge of both the chemical identities and his choice of experimental sampling points prior to analysis.

Using the fundamental definitions of chemical kinetics and regression analysis, we will now show (1) that enthalpy-entropy estimates are highly correlated, (2) that the statistical compensation equation is functionally identical to the chemical compensation equation, (3) how to separate the chemical from the statistical effect, and (4) how to estimate the chemical compensation temperature and its $(1-\alpha)$ confidence interval from kinetic or equilibrium data. To avoid redundancy, we will discuss only the case of kinetic data, but include the computational details for equilibrium data as well in the Regression Algorithm.

In this discussion, we must make the usual assumptions that the errors associated with the dependent variable, the logarithm of the kinetic observations, $y_{ij} = \ln k_{ij}$, are normally and independently distributed with zero mean and constant variance, $\varepsilon \sim \mathrm{NID}(0, \sigma^2)$, and that the independent variable, the inverse experimental temperatures $x_i = 1/T_i$, have no uncertainty. That is, in practice the experimental temperatures are determined with much greater precision and accuracy than are the rate constants. To formalize this analysis we consider data taken at $1 \le i \le n$ temperatures for $1 \le j \le m$ members of a homologous series. The kinetic observations are related to the experimental temperatures by the linearized Arrhenius relationship

$$y_{ij} = \{\ln A\}_j - \{E/R\}_j\,x_i + \varepsilon_{ij}$$

or more simply by

$$y_j = X\theta_j + \varepsilon_j$$

where the observation vector is $y_j' = (\ln k_{1j}, \ln k_{2j}, \ldots, \ln k_{nj})$, the parameter vector is $\theta_j' = (\{\ln A\}_j, \{-E/R\}_j)$ and the design matrix is
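$$X' = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 1/T_1 & 1/T_2 & \cdots & 1/T_n \end{bmatrix}$$

if all the $m$ data sets are taken at the same $1 \le i \le n$ temperatures. The enthalpy and entropy estimates are related to the $\theta$-estimates by

$$\Delta S^{\ddagger}_j = R\{\ln A\}_j - R\ln(kTe/h) = R\theta_{1j} + C_1$$

$$\Delta H^{\ddagger}_j = E_j - RT = -R\theta_{2j} + C_2,$$

which may be summarized by

$$\psi_j = Z\theta_j + C$$

where the thermodynamic parameter vector is $\psi_j' = (\Delta S^{\ddagger}_j, \Delta H^{\ddagger}_j)$, the additive constant vector is $C' = (-R\ln(kTe/h),\, -RT)$ and the matrix $Z$ is

$$Z = \begin{bmatrix} R & 0 \\ 0 & -R \end{bmatrix}.$$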
Given that the rate constants are measurable to within an experimental error $\varepsilon_j$, the usual linear regression problem is

$$y_j = X\theta_j + \varepsilon_j, \qquad \text{where } \varepsilon_j \sim \mathrm{NID}(0, \sigma_j^2),$$

with the least squares solution

$$\hat\theta_j = (X'X)^{-1}X'y_j$$

$$\hat\psi_j = Z\hat\theta_j + C$$

for each member $1 \le j \le m$ of the homologous series. Unfortunately, this is where the proper application of regression analysis usually ends, and experimenters improperly try to fit the $(\Delta\hat H, \Delta\hat S)$ estimates to a linear relationship. Since $\varepsilon_j \sim \mathrm{NID}(0, \sigma_j^2)$, all information about $\psi_j$ is contained in the estimated first and second moments. The first moment, the maximum likelihood estimator (MLE) or least squares estimate (LSE), has been properly calculated in the
past. Ignorance of the second moment has led to incorrect conclusions, however. The second moment can be illustrated graphically as a joint likelihood region $L(\psi_j \mid y_j, X)$ for the location of parameter values $\psi_j$ given the data $y_j$ and experimental design $X$, or as a joint probability region $p(\hat\psi_j \mid \psi_j, \sigma^2, X)$ for the location of the estimates $\hat\psi_j$ given the true value $\psi_j$, the variance $\sigma^2$ and the choice of experimental settings $X$.
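The least squares step and the mapping from $\hat\theta_j$ to $\hat\psi_j$ described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the choice of kcal units for $R$, and the noise-free synthetic rate constants are assumptions introduced here and do not come from the chapter.

```python
import math

R = 1.987204e-3            # gas constant in kcal/(mol K); unit choice is an assumption
K_B = 1.380649e-23         # Boltzmann constant, J/K
H_PLANCK = 6.62607015e-34  # Planck constant, J s


def arrhenius_fit(temps, rate_constants):
    """Least squares of ln k on 1/T; returns theta = (ln A, -E/R)."""
    x = [1.0 / t for t in temps]
    y = [math.log(k) for k in rate_constants]
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(a * b for a, b in zip(x, y))
    det = n * sxx - sx * sx                 # determinant |X'X|
    theta1 = (sxx * sy - sx * sxy) / det    # intercept, ln A
    theta2 = (n * sxy - sx * sy) / det      # slope, -E/R
    return theta1, theta2


def activation_parameters(theta, temp):
    """psi = Z theta + C, i.e. (Delta S, Delta H) at temperature temp."""
    theta1, theta2 = theta
    delta_s = R * theta1 - R * math.log(K_B * temp * math.e / H_PLANCK)
    delta_h = -R * theta2 - R * temp
    return delta_s, delta_h


# Hypothetical, noise-free rate constants generated from ln A = 20, E/R = 8000 K.
temps = [298.0, 308.0, 318.0]
rates = [math.exp(20.0 - 8000.0 / t) for t in temps]
theta_hat = arrhenius_fit(temps, rates)
delta_s, delta_h = activation_parameters(theta_hat, 308.0)
```

With noise-free input the fit simply recovers the generating parameters; the statistical questions raised in the text arise only once measurement error is added to the rate constants.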
Such joint confidence regions are displayed in Figure 1a for a single member $j = 1$ of a homologous series. To understand how $m = 7$ such least squares estimates (as plotted in Figure 1b) would be distributed in the absence of a chemical effect (or, equivalently, if the variation of rate constants by measurement errors were much greater than the variation by a chemical effect), we consider the following analysis using a sampling theory approach. Since $y_j = X\theta + \varepsilon$ and $\varepsilon \sim \mathrm{NID}(0, \sigma^2)$, then $y_j \sim \mathrm{NID}(X\theta, \sigma^2)$, setting $\sigma_j^2 = \sigma^2$ and $\theta_j = \theta$ for all $j$ because we are assuming all data originate from the same source. The probability distribution of observations for the $j$th data set is

$$p_j(y_j \mid \theta, \sigma^2, X) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{-\frac{1}{2\sigma^2}(y_j - X\theta)'(y_j - X\theta)\right\}.$$

Since $\hat\theta_j$ is a linear combination of $y_j$, $\hat\theta_j = (X'X)^{-1}X'y_j$, then $\hat\theta_j \sim \mathrm{NID}(\theta, (X'X)^{-1}\sigma^2)$ and the probability distribution of the $\theta$-estimates is

$$p_j(\hat\theta_j \mid \theta, \sigma^2, X) = \frac{|X'X|^{1/2}}{2\pi\sigma^2} \exp\left\{-\frac{1}{2\sigma^2}(\hat\theta_j - \theta)'X'X(\hat\theta_j - \theta)\right\}.$$

Finally, $\hat\psi_j = Z\hat\theta_j + C$ is a linear combination of $\hat\theta_j$, so $\hat\psi_j \sim \mathrm{NID}(\psi, Z(X'X)^{-1}Z\sigma^2)$, yielding

$$p_j(\hat\psi_j \mid \psi, \sigma^2, X) = \frac{|Z^{-1}X'XZ^{-1}|^{1/2}}{2\pi\sigma^2} \exp\left\{-\frac{1}{2\sigma^2}(\hat\psi_j - \psi)'Z^{-1}X'XZ^{-1}(\hat\psi_j - \psi)\right\}.$$

The distribution of $m = 7$ such estimates is given by
Figure 1. (a) The Arrhenius or van't Hoff regression problem in ln k versus 1/T coordinates and the resultant joint likelihood and probability regions in the coordinates of the estimated parameters. (b) The Arrhenius or van't Hoff regression problem in ln k versus 1/T coordinates for the case of m = 7 replicates at each of n = 4 temperatures and the joint probability region for the seven parameter estimates (designated by +) given the true value (designated by *) in the coordinates of the estimated parameters. Notice that the ratio a/b of major to minor axes of each ellipse in (a) and (b) is identical and a function only of the choice of experimental temperatures.
$$p_{m=7}(\hat\psi \mid \psi, \sigma^2, X) = \prod_{j=1}^{m=7} p_j(\hat\psi_j \mid \psi, \sigma^2, X).$$
This probability distribution contains all the information available about the thermodynamic parameter estimates $\hat\psi' = (\Delta\hat S, \Delta\hat H)$ for a homologous chemical series for which the variation of rate constants is due much more to measurement errors than to an extrathermodynamic relationship. We will now use this information to derive the statistical compensation equation and to show that such estimates are highly correlated for the usual experimental temperature ranges even in the absence of an extrathermodynamic effect.

The Correlation Coefficient. The correlation coefficient for a single data pair $(\Delta\hat S_j, \Delta\hat H_j)$ and for a complete $j = m$ data set $(\Delta\hat S, \Delta\hat H)$ are determined from the elements of their respective variance-covariance matrices

$$V(\hat\psi_j) = Z(X'X)^{-1}Z\,\sigma^2 = \frac{R^2\sigma^2}{|X'X|} \begin{bmatrix} \sum (1/T)^2 & \sum 1/T \\ \sum 1/T & n \end{bmatrix}$$

$$V(\hat\psi) = Z(X'X)^{-1}Z\,\frac{\sigma^2}{m} = \frac{R^2\sigma^2}{m\,|X'X|} \begin{bmatrix} \sum (1/T)^2 & \sum 1/T \\ \sum 1/T & n \end{bmatrix}$$

in which the rows and columns are ordered as $(\Delta\hat S, \Delta\hat H)$.
For either case, the correlation coefficient is the same. For the complete data set, however, the joint probability region has decreased in area because the variance has decreased from $\sigma^2$ to $\sigma^2/m$. The correlation coefficient between enthalpy and entropy estimates for any series for which the variation of rate constants by measurement errors is much greater than that by the effect of an extrathermodynamic relationship is

$$\rho = \frac{\mathrm{Cov}(\Delta\hat H, \Delta\hat S)}{\sqrt{V(\Delta\hat H)\,V(\Delta\hat S)}} = \frac{\sum 1/T}{\sqrt{n\sum (1/T)^2}}$$

where the estimated correlation coefficient

$$r = \frac{\sum (\Delta\hat H_j - \langle\Delta\hat H\rangle)(\Delta\hat S_j - \langle\Delta\hat S\rangle)}{\sqrt{\sum (\Delta\hat H_j - \langle\Delta\hat H\rangle)^2 \sum (\Delta\hat S_j - \langle\Delta\hat S\rangle)^2}}$$

is an estimate of the population parameter $\rho$:

$$\lim_{m\to\infty} r = \rho.$$

Thus, if a correlation coefficient $r$ is estimated from $(\Delta\hat H, \Delta\hat S)$ data pairs such that the confidence interval for $r$ includes the value of $\rho$, the linear distribution of enthalpy-entropy estimates is probably due to the propagation of measurement errors and not due to any detectable extrathermodynamic effect. That $\rho$ should be near unity is illustrated in Figure 2 for data taken on the oximation of methyl thymyl ketone (30). A measurement error in one rate constant upon replication would result in a slightly different slope estimate, and the intercept estimate at $1/T = 0$ would change correspondingly. The very high correlation, $\rho = 0.99991$, between the slope estimate and the intercept estimate is a consequence of the fact that the data are taken far from the origin over a very narrow temperature range on an absolute scale.

The Statistical Compensation Equation. As shown in Figure 1, the shape and orientation of $L_j(\psi_j \mid y_j, X)$, $p_j(\hat\psi_j \mid \psi, \sigma, X)$ and $p_7(\hat\psi \mid \psi, \sigma, X)$ are all identical; merely the size and location change from one to the other. In particular, the shapes of the joint confidence regions are elliptical because the fitted model is linear, and the orientations are a function of the experimental design, which is the choice of experimental temperatures at which the rate constants were measured. The ratio of major to minor axes of these elliptic regions is determined by a consideration of the ratio of eigenvalues of $Z^{-1}X'XZ^{-1}$ such that $\lambda_1 > \lambda_2$ and is found to a good approximation to be
Figure 2. Geometric interpretation of the parameter estimates. The indicated lengths are proportional to ΔG‡ and ΔS‡ and the indicated slope is proportional to −ΔH‡. These data for the oximation of methyl thymyl ketone (30) indicate a strong dependence of the intercept estimate on the slope estimate because the data were taken over a very small temperature range far from the origin. The three data points are designated by dots. Reprinted with copyright permission by Nature (25) and the Journal of Physical Chemistry (26).
$$a/b = \sqrt{\lambda_1/\lambda_2} \approx \sqrt{n \Big/ \sum (1/T - \langle 1/T\rangle)^2}.$$
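To get a feel for the magnitudes involved, the expressions for $\rho$ and for the axis ratio $a/b$ can be evaluated for a set of experimental temperatures. The sketch below is illustrative only: the function names and the three temperatures (298, 308, and 318 K) are hypothetical choices, not data from the chapter.

```python
import math


def error_correlation(temps):
    """rho = sum(1/T) / sqrt(n * sum((1/T)^2)) for (Delta H, Delta S) estimates."""
    x = [1.0 / t for t in temps]
    n = len(x)
    return sum(x) / math.sqrt(n * sum(v * v for v in x))


def axis_ratio(temps):
    """Approximate a/b = sqrt(n / sum((1/T - <1/T>)^2))."""
    x = [1.0 / t for t in temps]
    n = len(x)
    mean_x = sum(x) / n
    return math.sqrt(n / sum((v - mean_x) ** 2 for v in x))


# Hypothetical design: three temperatures spanning 20 K near room temperature.
temps = [298.0, 308.0, 318.0]
rho = error_correlation(temps)
ratio = axis_ratio(temps)
```

For this hypothetical design the correlation exceeds 0.999 and the axis ratio is of the order of $10^4$, so the ellipses would be indistinguishable from line segments on any ordinary plot.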
For the usual experimental temperature ranges of organic chemistry this ratio is usually of the order of $10^4$. Hence, the joint probability regions appear as line segments and are well characterized by the line that describes the major axis of the ellipse. A canonical analysis of the dispersion matrix $Z^{-1}X'XZ^{-1}$ reveals that this line is

$$\Delta H = T_{hm}\,\Delta S + \Delta G_{T_{hm}}$$

and differs from the extrathermodynamic equation, $\Delta H = \beta\,\Delta S + \Delta G_\beta$, only in the value of the slope parameter. Hence, if the estimate of the compensation temperature $\beta$ is near the harmonic mean of the experimental temperatures $T_{hm}$, the compensation that is detected might only be the statistical compensation between parameter estimates that occurs because the range of the independent variable was too small to validate the extrathermodynamic model in this parameter space.

Separation of the Chemical from the Statistical Compensation. Because any extrathermodynamic effect is strongly confounded with the statistical compensation effect in the enthalpy-entropy parameter space for the usual ranges of experimental temperatures used in organic chemistry, biochemistry, and even heterogeneous catalysis, some statisticians have attempted to solve the problem for the value of the compensation temperature in the original ln k versus 1/T space (19,20,22,24). The resulting normal equations yield unwieldy nonlinear solutions that are better for the detection of the presence of an extrathermodynamic effect than for obtaining good numerical values of a compensation temperature. Others (15,31-34) have proposed criteria to determine if an observed compensation is of chemical origin or is just the statistical artifact. We find that the two compensations are separable through a translation of the intercept and that the compensation temperature and its confidence interval can be solved for exactly using likelihood theory and the chemical Maxwell equations.

The problem is to choose an intercept for which the slope and intercept estimates are not correlated. The intercept at the arithmetic mean of the independent variable has this property. Thus, we rewrite the linearized Arrhenius equation in the form
$$\ln k_{ij} = \{\ln A - E/RT_{hm}\}_j - \{E/R\}_j\,(1/T_i - \langle 1/T\rangle) + \varepsilon_{ij}$$

where the independent variable is now $(1/T_i - \langle 1/T\rangle)$.
The slope is still a measure of the enthalpy, but the intercept is now a measure of the free energy at the harmonic mean of the experimental temperatures,

$$\Delta G_{T_{hm}} = -RT_{hm}\{\ln A - E/RT_{hm}\} + RT_{hm}\ln(kT_{hm}e/h) - RT_{hm}.$$

The model is now

$$\eta_j = X\theta_j = W\zeta_j, \qquad y_j = \eta_j + \varepsilon_j$$

where the parameter vector is $\zeta_j' = (\{\ln A - E/RT_{hm}\}_j, \{-E/R\}_j)$ and the design matrix is

$$W' = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 1/T_1 - \langle 1/T\rangle & 1/T_2 - \langle 1/T\rangle & \cdots & 1/T_n - \langle 1/T\rangle \end{bmatrix}.$$
The slope and intercept parameters are related to the thermodynamic parameters by

$$\psi_j = A\zeta_j + B$$

where the thermodynamic parameter vector is $\psi_j' = (\Delta G_{T_{hm}}, \Delta H)_j$, the additive constant vector is

$$B' = (RT_{hm}\ln(kT_{hm}e/h) - RT_{hm},\; -RT)$$

and

$$A = \begin{bmatrix} -RT_{hm} & 0 \\ 0 & -R \end{bmatrix}.$$

Proceeding as before, we determine that there is no correlation between $\Delta\hat G_{T_{hm}}$ and $\Delta\hat H$ estimates after consideration of the variance-covariance matrix

$$V(\hat\psi) = A(W'W)^{-1}A\,\frac{\sigma^2}{m} = \frac{R^2\sigma^2}{m\,|W'W|} \begin{bmatrix} T_{hm}^2 \sum (1/T - \langle 1/T\rangle)^2 & 0 \\ 0 & n \end{bmatrix}.$$
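$$\rho = \frac{\mathrm{Cov}(\Delta\hat G_{T_{hm}}, \Delta\hat H)}{\sqrt{V(\Delta\hat G_{T_{hm}})\,V(\Delta\hat H)}} = \frac{0}{T_{hm}\sqrt{n\sum (1/T - \langle 1/T\rangle)^2}} = 0.$$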
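The effect of moving the intercept to the mean of $1/T$ can be checked numerically. The following sketch computes the correlation between the intercept and slope estimates from the off-diagonal element of the inverse design-product matrix, with and without centering. The function name and temperatures are hypothetical, introduced only for illustration.

```python
import math


def intercept_slope_correlation(temps, centered):
    """Correlation of the (intercept, slope) estimates for ln k versus 1/T.

    The inverse of the 2x2 design product is proportional to
    [[sum(x^2), -sum(x)], [-sum(x), n]], so the correlation is
    -sum(x) / sqrt(n * sum(x^2)).  Multiplying by Z = diag(R, -R)
    flips the sign, giving +rho for the (Delta S, Delta H) estimates.
    """
    x = [1.0 / t for t in temps]
    n = len(x)
    if centered:
        mean_x = sum(x) / n
        x = [v - mean_x for v in x]      # regress on (1/T - <1/T>) instead
    return -sum(x) / math.sqrt(n * sum(v * v for v in x))


temps = [298.0, 308.0, 318.0]            # hypothetical experimental temperatures
corr_raw = intercept_slope_correlation(temps, centered=False)
corr_centered = intercept_slope_correlation(temps, centered=True)
```

Without centering the two estimates are almost perfectly correlated; with centering the correlation vanishes up to rounding error, which is the property exploited in the text.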
The ratio of variances between these estimates is found to be a constant that depends only on the choice of experimental temperatures,

$$\lambda_j = \frac{\sigma^2_{\Delta H_j}}{\sigma^2_{\Delta G_j}} = \frac{n}{T_{hm}^2 \sum (1/T - \langle 1/T\rangle)^2} = \frac{(\sum 1/T)^2}{n\sum (1/T - \langle 1/T\rangle)^2}$$

such that if the same experimental temperatures are chosen for all experiments, all such estimate pairs will have the same ratio of variances (i.e., $\lambda_j = \lambda$ for all $j$ if $T_{ij} = T_i$ for all $j$). Since the Maxwell equations are linear relationships between the thermodynamic potentials $H$, $G$, $E$, and $A$ and the properties $S$, $T$, $P$, and $V$, an extrathermodynamic linear relationship between any two must also be reflected by a linear extrathermodynamic relationship between any other two. In particular, if an extrathermodynamic relationship

$$\Delta H = \beta\,\Delta S + \Delta G_\beta$$

exists, then by substitution with the Gibbs equation,

$$\Delta H = \gamma\,\Delta G + (1-\gamma)\,\Delta G_\beta$$

where the diagnostic parameter $\gamma$ is related to the compensation temperature $\beta$ by

$$\gamma = 1/(1 - T/\beta)$$

and

$$\Delta G_\beta = \Delta H - \beta\,\Delta S = \Delta H + \gamma T\,\Delta S/(1-\gamma).$$
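The variance ratio $\lambda$ and the change of variables between $\beta$ and $\gamma$ are simple enough to check by direct computation. The sketch below is a hedged illustration: the function names and all numerical values (temperatures, $\beta$, $\Delta G_\beta$, $\Delta S$) are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def variance_ratio(temps):
    """lambda = (sum 1/T)^2 / (n * sum((1/T - <1/T>)^2))."""
    x = [1.0 / t for t in temps]
    n = len(x)
    mean_x = sum(x) / n
    return sum(x) ** 2 / (n * sum((v - mean_x) ** 2 for v in x))


def gamma_from_beta(beta, temp):
    """Diagnostic slope gamma = 1 / (1 - T / beta)."""
    return 1.0 / (1.0 - temp / beta)


def beta_from_gamma(gamma, temp):
    """Inverse mapping beta = T / (1 - 1 / gamma)."""
    return temp / (1.0 - 1.0 / gamma)


lam = variance_ratio([298.0, 308.0, 318.0])   # hypothetical temperatures

# Hypothetical compensation: beta = 400 K, Delta G_beta = 20, evaluated at T = 300 K.
beta, temp, dg_beta, ds = 400.0, 300.0, 20.0, 0.01
dh = beta * ds + dg_beta                      # Delta H = beta * Delta S + Delta G_beta
dg = dh - temp * ds                           # Gibbs equation
gamma = gamma_from_beta(beta, temp)
residual = dh - (gamma * dg + (1.0 - gamma) * dg_beta)
```

Here `gamma` evaluates to 4, `beta_from_gamma(gamma, temp)` returns the original 400 K, and `residual` is zero, confirming that the two forms of the compensation equation describe the same line.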
Thus, if $\gamma \ne 1$, the Gibbs equation is insufficient to explain detected chemical behavior and an extrathermodynamic effect is detected. No statistical compensation exists between $\Delta G_{T_{hm}}$ and $\Delta H$, where $\Delta H$ may be evaluated at any temperature, including $T = T_{hm}$. To test the null hypothesis, $H_0$: $\gamma = 1$, $\Delta H$ must be regressed on $\Delta G_{T_{hm}}$ to estimate the slope $\gamma$ and intercept $(1-\gamma)\Delta G_\beta$. Least squares is an incorrect procedure, because there is uncertainty in both variables. The errors are uncorrelated, however, and the ratio of variances is known (see Figure 3a). The likelihood function is maximized in this case by

$$\min_{a,\gamma} \sum_{j=1}^{m} \frac{(\Delta\hat H_j - \gamma\,\Delta\hat G_{T_{hm},j} - a)^2}{\lambda + \gamma^2}.$$

This type of problem was first solved for the slope estimate by Lindley (35) and later commented on by others (36-38). To obtain a confidence interval for $\gamma$ (and hence $\beta$), the distribution of $\hat\gamma$ must be determined. Creasy (39) solved this type of problem in transformed coordinates, which correspond in our case to $\Delta H$ versus $\sqrt{\lambda}\,\Delta G_{T_{hm}}$ (see Figure 3b), in which the joint probability regions are circular, that is, the errors propagate randomly with no preferential direction. From the distribution of the correlation coefficient, the distribution of the angle $\phi$ that a regression line would make through such a plane is determined. The distribution of the slope $\gamma$ is determined from the relationship between $\gamma$ and $\phi$. For our case, we extend this line of reasoning one more step, and from the relationship between $\gamma$ and $\beta$, the distribution and hence the maximum likelihood value and confidence interval of the compensation temperature $\beta$ are determined.
Figure 3. (a) The linear regression problem in $\Delta H$-$\Delta G_{T_{hm}}$ coordinates is one for which joint confidence regions $L(\psi \mid y, X)$ have a constant ratio $\lambda$ of major to minor axes when the data are sampled at identical temperatures. (b) In $\Delta H$-$\sqrt{\lambda}\,\Delta G_{T_{hm}}$ coordinates the maximum likelihood fit to a line is found by minimizing the sum of squares of residuals which are the perpendiculars to the regression line. Creasy (39) solved for the distribution of the slope estimate from the distribution of the correlation coefficient in similar coordinates.

The Regression Algorithm. Given kinetic or equilibrium data, $(k, T)$ or $(K, T)$, the following algorithm may be used to obtain maximum likelihood estimates and their $(1-\alpha)$ confidence intervals for $\phi$, $\gamma$, $a$, $\Delta G_\beta$ and the compensation temperature $\beta$ for a homologous series of chemical data with $1 \le i \le n$ temperatures.

1. Regress $y_{ij} = \ln k_{ij}$ or $\ln K_{ij}$ onto $(1/T_i - \langle 1/T\rangle)$ to obtain parameter estimates $\hat\zeta_j$ and residual mean square $s_j^2$:

$$\hat\zeta_{1j} = \frac{1}{n}\sum_{i=1}^{n} y_{ij}$$

$$\hat\zeta_{2j} = \frac{\sum_{i=1}^{n} y_{ij}\,(1/T_i - \langle 1/T\rangle)}{\sum_{i=1}^{n} (1/T_i - \langle 1/T\rangle)^2}$$

and

$$s_j^2 = \frac{\sum_{i=1}^{n} \left(y_{ij} - \hat\zeta_{1j} - \hat\zeta_{2j}(1/T_i - \langle 1/T\rangle)\right)^2}{n-2}, \qquad T_{hm} = n \Big/ \sum 1/T_i = \langle 1/T\rangle^{-1}.$$
2. Then calculate enthalpy and free energy estimates from the slope and intercept estimates. For kinetic data

$$\Delta\hat G_{T_{hm},j} = -RT_{hm}\,\hat\zeta_{1j} + RT_{hm}\ln(kT_{hm}e/h) - RT_{hm}$$

$$\Delta\hat H_j = -R\,\hat\zeta_{2j} - RT$$

and for equilibrium data

$$\Delta\hat G_{T_{hm},j} = -RT_{hm}\,\hat\zeta_{1j}$$

$$\Delta\hat H_j = -R\,\hat\zeta_{2j}.$$
3. The data may be plotted with joint confidence regions determined by the elliptic equation

$$(\psi_j - \hat\psi_j)'A^{-1}W'WA^{-1}(\psi_j - \hat\psi_j) = 2s_j^2\,F(2, n-2, 1-\alpha)$$

or the data may be plotted along with standard deviation increments from the maximum likelihood estimates from Step 2,

$$s_{\Delta H_j} = \sqrt{V(\Delta\hat H_j)} = \sqrt{R^2 s_j^2 \Big/ \sum (1/T - \langle 1/T\rangle)^2}.$$

If a linear regression appears to be justified from this plot, then proceed with Steps 4 and 5. If a nonlinear functionality is to be fitted, use an appropriate weighted nonlinear technique. The weighting factors are $s^2_{\Delta H_j} = V(\Delta\hat H_j)$ from above.
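4. Calculate the following maximum likelihood estimates using Lindley's solution (35).

$$\lambda = \left(\sum 1/T\right)^2 \Big/ \left(n\sum (1/T - \langle 1/T\rangle)^2\right)$$

$$S_{GG} = \sum \Delta\hat G_j^2/s_j^2 - \left(\sum \Delta\hat G_j/s_j^2\right)^2 \Big/ \sum 1/s_j^2$$

$$S_{HH} = \sum \Delta\hat H_j^2/s_j^2 - \left(\sum \Delta\hat H_j/s_j^2\right)^2 \Big/ \sum 1/s_j^2$$

$$S_{HG} = \sum \Delta\hat H_j\,\Delta\hat G_j/s_j^2 - \left(\sum \Delta\hat H_j/s_j^2\right)\left(\sum \Delta\hat G_j/s_j^2\right) \Big/ \sum 1/s_j^2$$

$$\Theta = (S_{HH} - \lambda S_{GG})/2S_{HG}$$

$$\hat\gamma = \Theta + \sqrt{\Theta^2 + \lambda}, \qquad \mathrm{sgn}\left(\sqrt{\Theta^2 + \lambda}\right) = \mathrm{sgn}(S_{HG})$$

$$\hat\phi = \tan^{-1}\left(\hat\gamma/\sqrt{\lambda}\right)$$

$$\hat a = \left(\sum \Delta\hat H_j/s_j^2 - \hat\gamma \sum \Delta\hat G_j/s_j^2\right) \Big/ \sum 1/s_j^2$$

$$\Delta\hat G_\beta = \hat a/(1 - \hat\gamma)$$

$$\hat\beta = T_{hm}/(1 - 1/\hat\gamma)$$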
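5. Finally, $(1-\alpha)100\%$ confidence intervals may be calculated from the following upper and lower bound estimates using Creasy's solution (39).

$$\phi_{U,L} = \hat\phi \pm \frac{1}{2}\sin^{-1}\left\{2\,t_{\alpha/2,\,m-2}\sqrt{\frac{\lambda\,(S_{HH}S_{GG} - S_{HG}^2)}{(m-2)\left[(S_{HH} - \lambda S_{GG})^2 + 4\lambda S_{HG}^2\right]}}\right\}$$

$$\gamma_{U,L} = \sqrt{\lambda}\,\tan\phi_{U,L}$$

The corresponding bounds on $a$, $\Delta G_\beta$ and $\beta$ follow by substituting $\gamma_U$ and $\gamma_L$ for $\hat\gamma$ in the expressions of Step 4:

$$a = \left(\sum \Delta\hat H_j/s_j^2 - \gamma \sum \Delta\hat G_j/s_j^2\right) \Big/ \sum 1/s_j^2, \qquad \Delta G_\beta = a/(1-\gamma), \qquad \beta = T_{hm}/(1 - 1/\gamma).$$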
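Steps 1, 2, and 4 of the algorithm above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' program: the function names, the kcal units for $R$, the divisor $n-2$ in the residual mean square, and the convention that the caller supplies the weights $1/s_j^2$ are all assumptions made here. Step 5 is omitted because it requires a Student $t$ quantile, which the Python standard library does not provide.

```python
import math

R = 1.987204e-3            # gas constant in kcal/(mol K); unit choice is an assumption
K_B = 1.380649e-23         # Boltzmann constant, J/K
H_PLANCK = 6.62607015e-34  # Planck constant, J s


def centered_fit(temps, ln_k):
    """Step 1: least squares of ln k on (1/T - <1/T>) for one series member."""
    n = len(temps)
    x = [1.0 / t for t in temps]
    mean_x = sum(x) / n
    d = [v - mean_x for v in x]
    sdd = sum(v * v for v in d)
    zeta1 = sum(ln_k) / n
    zeta2 = sum(y * v for y, v in zip(ln_k, d)) / sdd
    rss = sum((y - zeta1 - zeta2 * v) ** 2 for y, v in zip(ln_k, d))
    s2 = rss / (n - 2)                   # assumed divisor n - 2
    return zeta1, zeta2, s2, 1.0 / mean_x, sdd


def kinetic_estimates(zeta1, zeta2, t_hm):
    """Step 2 for kinetic data: (Delta G, Delta H), both evaluated at T_hm."""
    dg = (-R * t_hm * zeta1
          + R * t_hm * math.log(K_B * t_hm * math.e / H_PLANCK)
          - R * t_hm)
    dh = -R * zeta2 - R * t_hm
    return dg, dh


def lindley_fit(dg, dh, weights, lam, t_hm):
    """Step 4: returns (gamma, a, Delta G_beta, beta); weights play the role of 1/s_j^2."""
    sw = sum(weights)
    mg = sum(w * g for w, g in zip(weights, dg)) / sw
    mh = sum(w * h for w, h in zip(weights, dh)) / sw
    s_gg = sum(w * (g - mg) ** 2 for w, g in zip(weights, dg))
    s_hh = sum(w * (h - mh) ** 2 for w, h in zip(weights, dh))
    s_hg = sum(w * (h - mh) * (g - mg) for w, g, h in zip(weights, dg, dh))
    theta = (s_hh - lam * s_gg) / (2.0 * s_hg)
    # The square root takes the sign of S_HG for the maximum likelihood slope.
    gamma = theta + math.copysign(math.sqrt(theta ** 2 + lam), s_hg)
    a = mh - gamma * mg
    return gamma, a, a / (1.0 - gamma), t_hm / (1.0 - 1.0 / gamma)


# Hypothetical, exactly collinear estimates: Delta H = 4 * Delta G - 60.
dg_values = [18.0, 19.0, 20.0, 21.0]
dh_values = [12.0, 16.0, 20.0, 24.0]
gamma_hat, a_hat, dg_beta_hat, beta_hat = lindley_fit(
    dg_values, dh_values, [1.0] * 4, lam=1400.0, t_hm=300.0)
```

For the collinear example the fit returns the generating slope of 4 and intercept of -60, which correspond to $\Delta G_\beta = 20$ and $\beta = 400$ K at $T_{hm} = 300$ K. Note that the weighted sums are written in centered form, which is algebraically identical to the expressions in Step 4.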
This regression algorithm gives maximum likelihood estimates and their confidence intervals even though there is error in both variables, because an additional restraint is placed on the system: the ratio of variances of dependent to independent variables is a known constant, a function of the experimental temperatures. This restraint holds so long as each system is sampled at identical temperatures. If this ratio $\lambda$ becomes very large, the estimates will converge on the weighted least squares estimates.

An interesting sidelight is the minimum likelihood estimate, the "worst" value of a parameter given the data. This estimate is given by (35,36)

$$\gamma^* = \Theta + \sqrt{\Theta^2 + \lambda}, \qquad \mathrm{sgn}\left(\sqrt{\Theta^2 + \lambda}\right) = -\mathrm{sgn}(S_{HG})$$

$$\beta^* = T_{hm}/(1 - 1/\gamma^*)$$
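The two roots can be compared side by side. In the sketch below, the function names and the values of $S_{GG}$, $S_{HH}$, $S_{HG}$, $\lambda$, and $T_{hm}$ are hypothetical, picked so that the square root is exact; they are not taken from the chapter's data.

```python
import math


def lindley_slopes(s_gg, s_hh, s_hg, lam):
    """Return (maximum likelihood gamma, minimum likelihood gamma*)."""
    theta = (s_hh - lam * s_gg) / (2.0 * s_hg)
    root = math.copysign(math.sqrt(theta ** 2 + lam), s_hg)
    return theta + root, theta - root


def beta_from_gamma(gamma, t_hm):
    """beta = T_hm / (1 - 1 / gamma)."""
    return t_hm / (1.0 - 1.0 / gamma)


# Hypothetical sums of squares with lambda >> 1, as is usual in practice.
gamma_max, gamma_min = lindley_slopes(s_gg=5.0, s_hh=80.0, s_hg=20.0, lam=1400.0)
beta_hat = beta_from_gamma(gamma_max, 300.0)
beta_star = beta_from_gamma(gamma_min, 300.0)
```

For these inputs the maximum likelihood slope is 4 and the minimum likelihood slope is -350, so $\hat\beta = 400$ K while $\beta^*$ lies within about 1 K of $T_{hm} = 300$ K. Because the product of the two roots is $-\lambda$, a large $\lambda$ forces $|\gamma^*|$ to be large whenever $\hat\gamma$ is moderate, and $\beta^*$ is then pushed toward $T_{hm}$.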
Because of the high correlation between enthalpy-entropy estimates, the application of least squares to these enthalpy-entropy estimates will yield numerical values of the compensation temperature that are nearer the minimum likelihood value than the maximum likelihood value. Thus by misuse of least squares, a valuable statistical technique, the "worst" numerical value of a chemical parameter has usually been reported in the literature rather than the "best" numerical value.
Our analysis of 37 reported enthalpy-entropy compensations revealed that only three had compensation temperatures significantly different from the harmonic mean of the experimental temperatures by an analysis in the $\Delta H$-$\Delta S$ plane (26) and only 7 had detectable chemical compensations by an analysis in the $\Delta H$-$\Delta G_{T_{hm}}$ plane (27).

Application to Chemical Examples

To illustrate the necessity of the proper regression procedure and proper correlation analysis, we compare a data set that clearly has a linear chemical compensation with one that clearly does not show such an extrathermodynamic effect. The validity of such an effect for this second example has been debated many times in the literature (1,2,...). We find that data for the hydrolysis of ethyl benzoate (1) display a linear extrathermodynamic effect but data for the oximation of alkyl thymyl ketones (30) do not. The results of a comparative correlation analysis are listed in Table I. As expected from our previous arguments on the correlation coefficient, both data sets display significant correlations $r$ in $\Delta H$-$\Delta S$ coordinates that approximate the expected correlation coefficient $\rho$ due to the propagation of errors. Only the hydrolysis data have a significant ($\rho \ne 0$) estimated correlation coefficient in $\Delta H$-$\Delta G_{T_{hm}}$ coordinates, however. This finding indicates that the observed enthalpy-entropy correlation for the oximation data is a result of only the propagation of measurement errors.

Table I. Correlation Coefficients*

                                                         ΔH‡-ΔS‡            ΔH‡-ΔG‡(Thm)
Reaction                                                 ρ        r         ρ     r          m
1. Oximation of alkyl thymyl ketones (30)                0.9999   0.9724    0    -0.2273     7
2. Same as (1) but deleting the methylated compound      0.9999   0.9988    0     0.0770     6
3. Hydrolysis of ethyl benzoate (1)                      0.9988   0.9987    0     0.9929    12

*Reprinted with copyright permission by the Journal of Physical Chemistry (27).
The joint confidence regions $L(\psi \mid y, X)$ for the oximation and hydrolysis data are plotted in Figures 4 and 5 for comparison. The oximation data have much greater uncertainty than do the hydrolysis data. It is this greater uncertainty that is largely responsible for the apparent compensation in $\Delta H$-$\Delta S$ coordinates. In fact the ratio of major to minor axes for the oximation data in $\Delta H$-$\Delta S$ coordinates is $a/b = 23252$, causing the joint confidence regions to be well represented by the lines of their major axes.

The compensation temperature and other parameter estimates are compared in Table II for estimation by (a) the regression algorithm presented here, (b) weighted least squares of $\Delta\hat H$ on $\Delta\hat G_{T_{hm}}$ using $s^2_{\Delta H}$ as the weighting factors, and (c) least squares of $\Delta\hat H$ on $\Delta\hat S$. Because $\lambda \gg 1$ for both examples, the application of weighted least squares in the $\Delta H$-$\Delta G_{T_{hm}}$ plane (b) gave estimates close to the maximum likelihood values (a). Also for both examples the minimum likelihood value of the compensation temperature $\beta^*$ is near the harmonic mean of the experimental temperatures $T_{hm}$, as expected. For both examples the value of the compensation temperature as determined by least squares of $\Delta\hat H$ on $\Delta\hat S$ (c) was biased toward $\beta^*$ as expected, but for the oximation example the value of the compensation temperature as estimated by (c) was numerically much closer to the minimum likelihood estimate $\beta^*$ than to the maximum likelihood estimate $\hat\beta$.

That the confidence interval for $\beta$ should appear to exclude $\hat\beta$ when no chemical compensation is detected is illustrated in Figure 6. If the probability distribution for the diagnostic parameter $\gamma$ overlaps unity (recall $H_0$: $\gamma = 1$, where $\gamma = 1$ for no linear extrathermodynamic effect), the probability density for $\beta$ is thinly distributed over all possible numbers, such that the confidence interval for $\gamma$ traces to a confidence interval for $\beta$ that starts at a finite value, extends to infinity, returns from minus infinity, and finally returns to a finite value. The MLE of $\beta$ is then somewhere in that interval. When this happens $H_0$ cannot be rejected and the probability density of $\beta$ is distributed so thinly that the probability of detecting a compensation temperature in a reasonably finite interval is infinitesimal, and hence the probability that a linear extrathermodynamic effect is detected is essentially zero.
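The behavior of the interval for $\beta$ when the interval for $\gamma$ straddles unity can be reproduced with a short calculation. The sketch below uses the algebraically equivalent form $\beta = T_{hm}\gamma/(\gamma - 1)$; the function names, the string labels, and the numerical bounds on $\gamma$ are hypothetical and serve only as an illustration.

```python
def beta_from_gamma(gamma, t_hm):
    """beta = T_hm / (1 - 1/gamma), written as T_hm * gamma / (gamma - 1)."""
    return t_hm * gamma / (gamma - 1.0)


def beta_confidence_set(gamma_low, gamma_high, t_hm):
    """Map a confidence interval for gamma onto the corresponding set for beta.

    Assumes gamma_low < gamma_high and that neither bound equals 1 exactly.
    If the interval contains gamma = 1, the set for beta is unbounded:
    it runs from the first returned bound up to +infinity and then from
    -infinity up to the second returned bound.
    """
    b_low = beta_from_gamma(gamma_low, t_hm)
    b_high = beta_from_gamma(gamma_high, t_hm)
    if gamma_low < 1.0 < gamma_high:
        return "unbounded", b_high, b_low
    return "bounded", min(b_low, b_high), max(b_low, b_high)


# Hypothetical bounds: one interval excludes gamma = 1, the other straddles it.
detected = beta_confidence_set(3.0, 5.0, 300.0)
not_detected = beta_confidence_set(0.8, 1.2, 300.0)
```

With $T_{hm} = 300$ K the first case gives a finite interval from 375 K to 450 K. The second gives a set that starts at 1800 K, runs out to infinity, and returns from minus infinity to -1200 K, which is the pattern described in the text for a series with no detectable compensation.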
Figure 4. The 50% joint confidence regions for the oximation of alkyl thymyl ketones (30). The ΔH‡-ΔS‡ ellipses are so narrow that they appear as lines. Departure of the methylated compound from a common ΔG‡ value causes it to fall off the statistical compensation "line" between ΔH‡ and ΔS‡ estimates. Notice that ΔG‡ is estimated more precisely than ΔH‡. All values were calculated for T = T_hm = 308.1 K (25, 27). Axis labels: ΔS‡ (eu), ΔG‡. Source: Nature; Journal of Physical Chemistry.
Figure 5. Plots of thermodynamic parameter estimates and their respective 50% confidence regions for the hydrolysis of ethyl benzoate (1). The linear structure in the ΔH‡-ΔG‡ plot indicates that a linear chemical compensation is detected. The data are evaluated at T = T_hm = 292.6 K (27). Axis labels: ΔS‡ (eu), ΔG‡ (kcal). Source: Journal of Physical Chemistry.
Figure 6. Probability density functions for γ and β. The bold line is the function f(γ) through which the well-behaved probability density functions p(γ | k, T) are mapped into either well-behaved or skewed density functions p(β | k, T). The footnote "1" represents parameters obtained from the hydrolysis of ethyl benzoate (1), and "2" represents parameters obtained from the oximation of thymyl ketones (30). Reprinted with copyright permission by the Journal of Physical Chemistry (27).
Figure 7. Carbinol formation from Malachite dyes (41). The data are plotted along with 50% confidence ellipses or one standard deviation increments. The variances of the ΔG‡ estimates are too small to be noticed on this plot. A nonlinear functionality appears to be suggested by the data. The data are evaluated at T = T_hm = 297.7 K. Plot annotations: linear compensation line; possible nonlinear compensation line. Horizontal axis: ΔG‡ (kcal). Reprinted with copyright permission by the Journal of Physical Chemistry (27).

Finally, we show in Figure 7 that nonlinear relationships are easily visualized as well when the thermodynamic data are plotted in the $\Delta H$-$\Delta G_{T_{hm}}$ plane. If a preferred nonlinear function is to be fit to such data, weighted least squares may be used to obtain numerical parameter estimates when $\lambda \gg 1$, which is the usual case. Such structured nonlinear functions are virtually impossible to distinguish from random (unstructured) scatter in $\Delta H$-$\Delta S$ coordinates because of the dominant statistical compensation between parameter estimates in those coordinates.
Conclusions
The history of enthalpy-entropy compensation is one characterized by a strong misuse of fundamental statistical tools (the method of least squares and correlation analysis) to reach incorrect conclusions about the detection and validity of observed compensations between enthalpy and entropy estimates. Any chemical effect in the enthalpy-entropy plane is strongly confounded with the linear statistical compensation pattern due to the limited experimental temperature ranges suitable for most chemical investigations. Enthalpy and entropy estimates are still frequently plotted versus one another to display false correlations (compensations) for organic, biochemical, and heterogeneous catalytic reactions, however. Because both the relevant chemical identities and the choice of experimenta ... to a scientist prio ... analysis, should choose to perform his correlation analysis and regression in the ΔH-ΔG_Thm plane to detect the presence of both linear and nonlinear relationships between any thermodynamic variables for both kinetic and equilibrium data.
Literature Cited
1. Fairclough, R. A., Hinshelwood, C. N., J. Chem. Soc. (1937) 538.
2. Fairclough, R. A., Hinshelwood, C. N., J. Chem. Soc. (1937) 1537.
3. Raine, H. C., Hinshelwood, C. N., J. Chem. Soc. (1939) 1378.
4. Boudart, M., "Kinetics of Chemical Processes," pp 179-182, 194-198, Prentice-Hall, Englewood Cliffs, N.J., 1968.
5. Hammett, L. P., "Physical Organic Chemistry," 2nd Ed., pp 391-408, McGraw-Hill, New York, 1970.
6. Laidler, K. J., "Chemical Kinetics," 2nd Ed., pp 251-253, McGraw-Hill, New York, 1965.
7. Leffler, J. E., Grunwald, E., "Rates and Equilibria of Organic Reactions," pp 315-402, Wiley, New York, 1963.
8. Thomas, J. M., Thomas, W. J., "Introduction to the Principles of Heterogeneous Catalysis," pp 263-265, 386, 413, Academic Press, New York, 1967.
9. Leffler, J. E., J. Org. Chem. (1955) 20, 1202.
10. Blackadder, D. A., Hinshelwood, C., J. Chem. Soc. (1958) 2720.
11. Blackadder, D. A., Hinshelwood, C., J. Chem. Soc. (1958) 2728.
12. Petersen, R. C., Markgraf, J. H., Ross, D. S., J. Amer. Chem. Soc. (1961) 83, 3819.
13. Brown, R. F., Newsom, H. C., J. Org. Chem. (1962) 27, 3010.
14. Brown, R. F., J. Org. Chem. (1962) 27, 3015.
15. Petersen, R. C., J. Org. Chem. (1964) 3133.
16. Leffler, J. E., Nature (1965) 205, 1101.
17. Exner, O., Nature (1964) 201, 488.
18. Exner, O., Coll. Czech. Chem. Comm. (1964) 29, 1094.
19. Exner, O., Nature (1970) 277, 366.
20. Exner, O., Coll. Czech. Chem. Comm. (1972) 37, 1425.
21. Wold, S., Chem. Scr. (1972) 2, 145.
22. Wold, S., Exner, O., Chem. Scr. (1973) 3, 5.
23. Wold, S., Chem. Scr. (1974) 5, 97.
24. Exner, O., Coll. Czech. Chem. Comm. (1975) 40, 2762.
25. Krug, R. R., Hunter, W. G., Grieger, R. A., Nature (1976) 261, 566.
26. Krug, R. R., Hunter, W. G., Grieger, R. A., J. Phys. Chem. (1976) 80, 2335.
27. Krug, R. R., Hunter, W. G., Grieger, R. A., J. Phys. Chem. (1976) 80, 2341.
28. Laidler, K. J., Trans. Faraday Soc. (1959) 55, 1725.
29. Ritchie, C. D., Sager, W., Prog. Phys. Org. Chem. (1964) 2, 323.
30. Craft, M. J., Lester, C. T., J. Amer. Chem. Soc. (1951) 73, 1127.
31. Garn, P. D., J. Therm. Anal. (1975) 7, 475.
32. Good, W., Ingham, D. B., Electrochem. Acta (1975) 20, 57.
33. Gorbacher, V. M., J. Therm. Anal. (1975) 8, 585.
34. Gorbacher, V. M., Izv. Sib. Otd. Akad. Nauk. SSSR, Ser. Khim. Nauk. (1975) 5, 164.
35. Lindley, D. V., J. Roy. Stat. Soc. Suppl. (1947) 9, 218.
36. Madansky, A., J. Amer. Stat. Assn. (1959) 54, 1973.
37. Cochran, W. G., Technometrics (1968) 10, 637.
38. Davies, O. L., Goldsmith, P. L., "Statistical Methods in Research and Production," pp 208-210, Hafner, New York, 1972.
39. Creasy, M. A., J. Roy. Stat. Soc. B (1956) 18, 65.
40. David, F. N., "Tables of the Correlation Coefficient," Cambridge University Press, Cambridge, England, 1954.
41. Idlis, G. S., Ginzburg, O. F., Reakc. Sposobnost Org. Sojedin. (1965) 2, 54.
11
How to Avoid Lying with Statistics
ALLAN E. AMES and GEZA SZONYI
Polaroid Corp., 750 Main St., Cambridge, MA 02139

I. Introduction
Statistical analysis of randomly varying data has become commonplace in the age of calculators and computers. Such analyses are often carried out routinely using built-in calculator functions or standard computer subroutines. Parameters derived in such computations usually include the average (mean) and the standard deviation, characteristic of the central tendency and the variability of individual data sets. To compare two data sets with each other, the differences between their averages and the ratio of their variances are used customarily.
Little thought is usually given to the fact that the computation of the above parameters presupposes that the data analyzed are essentially normally distributed and that this distribution is monomodal, i.e., showing essentially one major peak or central value only. If this is not the case, and standard (parametric) methods are applied for the evaluation of such data, the results obtained will represent an incorrect picture. In other words, the evaluator is "lying with statistics", usually without being aware he is doing so (1).
To make matters worse, nearly all built-in programs for calculators, as well as the majority of the subroutines and programs for computers, are of the parametric type. Nowhere in the instruction manuals is the fact adequately stressed that the usage of parametric methods presupposes a known, usually normal, distribution which must be ascertained before these methods are applicable.
This paper will show that for many applications the use of distribution-free (nonparametric) methods may be preferable and easier to use than the conventional parametric approach (1). (Throughout this paper, the term "parametric tests" pertains to the t-Test and the F-Test, used with the appropriate tables (2, 3, 4, 5, 6, 8, 9).) We will also provide a simple method for testing the normality of the distribution of data sets. Examples will be given to demonstrate the penalties which can be incurred when parametric methods are used for the analysis of nonparametric data.

II. Methodology
Normally distributed data are often represented graphically by the well-known, bell-shaped Gaussian curves (10, 11, 12, 13 ... the data in questio ... order and then grouped into classes. The number of items in each class, i.e., the class frequencies, are then plotted against class midpoints in such standard frequency plots. Another method of representation for the same data is the use of cumulative frequency plots. Cumulative frequencies are obtained by adding ("cumulating") for each class all the class frequencies up to that point, divided by the total number of data in the data set (10, 15), as shown in Figure 1.
This paper deals with two applications of the cumulative distribution. In one case, a known continuous distribution (the normal distribution) is compared to an unknown discrete distribution in order to determine its normality or deviation from it. In another case, two unknown cumulative discrete distributions are compared to each other to determine their sameness or difference. For both cases the same test statistic is applicable (16, 17, 18):

$$T = \sup_x \left| F(x) - S(x) \right|$$

where: F(x) is the cumulative distribution function of an unknown distribution; S(x) is the cumulative distribution function of either a known (e.g. normal) or an unknown distribution; and, sup = supremum
signifies maximum, and together with the two vertical lines symbolizes that one should take the maximum difference enclosed by these lines. Thus, T is the maximum vertical distance between the two cumulative distribution graphs. This computed T, or a similarly derived value, is subsequently compared to appropriate tabulated values at selected confidence levels. The two distributions are regarded as being different if this T is greater than the corresponding tabulated value; otherwise, the opposite holds true.
A number of methods have been described in the literature to test whether a given data set can be considered to be essentially normally distributed or not. Table 1 summarizes the most important of these methods. As seen from this table, all methods are based on the use of cumulative distribution functions, except the Shapiro-Wil ... uses the normal cumulative distribution to compute theoretical frequencies (19).) There is no overall agreement, based on the literature surveyed, as to which method is "best" for all possible empirical distributions (20, 21, 22). Comparison of the various normality tests has led to the result that in some cases the same data are considered normal by some tests and not normal by others (22). However, the Chi Square test is generally regarded to be inferior to all other tests (20, 21, 22).
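The decision rule described above (compare the computed T with tabulated values at several confidence levels) can be sketched in a few lines of Python. This is only an illustration: the function name `highest_significant_level` is hypothetical, and the tabulated values are simply those printed in Tables 2 and 6 of this chapter for n = 11 and n = 20.

```python
def highest_significant_level(t_stat, table):
    """Return the highest confidence level (%) whose tabulated value is
    exceeded by t_stat, or None if t_stat exceeds none of them."""
    exceeded = [level for level, critical in table.items() if t_stat > critical]
    return max(exceeded) if exceeded else None


# Tabulated Lilliefors values as printed in Table 2 (n = 11) and Table 6 (n = 20).
lilliefors_n11 = {80: 0.206, 85: 0.217, 90: 0.230, 95: 0.249, 99: 0.284}
lilliefors_n20 = {80: 0.160, 85: 0.166, 90: 0.174, 95: 0.190, 99: 0.231}

# Table 2: BIG = 0.1683 exceeds no tabulated value, so it is not significant.
print(highest_significant_level(0.1683, lilliefors_n11))
# Table 6: 0.2793 exceeds every tabulated value, including the 99% one.
print(highest_significant_level(0.2793, lilliefors_n20))
```

A result of None corresponds to "not significant at any tabulated level", which is how Table 2 is read in the text.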
The extent to which parametric methods can be used for borderline normality cases has not been described in the literature surveyed. This area needs substantial exploration and, at this stage, informed intuition is as good a guide as any.
This paper deals with a simple test for normality, the Lilliefors Test (23, 24), and not with the question of what normality test to select. We have found this test satisfactory to help us to decide when to use and when not to use parametric statistics in data analysis.
To analyze nonnormally distributed data, we have been using the Kolmogorov-Smirnov Two-Sample Test (16, 17, 18), which compares two data sets of unknown distribution for their sameness or difference using the cumulative distribution approach. The principle of these two tests is presented graphically in Figure 2.
The Lilliefors Test of Normality. The computational procedure for the Lilliefors Test involves the following steps (16, 17, 18):
a) Arrange the data set to be tested in ascending order.
b) Calculate the following quantities:
The mean:

$$\bar{x} = \frac{\sum x_i f_i}{N}$$

The standard deviation:

$$S = \sqrt{\frac{\sum (x_i - \bar{x})^2 f_i}{N - 1}} = \sqrt{\frac{\sum x_i^2 f_i - \left(\sum x_i f_i\right)^2 / N}{N - 1}}$$

The Z-statistic for each distinct data value (Z(I)):

$$Z_i = \frac{x_i - \bar{x}}{S}$$

where:
x_i = individual datum, or class midpoint of data arranged by class frequencies.
f_i = frequency of datum or data class
N = total number of data in the data set
c) For each datum (X(I)), or data class within the data set, compute its cumulative frequency (CUM) with the formula:

$$\mathrm{CUM}(I) = \frac{i - 1}{N}$$

where: i = the rank order of the datum, or its frequency, or the frequency of the data class.
This means that the first entry of the data set is omitted, but cumulative frequencies are computed for all other data values, i.e., from (i - 1) to N. The cumulative frequencies thus generated represent the unknown distribution function: F(x_i).
d) Using the Z_i-values computed for every datum or data class, determine the standardized cumulative normal distribution value (XNORM) corresponding to it. These standardized normal values can be obtained from tables of the Normal Probability Functions (25), using the IBM Scientific Subroutine NDTR (26), or writing your own subroutine utilizing the error function (27). These values represent the cumulative normal distribution function: S(x_i).
e) Compute the absolute difference (DIF) between CUM and XNORM, and determine the largest value (BIG) among these values ... statistics for the Lilliefor ...
f) By using the Lilliefors Table (22, 24, 28), determine whether this computed T is statistically significant or not, at the selected confidence level. If T is not significant, the data analyzed can be regarded as being essentially normally distributed, so that both parametric and nonparametric methods can probably be used for the analysis of the data set in question. If T is significant, only nonparametric methods should be used.
A simple illustration of this procedure is presented in Table 2, using previously published data (22).
The procedure described is applicable to all types of data sets, i.e., either sets of individual data, or sets of data specified through frequencies within defined classes. Steps a), b) and d) of this procedure are common to all normality tests given in Table 1, with the exception of the Chi Square and the Shapiro-Wilk methods, i.e., to all Empirical Distribution Function (EDF) statistics which use cumulative frequencies (22). Having done these basic calculations, all EDF based tests can be carried out easily with a computer or calculator.
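As an illustration of how little extra work the other EDF statistics require once the S(x_i) values are available, the following Python sketch computes several of them from a sorted list of cumulative normal values. The function name `edf_statistics` is hypothetical, and the formulas are the standard textbook forms of these statistics, which is an assumption insofar as they may differ in notation from the forms printed in Table 1; the correction factors and special tables mentioned there are not included.

```python
import math


def edf_statistics(s):
    """s: the values S(x_i) for data sorted in ascending order, each strictly
    between 0 and 1. Returns a dict of unmodified EDF statistics."""
    n = len(s)
    # One-sided Kolmogorov distances (i runs from 1 to n in the usual notation).
    d_plus = max((i + 1) / n - s[i] for i in range(n))
    d_minus = max(s[i] - i / n for i in range(n))
    d = max(d_plus, d_minus)          # Kolmogorov D
    v = d_plus + d_minus              # Kuiper V
    # Cramer-von Mises W^2
    w2 = sum((s[i] - (2 * i + 1) / (2 * n)) ** 2 for i in range(n)) + 1 / (12 * n)
    # Watson U^2
    s_bar = sum(s) / n
    u2 = w2 - n * (s_bar - 0.5) ** 2
    # Anderson-Darling A^2
    a2 = -n - sum(
        (2 * i + 1) * (math.log(s[i]) + math.log(1 - s[n - 1 - i]))
        for i in range(n)
    ) / n
    return {"D": d, "V": v, "W2": w2, "U2": u2, "A2": a2}


# Hypothetical input: four S(x_i) values lying exactly at the class midpoints.
print(edf_statistics([0.125, 0.375, 0.625, 0.875]))
```

Each statistic would still have to be modified and compared with its own table, as noted in the remarks of Table 1.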
Table 1
Principal Methods for Normality Testing

Test Method: Chi Square
Computing Formula: $\chi^2 = \sum_i (O_i - T_i)^2 / T_i$
Symbols Used: O_i = observed class frequency; T_i = theoretical class frequency
Remarks: The data are arranged in descending order and then subdivided into arbitrary classes, each containing at least 5 numbers. The "observed frequencies" are the number of data in each class. The "theoretical frequencies" are those based on the cumulative normal distribution function. The computed χ² is evaluated using a χ²-Table at k-3 degrees of freedom, where: k = number of classes. This method requires a relatively large sample size (at least 15-30) and is considered a test of low power.
References: 19, 23, 24

Test Method: Lilliefors
Computing Formula: $DT = \max_i \left| S(x_i) - (i-1)/n \right|$
Symbols Used: i = sequential data points or frequencies; (i-1)/n = computes the cumulative distribution values of the data set; S(x_i) = corresponding normal cumulative distribution values
Remarks: The data are arranged in ascending order and the DT values computed by the formula given, i.e., not considering the first data cell. The largest value obtained is evaluated using the Lilliefors Table. This method can be used with data sets given in terms of frequencies or individual data. Sample sizes as low as 4 can be tested. This method will fail if the first data cell contains a disproportionately large number of data compared to the rest of the set.
References: 23, 24

Test Method: Kolmogorov (Kolmogorov One-Sample Statistics)
Computing Formula: $D = \max_i \left| S(x_i) - i/n \right|$
Symbols Used: Symbols as given, and: i/n = computes the cumulative distribution of the data set
Remarks: The data are arranged in ascending order and the D_i-values computed, as given by the formula. The largest D_i-value obtained is evaluated using the Kolmogorov Table (two-sided). The method is usable with sample sizes as low as 2, with data given individually or as frequency distributions. This method can supplement the Lilliefors Test if it is suspected that it may fail. A modified version of this test multiplies the largest D_i-value by a correction factor and evaluates this modified value using a special table (22, 46).
References: 22, 46, 47, 48, 49, 50

Test Method: Kolmogorov D (Modified Kolmogorov, E)
Computing Formula: $D = \max(D^+, D^-)$
Symbols Used: Symbols as given
Remarks: The largest D⁺ or D⁻ value, calculated by the above formulas, is chosen. This value is multiplied by a correction factor and evaluated using a special table (22, 46). Usable for individual data or those given as frequency distributions.
References: 22, 46

Test Method: Kuiper
Computing Formula: $V = D^+ + D^-$
Symbols Used: Symbols as given
Remarks: The sum of the largest D⁺ and D⁻ is calculated by the formulas given. This sum is multiplied by a correction factor and evaluated using a special table (22, 46). Usable for individual data or those given as frequency distributions.
References: 22, 46

Test Method: Cramer-von Mises (Cramer-Smirnov)
Computing Formula: $W^2 = \sum_{i=1}^{n} \left[ S(x_i) - (2i-1)/(2n) \right]^2 + 1/(12n)$
Symbols Used: Symbols as given, and: (2i-1)/(2n) = computes the cumulative distribution values of the data set
Remarks: The data are arranged in ascending order. The W² computed is evaluated using appropriate tables. From this W², a modified W² can be calculated, and this is evaluated using a special table (22, 46). With the standard table, sample sizes as low as 2 can be evaluated. Usable for individual data or those given as frequency distributions.
References: 22, 46, 47, 48, 49, 50, 51

Test Method: Watson
Computing Formula: $U^2 = W^2 - n\left(\bar{S}(x) - 1/2\right)^2$
Symbols Used: Symbols as given, and: $\bar{S}(x) = \sum S(x_i)/n$
Remarks: This is a modification of the Cramer-von Mises method. The computed U² is modified and this modified U² is then evaluated using a special table (22, 46). Usable for individual data or those given as frequency distributions.
References: 22, 46

Test Method: Anderson-Darling
Computing Formula: $A^2 = -\left\{ \sum_{i=1}^{n} (2i-1) \left[ \ln S(x_i) + \ln\left(1 - S(x_{n+1-i})\right) \right] \right\} / n - n$
Symbols Used: Symbols as given
Remarks: The data are arranged in ascending order. The A² computed is modified and this modified value is evaluated using a special table (22, 46). Usable for individual data or those given as frequency distributions.
References: 22, 46

Test Method: Shapiro-Wilk
Computing Formula: $W = \left[ \sum_{i=1}^{K} a_{n-i+1} \left( x_{n-i+1} - x_i \right) \right]^2 / \sum_{i=1}^{n} (x_i - \bar{x})^2$, with n = 2K for an even number of data and n = 2K+1 for an odd number of data
Symbols Used: x_i = data; $\bar{x} = \sum x_i / n$; a = special coefficients
Remarks: The data are arranged in ascending order. The special coefficients needed for the calculation and the table needed to evaluate the W computed are only accessible via the reference listed. Samples as low as 3 can be evaluated.
References: 32, 33, 34

NOTE: The various tests for normality have been recently reviewed by Shapiro et al. (32, 33, 34), Schuster (35), Stephens (22), Dyer (20), Shapiro and Francis (21), Klimko and Antle (36) and Govindarajulu (37).
Table 2
ILLUSTRATION OF THE APPLICATION OF THE LILLIEFORS TEST

 J    X(J)    Z(J)       CUM       XNORM     DIF
 1    148    -0.9618
 2    154    -0.7214    0.0909    0.2358    0.1448
 3    158    -0.561
 4    160    -0.4809    0.2727    0.3156    0.0429
 5    161    -0.4408    0.3636    0.3300    0.0336
 6    162    -0.4008    0.4545    0.3446    0.1100
 7    166    -0.2405    0.5454    0.4052    0.1402
 8    170    -0.0802    0.6363    0.4681    0.1683
 9    182     0.4008    0.7272    0.6554    0.0719
10    195     0.9218    0.8181    0.8212    0.0030
11    236     2.6451    0.9090    0.9960    0.0869

BIG = 0.1683
Mean X = 172.
Standard deviation = 24.95195

Lilliefors Table Values
Confidence level (%):   80      85      90      95      99
Tabulated value:        0.206   0.217   0.230   0.249   0.284

Since BIG is smaller than all the tabulated values, it is regarded as being statistically not significant and the data set to be essentially normally distributed.
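The Lilliefors procedure of steps a) to e) can be sketched as a short Python routine and checked against the data of Table 2. This is a minimal sketch, not the authors' program: the names `normal_cdf` and `lilliefors_statistic` are hypothetical, the error function is used for the cumulative normal values (one of the options mentioned in step d), every datum is assumed to have frequency 1, and the first entry is included with CUM = 0, as it is listed in Tables 5 and 6.

```python
import math


def normal_cdf(z):
    """Standardized cumulative normal distribution via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def lilliefors_statistic(data):
    """Return (mean, standard deviation, BIG) for individual data values."""
    x = sorted(data)                                          # step a)
    n = len(x)
    mean = sum(x) / n                                         # step b)
    s = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
    big = 0.0
    for rank, v in enumerate(x, start=1):
        z = (v - mean) / s                                    # Z(I)
        cum = (rank - 1) / n                                  # step c): CUM(I)
        xnorm = normal_cdf(z)                                 # step d): XNORM
        big = max(big, abs(cum - xnorm))                      # step e): DIF, BIG
    return mean, s, big


# X(J) values of Table 2.
table2 = [148, 154, 158, 160, 161, 162, 166, 170, 182, 195, 236]
mean, s, big = lilliefors_statistic(table2)
print(round(mean, 3), round(s, 5), round(big, 4))
```

For these data the routine reproduces the mean, the standard deviation and the BIG value printed in Table 2 to the precision shown there. Tied values and class frequencies would need additional handling that is not attempted here.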
The Kolmogorov-Smirnov Test: Nonparametric Testing of Two Distributions for Differences or Similarity. The Kolmogorov-Smirnov Two-Sample Test, also called "The Two-Sample Smirnov Test", is used when two data sets are being compared in order to determine whether the two sets belong to the same population or not, regardless of their underlying distributions (16, 17, 18). This is the same as asking whether two data sets are in some way different from each other. To this end, cumulative distributions are calculated for each set, as well as the differences between the cumulative values. The largest absolute difference thus obtained is then compared to appropriate tabulated value ... t th ... desired confidenc ... level. If this larges ... significant, the tw ..., otherwise they are the same.
The computational procedure for the Two-Sample Smirnov Test is the following (16, 17, 18):
a) Sort both data sets in ascending order. The two sets are designated as vectors X(I) and Y(J), where:

$$X(I) = X_1, X_2, X_3, \ldots, X_n$$
$$Y(J) = Y_1, Y_2, Y_3, \ldots, Y_m$$

and n and m are the lengths of the vectors, i.e., the number of data in X and Y, respectively.
b) Form two new vectors, XX and YY, both of length n + m. The XX vector is the augmented X vector containing zeros wherever there is a number in the ordered Y vector. Similarly, the YY vector is the augmented Y vector containing zeros wherever there is a number in the ordered X vector.
c) Compute the cumulative frequencies for both sets by

$$\frac{i}{n}, \quad i = 1, \ldots, n \qquad \text{and} \qquad \frac{j}{m}, \quad j = 1, \ldots, m$$
for each X and Y value. These cumulative frequencies are inserted into the cumulative distribution vectors X(I)C and Y(J)C at points corresponding to the location of data entries in the augmented X and Y vectors. X(I)C and Y(J)C are representative of the two unknown distribution functions: F(x_i) and S(x_i).
d) The difference (DIF) between X(I)C and Y(J)C is computed for each value of the augmented vectors. The largest positive or negative value is determined. This represents T, the test statistic.
e) By using Tables of the Two-Sample Smirnov Test statistic (29, 30), determine whether this computed T is statistically significant or not at the confidence level selected ... the two data sets ar ... regarde ... belonging to the same population, otherwise they are not. (Different tables should be used for data sets of equal and unequal length.)
A simple illustration of this procedure is presented in Table 3, using previously published data (31).
Although the main application of the Smirnov Test is its use of comparing two nonnormally distributed data sets, it can be used to evaluate normally distributed data sets as well. When used on normally distributed data, differences between the two sets will be detected, regardless of whether these arise from differences in the means or differences in the variances. Furthermore, the Smirnov Test may detect differences between normally distributed data sets when parametric methods do not allow clear-cut decisions.
Table 3
ILLUSTRATION OF THE APPLICATION OF THE TWO-SAMPLE SMIRNOV TEST

Input data (M = 15; "..." marks an entry that is not legible in the source)
X(I), I = 1 to 9:    7.6  8.4  8.6  8.7  9.3  9.9  10.1  10.6  11.2
Y(J), J = 1 to 15:   5.2  5.7  5.9  6.5  6.8  ...  9.1  9.8  10.8  11.3  11.5  12.3  12.5  13.4  14.6

 XX      X(I)C     YY      Y(J)C     DIF
 0       0         5.2     0.067    -0.067
 0       0         5.7     0.133    -0.133
 0       0         5.9     0.200    -0.200
 0       0         6.5     0.267    -0.267
 0       0         6.8     ...      -0.3333
 7.6     ...       0       ...       ...
 ...     ...       ...     ...       ...
 8.4     0.2222    0       0.4000   -0.1778
 8.6     0.3333    0       0.4000   -0.067
 8.7     0.4444    0       0.4000    0.04444
 0       0.4444    9.1     0.467    -0.0226
 9.3     0.5555    0       0.467     0.0885
 0       0.5555    9.8     0.5337    0.0218
 9.9     0.6666    0       0.5337    0.1329
10.1     0.7777    0       0.5337    0.244
10.6     0.8888    0       0.5337    0.3551
 0       0.8888   10.8     0.6004    0.2884
11.2     0.9999    0       0.6004    0.3995
 0       0.9999   11.3     0.6671    0.3328
 0       0.9999   11.5     0.7338    0.2661
 0       0.9999   12.3     0.8005    0.1994
 0       0.9999   12.5     0.8672    0.1327
 0       0.9999   13.4     0.9339    0.066
 0       0.9999   14.6     1.000     0

Smallest Difference: DIF = -0.3333 at Y(5) = 6.8
Largest Difference: DIF = 0.3995 at X(9) = 11.2

Two-Sample Smirnov Test Table Values
Confidence level (%):   90      95       99
Tabulated value:        ...     0.5333   0.64444

Based on the tabulated values, the two data sets belong essentially to the same population.
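The two-sample procedure can be sketched in Python and checked against Table 3. This is an illustrative sketch only: the function name `smirnov_two_sample` is hypothetical, and instead of building the zero-padded XX and YY vectors it evaluates both cumulative distributions at every pooled data value, which yields the same differences. The sixth Y value is not legible in the table as reproduced here; the value 8.2 used below is an assumption made purely so that the example runs (any value between 7.6 and 8.4 leaves the smallest and largest differences unchanged).

```python
from bisect import bisect_right


def smirnov_two_sample(x, y):
    """Return (T, (value, smallest DIF), (value, largest DIF)), where
    DIF = X(I)C - Y(J)C evaluated at each pooled data value."""
    xs, ys = sorted(x), sorted(y)                 # step a)
    n, m = len(xs), len(ys)
    diffs = []
    for v in sorted(set(xs) | set(ys)):           # pooled values, steps b) and c)
        x_cum = bisect_right(xs, v) / n           # fraction of X values <= v
        y_cum = bisect_right(ys, v) / m           # fraction of Y values <= v
        diffs.append((v, x_cum - y_cum))          # step d): DIF
    smallest = min(diffs, key=lambda pair: pair[1])
    largest = max(diffs, key=lambda pair: pair[1])
    t_stat = max(abs(d) for _, d in diffs)
    return t_stat, smallest, largest


x_data = [7.6, 8.4, 8.6, 8.7, 9.3, 9.9, 10.1, 10.6, 11.2]
# 8.2 is an assumed stand-in for the illegible sixth Y value.
y_data = [5.2, 5.7, 5.9, 6.5, 6.8, 8.2, 9.1, 9.8,
          10.8, 11.3, 11.5, 12.3, 12.5, 13.4, 14.6]

t_stat, smallest, largest = smirnov_two_sample(x_data, y_data)
print(round(t_stat, 4), smallest, largest)
```

With exact fractions the largest difference is 0.4 at X(9) = 11.2 and the smallest is -0.3333 at Y(5) = 6.8; Table 3 prints 0.3995 because its cumulative frequencies are accumulated from rounded increments.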
Figure 3. Two bi-modal distributions
III. Examples of How to Avoid Lying with Statistics
Let us now give some examples to demonstrate what happens when parametric statistics are used incorrectly.
A Monte Carlo Simulation was used to generate the two distributions shown graphically in Figure 3. The corresponding data sets and the results of the parametric t-Test and F-Test (£, 9) are shown in Table 4. These tests lead to the surprising conclusion that the two distributions are indistinguishable. If, however, the Two-Sample Smirnov Test is applied to the same data, the two data sets clearly differ from one another, as one would expect from Figure 3. The two-sided test statistic is 0.5, which is significant at the 99th percentile.
This contradiction between the results of the parametric and nonparametric tests indicates nonnormality of one or both data sets. Indeed, when the Lilliefors test is performed on these data, as shown in Tables 5 and 6, both sets show nonnormality. This means that only nonparametric methods are useful for comparing the data sets in question. To reassert the point made at the beginning
Table 4
PARAMETRIC EVALUATION OF TWO DATA SETS GENERATED BY MONTE CARLO SIMULATION

Data Set 1       Data Set 2
 -50.06828       -157.53665
 -50.05978       -105.65264
 -50.04611       -104.06236
 -50.03810        -76.14665
 -50.02966        -76.01930
 -50.02777        -72.76301
 -49.99072        -65.85975
 -49.94664        -57.93762
 -49.94615        -55.18869
 -49.85856        -54.95785
 -23.75126         30.14344
  -5.20842         49.94298
  19.62010         49.95922
  19.66091         49.97837
  25.08628         49.98032
  26.31711         49.98634
  35.13998         50.00938
  61.47794         50.02284
  66.56312         50.04142
  78.51408         50.07682

Mean:                 -9.82857        -17.29912
Variance (s²):      2176.994635      4992.86817
Std. deviation (s):   46.658271        70.660232

Computed t: 0.39450
Tabulated t-values:
t(38, 90)     1.3042
t(38, 95)     1.6860
t(38, 97.5)   2.0244
t(38, 99)     2.4286

Computed F: 0.43602
Tabulated F-values:
F(19, 19, 90)     1.8226
F(19, 19, 95)     2.1686
F(19, 19, 97.5)   2.5270
F(19, 19, 99)     3.0282
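The computed t and F values of Table 4 can be reproduced from the summary statistics alone. The Python sketch below assumes the usual pooled-variance two-sample t statistic and a simple variance ratio; the chapter does not spell out which formulas were used, so this is an assumption, and the function name `pooled_t_and_f` is hypothetical.

```python
import math


def pooled_t_and_f(mean1, var1, n1, mean2, var2, n2):
    """Pooled-variance two-sample t statistic, variance ratio F, and the
    degrees of freedom of the t statistic."""
    dof = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / dof
    t = (mean1 - mean2) / math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    f = var1 / var2
    return t, f, dof


# Means and variances as printed in Table 4; each data set has 20 values.
t, f, dof = pooled_t_and_f(-9.82857, 2176.994635, 20, -17.29912, 4992.86817, 20)
print(round(t, 4), round(f, 5), dof)
```

Both values fall well below every tabulated value in Table 4, which is why the parametric tests fail to distinguish the two data sets.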
Table 5
LILLIEFORS TEST OF DATA SET 1

 J     X(J)         Z(J)        XNORM      CUM        DIF
 1    -50.06828    -0.86241    0.19423    0.00000    0.19423
 2    -50.05978    -0.86223    0.19428    0.05000    0.14428
 3    -50.04611    -0.86193    0.19436    0.10000    0.09436
 4    -50.03810
 5    -50.02966    -0.86138    0.19445    0.20000
 6    -50.02777    -0.86154    0.19446    0.25000    0.05553
 7    -49.99072    -0.86075    0.19468    0.30000    0.10531
 8    -49.94664    -0.85980    0.19494    0.35000    0.15503
 9    -49.94615    -0.85979    0.19495    0.40000    0.20504
10    -49.85856    -0.85791    0.19546    0.44999    0.25453
11    -23.75126    -0.29837    0.38270    0.50000    0.11729
12     -5.20842     0.09904    0.53944    0.55000    0.01055
13     19.62010     0.63117    0.73603    0.60000    0.13603
14     19.66091     0.63205    0.73632    0.64999    0.08632
15     25.08628     0.74833    0.77287    0.70000    0.07287
16     26.31711     0.77471    0.78074    0.75000    0.03074
17     35.13998     0.96380    0.83242    0.80000    0.03242
18     61.47794     1.52829    0.93678    0.85000    0.08678
19     66.56312     1.63728    0.94921    0.89999    0.04921
20     78.51408     1.89341    0.97084    0.95000    0.02084

LILLIEFORS TEST OF SET 1 IS 0.2545.

Lilliefors Table Values
Confidence level (%):   80      85      90      95      99
Tabulated value:        0.160   0.166   0.174   0.190   0.231
Table 6
LILLIEFORS TEST OF DATA SET 2

 J     X(J)          Z(J)        XNORM      CUM        DIF
 1    -157.53665    -1.98467    0.02359    0.00000    0.02359
 2    -105.65264    -1.25039    0.10557    0.05000    0.05557
 3    -104.06236    -1.22789    0.10974    0.10000    0.00974
 4     -76.14665    -0.83282    0.20247    0.15000    0.05247
 5     -76.01930
 6     -72.76301    -0.78493    0.21624    0.25000    0.03376
 7     -65.85975    -0.68724    0.24596    0.30000    0.05405
 8     -57.93762    -0.57512    0.28260    0.35000    0.06739
 9     -55.18869    -0.53622    0.29590    0.40000    0.10409
10     -54.95785    -0.53295    0.29703    0.44999    0.15296
11      30.14344     0.67141    0.74902    0.50000    0.24902
12      49.94298     0.95162    0.82935    0.55000    0.27935
13      49.95922     0.95185    0.82941    0.60000    0.22941
14      49.97837     0.95212    0.82948    0.64999    0.17948
15      49.98032     0.95215    0.82949    0.70000    0.12949
16      49.98634     0.95223    0.82951    0.75000    0.07951
17      50.00938     0.95256    0.82959    0.80000    0.02959
18      50.02284     0.95275    0.82964    0.85000    0.02035
19      50.04142     0.95301    0.82970    0.89999    0.07029
20      50.07682     0.95352    0.82983    0.95000    0.12016

LILLIEFORS TEST OF SET 2 IS 0.2793

Lilliefors Table Values
Confidence level (%):   80      85      90      95      99
Tabulated value:        0.160   0.166   0.174   0.190   0.231
of this paper: the application of parametric methods to nonnormally distributed data may lead to drawing the wrong conclusions.

Our interest in nonparametric methods was caused by some production problems. Here it was important to decide whether a certain additive did or did not improve the quality of a certain product. The data, together with the parametric tests performed, are shown in Table 7. Based on these tests, the particular additive apparently did not improve product quality. Since physically this did not make too much sense, the same data were reevaluated by subjecting them to the Two-Sample Smirnov Test. The highest computed value thus obtained was 0.319, as opposed to the tabulated values of 0.21496 (80), 0.24466 (90), 0.271531 (95), 0.30406 (98) and 0.32527 (99). Based on this test, the two dat[...] the additive in question[...] Subjecting the data sets to the Lilliefors Test showed that both sets were nonnormally distributed, giving computed values of 0.31814 and 0.30726, respectively, as opposed to tabulated values of 0.11384 (90), 0.1253 (95) and 0.14581 (99). Thus, the parametric tests are not applicable. What is particularly striking in this case is the large sample size, which was taken to ensure greater confidence in the results obtained, but which only reinforced an incorrect conclusion.

This example illustrates some important points:
1. When physical reality apparently contradicts the statistical results obtained, it is incumbent on the applied statistician to reevaluate both his data and his approach.
2. Although a large sample size is generally desirable, it is no insurance for obtaining correct answers.
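
The Two-Sample Smirnov statistic used above is the largest vertical distance between the two empirical distribution functions. The sketch below is only an illustration under that standard definition; the function name is hypothetical and the authors' own program is not reproduced here.

```python
from bisect import bisect_right


def smirnov_two_sample(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a_sorted = sorted(sample_a)
    b_sorted = sorted(sample_b)
    n, m = len(a_sorted), len(b_sorted)
    largest = 0.0
    # The difference can only change at observed values, so checking those suffices.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / n  # fraction of sample_a <= x
        cdf_b = bisect_right(b_sorted, x) / m  # fraction of sample_b <= x
        largest = max(largest, abs(cdf_a - cdf_b))
    return largest


# Two interleaved samples: the empirical CDFs are never more than 0.5 apart.
print(smirnov_two_sample([1, 3], [2, 4]))
```

The computed value is then compared with the tabulated two-sided Smirnov values for the two sample sizes, as in the example above.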
The preceding examples demonstrate that in certain circumstances the parametric tests may fail to indicate significant differences between distributions. The next example demonstrates the opposite situation, namely, that misapplied parametric tests can indicate significant differences between two distributions when in fact there are no differences. The specific circumstances are rather general, but we have noticed this to be the case if the parent distribution is skewed and the sample size small. Table 8 shows the data and the results of parametric and nonparametric tests performed on such data. Based on the parametric tests, the two data sets differ with regard to variability. This is, however, not the case based on the Two-Sample Smirnov Test. The Lilliefors Test shows that one of the data sets is nonnormally distributed.

IV. Conclusions
The preceding examples have demonstrated the problems that can arise when parametric statistics are used without proper foundation, that is, the independent verification of sufficient normality for the parametric test [...] important for the commo[...] is that the more widely applicable nonparametric methods are more powerful (as compared to individual parametric tests) than the parametric tests for deciding when the "test" differs from the "control". (Some examples of the use of nonparametric statistics in data analysis can be found in the following references: (38, 39, 40, 41, 42, 43, 44, 45).)

As a general approach to answering the question "Is the 'test' different from the 'control'?", we recommend the following strategy.
1) For the first step in the intercomparison, apply the two-sample Smirnov test. If this step leads to the conclusion that no difference exists, no further analysis is necessary.
2) If the first step establishes that a difference exists and it is necessary to resolve the difference in terms of distributional parameters, apply the Lilliefors (or other appropriate) test.
3) If the data are normally distributed, use the parametric tests as necessary. If not, find the transformations which do make the data properly parametric. If these transformations cannot be found, the questions being asked of the data cannot be answered and new questions must be found.

The computer programs used in the simulations and data analysis are available on request. Also available is the "Comprehensive Statistical Screening Program",
Table 7 PRACTICAL DATA AND RESULTS OF THE PARAMETRIC TESTS

       Data 1                Data 2
 5.65000   5.25000     5.17000   5.68000
 5.75000   5.25000     5.40000   5.61000
 5.75000   5.25000     5.08000   5.89000
 5.75000        --     5.00000   5.79000
 5.75000   5.25000     5.86000   5.62000
 5.75000   5.25000     5.20000   5.72000
 5.75000   5.25000     5.20000   5.67000
 5.75000   5.25000     5.25000   5.84000
 5.75000   5.25000     5.09000   5.69000
 5.75000   5.25000     5.33000   5.58000
 5.75000   5.25000     5.52000   5.66000
 5.75000   5.25000     5.25000   5.82000
 5.75000   5.25000     5.26000   5.85000
 5.75000   5.25000     5.35000   5.77000
 5.75000   5.25000     5.20000        --
      --   5.25000     5.76000   5.60000
 5.75000   5.25000     5.27000   5.72000
 5.75000        --     5.09000   5.66000
 5.75000   5.25000     5.43000   5.70000
 5.75000   5.25000     5.26000   5.76000
 5.75000   5.25000     5.28000   6.00000
 5.75000   5.25000     5.40000   5.66000
 5.75000   5.25000     5.46000   5.75000
 5.75000   5.25000     5.20000   5.61000
 5.75000   5.25000     5.33000   5.50000

                  Data 1     Data 2
Mean             5.50659    5.48219
Variance         0.07181    0.06057
Stand. Dev.      0.26797    0.24611
Coeff. Var.      4.86643    4.46933

Tabulated t-values:   t(98,90) = 1.2902   t(98,95) = 1.6606   t(98,97.5) = 1.9845   t(98,99) = 2.365
Calculated t-value:   0.47423

Tabulated F-values:   F(49,49,90) = 1.4496   F(49,49,95) = 1.6123   F(49,49,97.5) = 1.7691   F(49,49,99) = 1.9724
Calculated F-ratio:   1.18553
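
The calculated t-value and F-ratio in Table 7 can be checked from the summary statistics alone. This is a sketch under stated assumptions: both samples are taken to have n = 50 observations (inferred from the 98 and 49,49 degrees of freedom), the t-value is taken to be the equal-variance two-sample form, and the F-ratio is the larger variance over the smaller. The function names are hypothetical.

```python
import math


def pooled_t(mean1, var1, mean2, var2, n):
    """Two-sample t for equal sample sizes n, using the pooled variance."""
    pooled_var = (var1 + var2) / 2.0
    standard_error = math.sqrt(pooled_var * 2.0 / n)
    return (mean1 - mean2) / standard_error


def variance_ratio(var1, var2):
    """F-ratio with the larger variance in the numerator."""
    return max(var1, var2) / min(var1, var2)


# Summary statistics from Table 7; n = 50 per sample is an assumption.
t_value = pooled_t(5.50659, 0.07181, 5.48219, 0.06057, n=50)
f_ratio = variance_ratio(0.07181, 0.06057)
print(f"t = {t_value:.4f}, F = {f_ratio:.4f}")
```

Both values fall below the lowest tabulated t- and F-values, which is why the parametric tests suggested no effect of the additive.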
Table 8 ANALYSIS OF SKEWED DATA

Data Set 1
-0.85837  -0.79696  -0.78426  -0.78426  -0.61911
-0.61911  -0.57363  -0.5422   -0.3886   -0.3701
-0.37918  -0.35139  -0.33228  -0.23139   0.14258
 0.57340   1.21056   1.58031   1.58031   4.11062

Data Set 2
-0.90514  -0.84636  -0.83421  -0.83421  -0.79696
-0.79696  -0.78426  [...]
-0.63336  -0.50991  -0.40681  -0.14344  -0.07251
-0.04782   0.05691   0.26831   0.41030   0.71030

                Data Set 1   Data Set 2
Mean              0.078786    -0.412353
Variance          1.488776     0.235497
Stand. Dev.       1.220154     0.485281
Lilliefors        0.2503       0.1759

Tabulated Lilliefors values:          (20,90) = 0.174   (20,95) = 0.190   (20,99) = 0.231
Tabulated t values:                   t(38,90) = 1.3042   t(38,95) = 1.686   t(38,97.5) = 2.024
Computed t:                           1.673
Tabulated F values:                   F(19,19,95) = 2.168   F(19,19,97.5) = 2.527   F(19,19,99) = 3.028
Computed F:                           6.322
Tabulated two-sided Smirnov values:   T(20,20,80) = 0.30   T(20,20,90) = 0.35   T(20,20,95) = 0.40
Computed T:                           0.349
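
Reading Table 8 amounts to comparing each computed statistic with its tabulated values. The sketch below shows that lookup; it is an illustration only. The function name is hypothetical, and the pairing of the two computed Lilliefors values with Data Set 1 and Data Set 2 is an assumption based on the order in which they are printed.

```python
def highest_level_exceeded(computed, tabulated):
    """Return the highest confidence level (%) whose tabulated value is exceeded, or None."""
    exceeded = [level for level, critical in tabulated.items() if computed > critical]
    return max(exceeded) if exceeded else None


# Tabulated values from Table 8, keyed by confidence level in percent.
f_table = {95: 2.168, 97.5: 2.527, 99: 3.028}
t_table = {90: 1.3042, 95: 1.686, 97.5: 2.024}
smirnov_table = {80: 0.30, 90: 0.35, 95: 0.40}
lilliefors_table = {90: 0.174, 95: 0.190, 99: 0.231}

print("F test:", highest_level_exceeded(6.322, f_table))
print("t test:", highest_level_exceeded(1.673, t_table))
print("Smirnov test:", highest_level_exceeded(0.349, smirnov_table))
print("Lilliefors, first value:", highest_level_exceeded(0.2503, lilliefors_table))
print("Lilliefors, second value:", highest_level_exceeded(0.1759, lilliefors_table))
```

The F-ratio exceeds even the 99% tabulated value, while the Smirnov statistic stays below the 90% value, which is the contrast the text draws for skewed data and small samples.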
combining Lilliefors, Kolmogorov-Smirnov, and a number of standard parametric tests.

We would like to express our thanks to the Polaroid Corporation for the permission to publish this work and to Jean Frederiksen of Polaroid for many helpful suggestions and stimulating discussions in connection with this paper.

Literature Cited
1. Mason, R.L., "FORTRAN Programs for Non-Parametric Studies", Naval Underwater Systems Center, New London, Connecticut, AD-769 649, National Techn. Information Service, U.S. Dept. of Comm., Springfield, Va., 1973.
2. Dixon, W.J. and Massey, F.J., "Introduction to Statistical Analysis", 3rd ed., pp. 114-119, McGraw-Hill Book Co., New York, 1969.
3. Dunn, O.J. and Clark, V.A., "Applied Statistics: Analysis of Variance and Regression", pp. 50-53, John Wiley & Sons, New York, 1974.
4. Dixon and Massey, loc. cit., pp. 109-113.
5. Dunn and Clark, loc. cit., pp. 53-55.
6. Natrella, M.G., "Experimental Statistics", National Bureau of Standards Handbook 91, pp. 4-1 to 4-14, U.S. Government Printing Office, Washington, D.C., 1963.
7. Natrella, loc. cit., pp. 4-8 and 4-9.
8. Owen, D.B., "Handbook of Statistical Tables", pp. 27-30, Addison-Wesley, Reading, Mass., 1962.
9. Owen, loc. cit., pp. 63-87.
10. Davies, O.L. and Goldsmith, P., "Statistical Methods in Research and Production", 4th rev. ed., pp. 12-16, Hafner Publishing Co., New York, 1972.
11. Dixon and Massey, loc. cit., pp. 9-10.
12. Haseloff, O.W. and Hoffman, H.J., "Kleines Lehrbuch der Statistik", p. 82, Walter de Gruyter & Co., Berlin, 1968.
13. Maisel, L., "Probability, Statistics and Random Processes", pp. 56-59, Simon & Schuster, New York, 1971.
14. Mosteller, F., Rourke, R.E.K. and Thomas, G.B., Jr., "Probability with Statistical Applications", pp. 259-268, Addison-Wesley Publishing Co., Reading, Mass., 1970.
15. Bethea, R.J., Duran, B.S. and Boullion, T.L., "Statistical Methods for Engineers and Scientists", p. 43, Marcel Dekker, Inc., New York, 1975.
16. Conover, W.J., "Practical Nonparametric Statistics", pp. 293-326, John Wiley & Sons, New York, 1971.
17. Siegel, S., "Nonparametric Statistics for the Behavioral Sciences", pp. 127-136, McGraw-Hill Book Co., New York, 1956.
18. Hollander, M. and Wolfe, D.A., "Nonparametric Statistical Methods", pp. 219-228, John Wiley & Sons, New York, 1973.
19. Dixon and Massey, loc. cit., pp. 243-244.
20. Dyer, A., Biometrika (1974) 61, pp. 185-189.
21. Shapiro, S.S. and Francia, R.S., J. Am. Stat. Assoc. (1972) 67, pp. 215-216.
22. Stephens, M.A., J. Am. Stat. Assoc. (1974) 69, pp. 730-737.
23. Lilliefors, H.W., J. Am. Stat. Assoc. (1967) 62, pp. 399-402.
24. Conover, loc. cit., pp. 302-306.
25. Beyer, N.H., "Handbook of Tables for Probability and Statistics", 2nd ed., pp. 125-134, The Chemical Rubber Co., Cleveland, Ohio, 1968.
26. IBM Application Program, System/360, Scientific Subroutine Package, Version III, Programmer's Manual, Program Number 360A-CM-03X, 5th ed., p. 78, 1970.
27. Alger, P.L., "Mathematics for Science & Engineering", pp. 303-304, McGraw-Hill Book Co., New York, 1969.
28. Conover, loc. cit., p. 398.
29. Conover, loc. cit., pp. 399 and 400-401.
30. Massey, F.J., Jr., [...] pp. 435-441.
31. Conover, loc. cit., p. 311.
32. Shapiro, S.S. and Wilk, M.B., Biometrika (1965) 52, pp. 591-611.
33. Shapiro, S.S., Wilk, M.B. and Chen, J.H., J. Am. Stat. Assoc. (1968) 63, pp. 1343-1372.
34. D'Agostino, R.B., Biometrika (1971) 58, pp. 341-348.
35. Schuster, E.I., J. Am. Stat. Assoc. (1973) 68, pp. 713-715.
36. Klimko, L.A. and Antle, C.E., Comm. Statistics (1975) 4, pp. 1009-1019.
37. Govindarajulu, Z., Comm. Statistic Theory Meth. (1976) A5, pp. 429-453.
38. Reed, A.H., Clinical Chemistry (1971) 17, p. 275.
39. Arnoldi, C.C., Acta Chirug. Sci. (1972) 138, p. 25.
40. Scholler, R., Path. Biol. (1973) 21, p. 375.
41. Wu, G.T., Twomey, S.L. and Thiers, R.E., Clinical Chemistry (1975) 21, pp. 315-320.
42. Smith, W.B., Tex. J. Sci. (1971) 22, p. 252.
43. Vestal, C.K., Monthly Weather Rev. (1971) 99, p. 650.
44. Britton, P.W., J. Sanit. Eng. (1972) 98, p. 717.
45. Locks, M.O. and Pauler, G.L., Proc. Am. Reliability and Maintainability Symposium (1974) 7, pp. 226-228.
46. Stephens, M.A., J. Roy. Stat. Soc., Ser. B, (1970) 32, pp. 115-122.
47. Conover, loc. cit., p. 397.
48. Massey, F.J., Jr., J. Am. Stat. Assoc. (1951) 46, pp. 69-78.
49. Owen, loc. cit., pp. 413-425.
50. Siegel, loc. cit., pp. 47-51.
51. Stephens, M.A. and Maag, M.R., Biometrika (1968) 5, pp. 428-430.
12

SIMCA: A Method for Analyzing Chemical Data in Terms of Similarity and Analogy

SVANTE WOLD and MICHAEL SJÖSTRÖM*
Research Group for Chemometrics, Institute of Chemistry, Umeå University, S-901 87 Umeå, Sweden

1. Introduction
Chemists study the behavior of molecules and mixtures of different molecules. Henceforth these mixtures or molecules are called chemical systems or objects. As in all branches of science, the increased quantification of measurements makes mathematics increasingly important both for the analysis of the measured data and for the prediction of the behavior of yet unstudied systems.

Roughly, we can distinguish two extremes of mathematical models employed for these purposes. The first type, best suited for rather simple systems, we can call global hard models. There are two contrasts: global-local and hard-soft. The term global is used to indicate that the models, in principle, describe the behavior of the system "everywhere". The polarization hard-soft is used mainly to describe the amount of information contained in the model. Hard models contain much information since they describe the system in terms of fundamental quantities. In addition, the deviation between the hard model and the measured data must not be larger than the errors of measurement.

Developed mainly in physics, global hard models describe a system in terms of fundamental physical quantities such as mass, charge, energy and time. Examples of this type of model employed in chemistry are quantum mechanical models and kinetic models in the form of systems of differential equations. The hard global models, when properly applied, have great advantages both in their far-reaching predictions and their interpretation in terms of fundamental quantities. However, their use is limited to systems of moderate complexity, since the complexity of the models (in terms of the number of involved parameters and equations) grows faster than exponentially when the number of components in the modelled systems increases.

* 1976-77 on leave to: Laboratory for Chemometrics, Department of Chemistry, BG10, Seattle, WA 98195, USA.

Since global models of the type used in physics were for a long time the only mathematical models available, these were applied also in the study of very complex chemical systems involving many atoms and molecules, such as organic compounds reacting with each other in a solvent. For some aspects of such studies, for instanc[...] mathematical analysi[...] notably the relationship between structure and reactivity, the results of the mathematical analysis usually have been disappointing, in particular with regard to the predictions resulting from the analysis. It was soon found that in order to apply the global hard models to systems of the complexity of common chemical systems, either the models had to be simplified to a degree that their interpretation was made dubious or the chemical system studied had to be simplified so far that it lost much of its relevance to the chemical problem that originated the study.

These difficulties with global hard models have made chemists continue to analyze their experimental data in a qualitative way. This qualitative analysis is often made in terms of analogy and similarity. Consider, as an example, an organic chemist studying properties of esters.
When he starts the study of a new ester, he does not expect this ester to display an identical behavior to earlier studied esters. Rather, he expects a similar behavior. What is meant by "similarity" is difficult to state precisely, but the intuitive meaning is rather easy to see in each individual case.

In the same way, much of chemical theory is based on the concepts of similarity and analogy; see for instance Theobald (1) for an illuminating discussion. Thus, the periodic system orders the elements in columns within which we find elements with "similar" chemical properties. Organic molecules with the same functional group, say COOR, display analogous properties with regard to reactivity, stability, spectra and so on.
With the large masses of quantitative data being produced in today's chemical laboratories (mass spectrometry, NMR, UV, IR, atomic absorption, electrophoresis and amino-acid analysis, etc.), the need to quantify the concepts of similarity and analogy becomes urgent. It is fairly easy to perceive a qualitative similarity between systems when one or two measurements are made on each of them, but extremely difficult when the number of data observed on each system becomes larger. As will be seen below, the quantification of similarity and analogy can be made in a straightforward manner using the two concepts of classes (i.e., local models) and similarity models (i.e., soft models).

Before we go into this, however, let us finish this section by fitting the different types of mathematical models [...] The contrasts betwee[...] and soft models should not be seen as dichotomies of mutually exclusive mathematical tools. Rather, we see in each chemical study one part that is best studied by means of global hard models and one part that is best studied by means of soft models. Usually there is also a region of overlap where either type of model can be employed (see Figure 1).

Let us illustrate this with the example of 13C NMR which we discuss below. Here one problem is to utilize the spectra to determine whether a given molecule has exo or endo configuration. This is a problem of rather large complexity where the fundamental knowledge is rather scarce about which factors determine the difference in spectra between exo and endo molecules. Hence this problem is best approached using soft models. Another problem is the interpretation of the emerging patterns of the spectra in the two classes of exo and endo molecules. This is a problem where we require answers in terms of fundamental concepts and we therefore try to use hard models for this interpretation, say correlations with charge densities calculated by some quantum mechanical method.

In this way hard and soft models form complements to each other, each type being useful for the purpose for which it is designed.

Since the use of hard models (2,3) is treated by numerous authors in the chemical literature, we will not dwell more on this area but will here instead outline the basis and use of soft models (of a special type) in chemical investigations. We urge the reader to remember the complementary nature of the different types of mathematical models. The present treatment
Figure 1. In a given problem different aspects are approached by means of different types of mathematical models.

Figure 2. Learning set of 2-substituted norbornanes. The exo compounds form class 1 (q = 1) and the endo compounds class 2 (q = 2).
thus discusses the use of mathematics in an area where this earlier has been difficult; we do not propose that the soft models should be used where hard models are appropriate. However, there are many types of data analysis where, in our view, hard models are presently used rather inappropriately. This is because there is a lack of information about the existence and use of soft models. In such cases there might emerge an apparent conflict between the two types of models. This conflict will certainly be resolved when experience with the use of all kinds of mathematical models is common in all branches of chemistry.

2. Example
To illustrate th[...] shall use the SIMCA-analysi[...]bornanes (Figure 2) (4). The data were analyzed to find out whether 13C NMR data of these compounds could be used to determine if the structure of a particular compound is exo or endo and whether there existed consistent patterns for each type of molecule which could be utilized in the assignment of NMR spectra. The data are shown in Table I. The reader is referred to section 5 for a more detailed discussion of which kind of information one desires from this type of data analysis.

3. The Idea of Classes; Experimental Design and Similarity Models
The quantification of the similarity concept which we here propose (SIMCA) is based on three fundaments (5).

The first is the experimental design which SIMCA presently can handle. Thus, we shall assume that on each object (chemical system) studied, a number of measurements have been made, the same variables being measured on all objects in the study. In the norbornane example, this number (henceforth denoted M) is seven; seven 13C NMR shifts, one for each atom 1-7 in Figure 2, are collected for each of the studied compounds. This design is rather common in chemical studies; the data collected in a study can often be arranged in a matrix as in Figure 3. As will be shown later in section 6.7, we need not assume that all positions in the matrix actually contain measured values; SIMCA works also with incomplete data. Of critical importance for the success or failure
of the data analysis is that at least some of the variables have a relevance for the problem, i.e., that they are in some way related to the chemical problem under study. As will be seen below, one primary result of the data analysis is information about this relevance: whether the data contain any information at all and, in such case, which parts carry this information.

Second, we shall assume that the chemical systems have been ordered in classes in such a way that each class contains only similar systems (see Figure 3). Usually, there are also in the study a number of systems of uncertain class assignment. These we collect in the "test set" (see Figure 3) in order to later make an assignment by means of the mathematical models applied to the classes. This ordering into classes is usually easy in chemica[...] are accustomed to formulatin[...] of classes. In the norbornane example we have the two classes exo and endo compounds between which we wish to find the differences in NMR spectra. Though we here illustrate the methodology by an example with two classes, the methods can equally well be applied to problems with a single class or with a large number of classes. Depending on whether the number of classes is one or two or more, however, different types of information can be extracted from the data, a discussion of which is given in section 5.

It is evident that the concept of classes considerably simplifies the discussion, analysis and interpretation of the behavior of chemical as well as other natural systems. This is the reason for the popularity of this concept. Compounds are classified as organic or inorganic, aromatic or aliphatic, acids or bases, ionic or covalent and further into a large number of subclasses depending on the chemical problem studied.
Reactions are classified according to "mechanisms", SN1 or SN2, solvent assisted or not, first or second order, acid or base catalyzed, etc. Thus, instead of considering a large number of individual cases or systems, one relates to a much smaller number of classes into one (or few) of which each case or system is assigned.

The third fundament is the existence of mathematical models which, under rather general assumptions, can approximate data observed on the similar objects in a single class. These similarity models, here called local and soft because they are limited to a single class and are of an approximate nature, can be derived by means of simple mathematical tools such as Taylor
expansions. They can be considered as analogous to polynomials. In the same way as a polynomial of sufficiently high order can always approximate the variation of any continuous function in a limited interval, the similarity models shown in eq. (1) can, with sufficiently many terms, approximate any data matrix measured on a collection of similar systems, i.e., systems showing a limited diversity.

In conclusion, the similarity concept can presently be quantified in situations where
a. measurements of the same type have been made on a number of systems,
b. at least two measurements have been made on each system,
c. the systems can be ordered into one or several classes plus a "test set" so that within each class there are only "similar" systems.

For each class we can then, on the basis of the data of the objects [...] struct a mathematica[...] 6.1). If we have Q classes in the study we thus obtain Q different models, each describing the data structure within a single class.

There are other ways to handle the classification problem, most of which are reviewed by Kanal (6) and Cacoullos (7). Some of these methods have been applied in chemical data analysis by Isenhour and Jurs (8), Kowalski (9,10) and others. The main difference between these methods and SIMCA, which we here discuss, is that the scope of SIMCA goes further than mere classification, namely to get an approximate description of the data structure within each class in terms of a quantitative model.

4. Graphical Representation
It is convenient to have a graphical representation of the data observed on the systems under study (Figure 3). The best way found so far is to represent the data measured on one system as a point in the M-dimensional space obtained when each variable is given one orthogonal coordinate axis. This space, the measurement space, is henceforth referred to as M-space. The 13C NMR data in Table I would thus be represented as points in a 7-dimensional space, one point for each compound. Though it is difficult to visualize spaces with more than 3 dimensions, we can discuss examples in 2 or 3 dimensions and draw analogies to higher dimensional spaces. Figure 4 shows one of many 2-dimensional representations of the norbornane data and it is seen that even this simple plot reveals interesting
Figure 3. The data matrix in a quantitative analysis of similarity. On each chemical system (object), the values [...] ordered in classes displaying internal similarity. The test set contains objects of unknown class assignment. (Rows: variables 1, 2, ..., M; columns: objects grouped as Class 1, Class 2, ..., Class Q, which form the learning (training, reference) set, followed by the unassigned objects of the test set.)

Figure 4. Plot of the difference in relative shift for the C6 (Δδ6k) and C7 (Δδ7k) carbons against the relative shift for the C4 carbon (Δδ4k) for the compounds 1-15 (see Figure 2). The variables are further rescaled to the same experimental width. Thus a = (1.43 Δδ[...] [...] Δδ[...])/√2 and b = 2.73 Δδ4k. Circles refer to exo compounds and squares to endo compounds. Two well separated classes are obtained. (Axes: a (ppm) and b.)
regularities within the two classes. Graphical methods will be discussed further in section 6.8.

5. Which Information Is Wanted from the Data Analysis
For each particular problem there are of course some questions which are specific for the problem. There are usually also a number of questions which are common to all problems of this kind. These questions relate mainly to three types of information: (a) information about each separate class, (b) information about the relations (similarity and dissimilarity) between objects and classes and (c) information about the relations between classes.

5.1 A Single Class.
The informatio[...] the analysis of the data observed on the objects "known" to belong to the class (the training, learning, or reference set of the class). The primary goal is usually to get a picture of the "data profile" of the class, i.e., the typical behavior of the objects in the class. In SIMCA, this data profile is described in terms of the parameters α and β in eq. (1) (see section 6.2), which in turn can be seen to define a simple surface (line or hyperplane) in M-space.

Second, it is often of interest which variables "take much part" in this data profile and which (if any) are less involved in the class profile. This is further discussed in section 6.5.2.

Third, in the analysis of real data, the learning set of the class is often chosen under some uncertainty. Hence, one is interested to find out whether one or several of the objects in the learning set have deviating properties, either due to an incorrect class assignment or because of a non-typical behavior; that is, one wants to find outliers in the training set (see section 6.5.1).

5.2 Relations Between Objects and Classes.
The class assignment of the objects in the test set is sometimes the primary goal of the data analysis. In chemistry one is usually also interested in other aspects, as touched on in sections 5.1 and 5.3. We stress this because some methods used in "pattern recognition" are "optimal" when (i) the sole goal is the classification of unassigned objects and (ii) one is completely certain that the objects in the training set all are correctly assigned and typical of the classes. When more information is desired (see above) and when the possibility of outliers must be considered, some of this classification optimality must be compromised for other aspects such as information about the data profiles of the classes and outliers.

The information about the relation between the objects and the classes can take different forms. The crudest is that an object either belongs to a class or not. Though this is sufficient in some cases, one usually wants a probability measure of this class assignment instead. Thus, SIMCA gives, for each object, a probability for each class that the object belongs to the class (see section 6.6). These probabilities need not sum up to one. We can consider a particular object to have very low probabilities to belong to the given "known" classe[s ...] kind". We can also [...] object to belong to two (or several) classes: either the data are not sufficient to make a unique assignment, or the object in fact does belong to several classes, like a chemical compound having several functional groups or a patient suffering from several diseases.

5.3 Relation Between Classes.
Here we might be interested in whether two of the classes are "close" to each other compared to the other classes; we desire information about the "distances" between the classes (section 6.5.7). Another often important type of information is the "distance" between two classes for each of the variables, i.e., how important each variable is for distinguishing between two classes. When collecting such information for one variable over all class-pairs or, perhaps better, over all "close" class-pairs, we get an idea about the discrimination power of the variable (section 6.5.4). In many problems, the discrimination power of the variables is of great importance for the interpretation of the data structure, in particular when a large number of variables have been included without prior information about their relevance for the problem.

When reducing the number of variables (see section 6.5.6), however, we must not base this reduction on the discrimination power of the variables, since this will result in a gross overestimation of the differences between classes. Instead, as discussed below, this selection is better based on the modelling power of the variables, which basically is how much each variable "takes part" in the modelling of all the classes instead of just a single class as discussed in section 5.1 (section 6.5.3).

6. The SIMCA Method
The method* is based on modelling each class by a separate model. The main features are described in the following sections. Of those, 6.3 through 6.6 are rather technical in nature and might be skimmed by the reader more interested in the philosophical points.

6.1 Describing a Class by Mathematical Model.
We can see one primary goal of the data analysis as finding "the regular behavior" of the objects in the different classes. Th[...] to simplify the discussio[n ...]lection of objects: instead of discussing each individual object we reduce the complexity by the discussion of a much smaller number of classes. With this philosophical motivation for the classes it is natural to treat each class as independent from the other classes. This corresponds, in mathematical terms, to the construction of one mathematical model for each class.

The first question is whether this is at all possible; after all, we know only that the objects in one class are in some way similar. The trick is to use mathematical models which can approximate any regular behavior of a class of similar objects. Let us, for illustrative purposes, look for a while at a simpler case. For bivariate data y = f(x) we know a large number of approximative models. Thus, provided that f(x) has some continuity properties, it can be Taylor expanded around $x = x_0$, giving a polynomial series of arbitrarily good approximating power when sufficiently many terms are included in the expansion. In the same way, using multidimensional Taylor expansions, it can be shown that under some rather general assumptions, data observed on a class of similar objects can be approximated arbitrarily well by the principal components expansion (1) in the next section (5).

The chemist should not be surprised by this result. Simple few-term principal components (PC) models have for some time been used to describe chemical data observed on similar objects (usually chemical reactions) under the names of Linear Free Energy Relationships (LFERs) and Extra Thermodynamic Relationships (ETRs) (11,12). The Brønsted and Hammett equations (13,14) are notable examples. The relation between LFERs and chemical pattern recognition has been pointed out by Hammond (15).

The most important condition for these PC models to be able to approximate the data observed on a class of similar objects is that the variables i indeed are, at least in part, a realization of this similarity within the class. In other words, the variables chosen to characterize th[...] problem. This conditio[n ...] fundamental importance for all methods of data analysis. One advantage with the SIMCA method is that it provides a measure of the relevance of each variable (see sections 6.5.2 and 6.5.4). This gives the opportunity to include variables of doubtful relevance in initial stages of the data analysis, followed by calculation of their actual relevance and the deletion of irrelevant variables.

* SIMCA is an acronym for Statistical Isolinear Multiple Components Analysis (due to Dave Duewer) or Soft Independent Modelling of Class Analogy or SIMple Classification Algorithm or SIMilarographic Computer Analysis or

6.2 Principal Components (PC) Models As Similarity Models.
A data matrix $Y^{(q)}$ with the elements $y_{ik}^{(q)}$ can, provided that a few assumptions are fulfilled, be approximated arbitrarily closely by the PC model (5):

$$y_{ik}^{(q)} = \alpha_i^{(q)} + \sum_{a=1}^{A_q} \beta_{ia}^{(q)}\,\theta_{ak}^{(q)} + \varepsilon_{ik}^{(q)} \qquad (1)$$
Here the index q indicates that the data belong to class q. The assumptions on which this model is based are two: (i) the data are generated by a "smooth" process and (ii) the cases (objects) with index k are "similar".

The first assumption corresponds to an assumption that the function generating the data is several times differentiable so it can be Taylor expanded. This assumption is very natural with chemical data. Whatever detailed theory one has for the processes generating the data $y_{ik}$, most such theories can be formulated as being solutions to an operator equation, which automatically makes this "smoothness" assumption fulfilled. We note that this first assumption is generally fulfilled when $y_{ik}$ are measured data. When $y_{ik}$ is of some other kind, say discrete data deduced from a chemical structure such as the presence or absence of a nitro-group in a certain position, this assumption is far from fulfilled. Numerically, the method works anyway, but the user must be cautioned that the interpretation and use of the results is less certain.

The second assumption about similarity is needed to make the number of terms in the Taylor expansions limited, i.e., the number of product terms in model (1), A, small. This is in a natural way related to how the class defining the choice of the data is selected. If it i[...] corresponds to a natura[l ...] model (1) will describe data well, but if done in a "bad" way, the model (1) will give an ill-defined description.

Geometrically, model (1) corresponds to an A-dimensional hyperplane in M-space. For the simplest cases, A = 0 and A = 1, this corresponds to a single point and a straight line. This makes the graphical illustration of model (1) rather simple (see Figure 7).

Model (1) is related to several mathematical methods finding use in chemistry. LFERs have already been mentioned in the previous section.
Ratio-matching (16) is a special case of eq. (1) when the variable means are small and the number of product terms, A, is one. The latter is often fulfilled for data observed on closely similar objects. Factor analysis (12,18,19) has a mathematical formulation closely related to eq. (1). The main difference to SIMCA is that in factor analysis one usually assumes that a single model is valid over all data (i.e., a global model), while SIMCA is based on one separate model for each class.

Statistically, the model (1) is closely connected to the models of multiple regression (20). Thus, the values $\beta_{ia}$ in eq. (1) correspond to the independent variables $x_i$ in multiple regression. The principal difference is that the latter (x) are known by observation while the former (β) are estimated from the N realization vectors $y_k$ (k = 1,2,...,N). Once the values of β are estimated in SIMCA from the observed values of $y_{ik}$ for the training set of the class, later uses of the model correspond to multiple linear regressions with the values of β fixed to these estimated values. Thus, the SIMCA classification of an unassigned object corresponds to multiple regressions, one for each class model (index q), of the data of the object (denoted by $y_{ip}$) on the parameter vectors $\beta_a^{(q)}$:
$$y_{ip} - \alpha_i^{(q)} = \sum_{a=1}^{A_q} t_{ap}^{(q)}\,\beta_{ia}^{(q)} + e_{ip}^{(q)} \qquad (2)$$

The "distance" between the object p and the class models is related to the residual standard deviation

$$s_p^{(q)} = \left[ \sum_{i=1}^{M} \left(e_{ip}^{(q)}\right)^2 \Big/ (M - A_q) \right]^{1/2} \qquad (3)$$
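The regression in eqs. (2) and (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `fit_object` and the list-based data layout are hypothetical, and the loading vectors β_a are assumed to be mutually orthogonal (as delivered by the NIPALS procedure of section 6.3), so that each coefficient t_a is obtained by a separate projection.

```python
import math

def fit_object(y, alpha, betas):
    """Fit one object to one class model (eq. 2) and return (t, e, s_p).

    y     : list of M observed values for the object
    alpha : list of M class means (alpha_i)
    betas : list of A loading vectors, each of length M, assumed orthogonal
    """
    M = len(y)
    A = len(betas)
    centred = [y[i] - alpha[i] for i in range(M)]
    # With orthogonal loadings, each t_a is a simple projection.
    t = []
    for b in betas:
        num = sum(c * bi for c, bi in zip(centred, b))
        den = sum(bi * bi for bi in b)
        t.append(num / den)
    # Residuals e_ip of eq. (2).
    e = [centred[i] - sum(t[a] * betas[a][i] for a in range(A)) for i in range(M)]
    # Residual standard deviation of eq. (3), with M - A degrees of freedom.
    s_p = math.sqrt(sum(v * v for v in e) / (M - A))
    return t, e, s_p

t, e, s_p = fit_object([3.0, 1.0, 3.0], [1.0, 1.0, 1.0], [[1.0, 0.0, 0.0]])
```

With orthogonal loadings this gives the same coefficients as a full multiple regression; for non-orthogonal loadings a general least squares solver would be needed instead.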
To conclude this section, we state the two main numerical problems involved in the application of model (1) to a given data matrix Y, obtained by observing the values of M variables on one class of N similar objects (for Q classes, each class is treated in the same way, giving Q independent class models).

(i) Estimation of the number of product terms in eq. (1), A. The estimation of this number corresponds to the separation of the data Y into signal (the parameters α and β) and noise ($\varepsilon_{ik}$). We realize that a large A gives small residuals (ε), that is, much signal and little noise, and the reverse. Hence it is of utmost importance to get a good value of A: a value too large will give too much apparent structure in the data, leading to too far-reaching conclusions. A value of A too small will correspond to an under-utilization of the information actually contained in the data.

(ii) Once the value of A is estimated for the class, the estimation of the parameters α and β in the model (1) is numerically rather trivial. This is further discussed in section 6.3.

6.3 Parameter Estimation in the PC Model (1).
For a given A, the estimation of the parameters α and β in eq. (1) can be accomplished by several numerical methods. This problem goes by many names in mathematics: singular value decomposition, eigenvector decomposition, Karhunen-Loève expansion, matrix diagonalization and other names. In SIMCA we have, for different reasons, used a special type of algorithm called a NIPALS algorithm (21,22). This works iteratively, determining one vector $\beta_{a+1}$ after another (5), using the fact that these vectors are mutually orthogonal. This iterative "peeling" procedure gives two advantages: it is efficient in the cross-validatory estimation of A (see next section) and it can be adapted to handle incomplete data matrices (some observations are missing).
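The iterative peeling can be sketched as follows. This is a minimal illustration for a complete data matrix, not the authors' program: the names `nipals_component` and `fit_pc_model`, the choice of starting vector, and the convergence tolerance are assumptions made here, and the missing-data extension is left out.

```python
import math

def nipals_component(E, n_iter=500, tol=1e-12):
    """Peel one product term beta_i * theta_k off a centred M x N matrix E.

    Rows are variables i, columns are objects k. Returns (beta, theta, R),
    where R is E with the fitted term subtracted.
    """
    M, N = len(E), len(E[0])
    # Start from the row with the largest sum of squares (an assumption).
    start = max(range(M), key=lambda i: sum(v * v for v in E[i]))
    theta = list(E[start])
    if sum(v * v for v in theta) == 0.0:
        # Nothing left to model.
        return [0.0] * M, [0.0] * N, [row[:] for row in E]
    beta = [0.0] * M
    for _ in range(n_iter):
        tt = sum(v * v for v in theta)
        # Regress each row of E on theta, then normalise beta to unit length.
        beta = [sum(E[i][k] * theta[k] for k in range(N)) / tt for i in range(M)]
        norm = math.sqrt(sum(b * b for b in beta))
        beta = [b / norm for b in beta]
        # Regress each column of E on beta.
        new_theta = [sum(E[i][k] * beta[i] for i in range(M)) for k in range(N)]
        change = sum((a - b) ** 2 for a, b in zip(new_theta, theta))
        theta = new_theta
        if change <= tol * sum(v * v for v in theta):
            break
    R = [[E[i][k] - beta[i] * theta[k] for k in range(N)] for i in range(M)]
    return beta, theta, R

def fit_pc_model(Y, A):
    """Estimate alpha, beta and theta of model (1) for an M x N matrix Y."""
    M, N = len(Y), len(Y[0])
    alpha = [sum(row) / N for row in Y]
    E = [[Y[i][k] - alpha[i] for k in range(N)] for i in range(M)]
    betas, thetas = [], []
    for _ in range(A):
        beta, theta, E = nipals_component(E)
        betas.append(beta)
        thetas.append(theta)
    return alpha, betas, thetas, E  # E now holds the residuals

alpha, betas, thetas, residuals = fit_pc_model([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]], 1)
```

Because each term is subtracted before the next is sought, successive loading vectors come out orthogonal, which is the property the text relies on.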
6.4 The Number of Components (A) in PC Models.
As discussed in section 6.2, it is essential not to overestimate or underestimate the amount of information contained in the data matrix Y. This amount is directly related to the number of terms in model (1), i.e., A. We have use[d ...] validation which give[s] the optimal predictability of the data set back on itself using model (1). This takes some explanation.

Assume that we have estimated the parameters $\alpha_i$ and $\beta_{ia}$, $\theta_{ak}$ for a = 1, 2 up to A for a given data matrix Y of a particular class. We can then see if the addition of another term $\beta_{i,(A+1)}\,\theta_{(A+1),k}$ betters the fit of model (1) to the data Y in the following way.

1. Calculate the residuals after A terms

$$\varepsilon_{ik}^{(q)} = y_{ik}^{(q)} - \alpha_i^{(q)} - \sum_{a=1}^{A} \beta_{ia}^{(q)}\,\theta_{ak}^{(q)} \qquad (4)$$

2. Delete part of the matrix elements in $\{\varepsilon_{ik}^{(q)}\}$ and fit the model to the remaining elements

$$\varepsilon_{ik}^{(q)} = \beta_{i,(A+1)}^{(q)}\,\theta_{(A+1),k}^{(q)} + \delta_{ik}^{(q)} \qquad (5)$$

Since the NIPALS method (21,22) works also with incomplete data, this fitting is easily accomplished.

3. For the deleted elements (denoted by $\varepsilon_{ik}^{*}$), calculate the prediction error

$$\Delta_{ik}^{(q)} = \varepsilon_{ik}^{*(q)} - \beta_{i,(A+1)}^{(q)}\,\theta_{(A+1),k}^{(q)} \qquad (6)$$

4. Restore the matrix $\{\varepsilon_{ik}^{(q)}\}$, then delete other elements and repeat steps 2 and 3 until all elements of $\{\varepsilon_{ik}^{(q)}\}$ have been deleted once.

5. Form the sum of all squared prediction errors

$$D^{(q)} = \sum_{i=1}^{M} \sum_{k=1}^{N_q} \Delta_{ik}^{(q)2} \qquad (7)$$

6. Compare D with the sum of residual squares $S^{(q)}$

$$S^{(q)} = \sum_{i=1}^{M} \sum_{k=1}^{N_q} \varepsilon_{ik}^{(q)2} \qquad (8)$$

If the ratio betwee[n ...] (A+1)st product ter[m ...] error of model (1) and we conclude that A terms were sufficient. If D/S is less than one, we make a components analysis of the full matrix $\{\varepsilon_{ik}^{(q)}\}$, subtract the (A+1)st term and start again at step 1, but with A = A+1.

The details of this cross-validation methodology will be published elsewhere (23); it suffices here to say that the method works very well with both simulated and real data. If the values of A for the different classes come out very different (differ by more than two), one should use different values of $A_q$ for the different classes. If they agree within 1, however, it is practical to use the same value of $A_q$ for all the classes.
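The cross-validation steps can be sketched as follows. Since the details are deferred to ref. (23), several choices here are assumptions of this sketch only: the elements are deleted in a diagonal pattern of `groups` subsets, the one-term fit to the incomplete matrix uses plain alternating least squares over the retained elements as a stand-in for the extended NIPALS method, and all function names are hypothetical.

```python
def fit_one_term(E, keep, n_iter=100):
    """Fit E[i][k] ~ beta[i] * theta[k] using only cells where keep[i][k] is True."""
    M, N = len(E), len(E[0])
    # Start from the row with the largest retained sum of squares.
    start = max(range(M),
                key=lambda i: sum(E[i][k] ** 2 for k in range(N) if keep[i][k]))
    theta = [E[start][k] if keep[start][k] else 0.0 for k in range(N)]
    beta = [0.0] * M
    for _ in range(n_iter):
        for i in range(M):
            den = sum(theta[k] ** 2 for k in range(N) if keep[i][k])
            num = sum(E[i][k] * theta[k] for k in range(N) if keep[i][k])
            beta[i] = num / den if den > 0.0 else 0.0
        for k in range(N):
            den = sum(beta[i] ** 2 for i in range(M) if keep[i][k])
            num = sum(E[i][k] * beta[i] for i in range(M) if keep[i][k])
            theta[k] = num / den if den > 0.0 else 0.0
    return beta, theta

def cross_validation_ratio(E, groups=4):
    """Return D/S (eqs. 7 and 8) for adding one product term to residuals E."""
    M, N = len(E), len(E[0])
    D = 0.0
    for g in range(groups):
        # Each element is deleted exactly once over the loop (diagonal pattern).
        keep = [[(i + k) % groups != g for k in range(N)] for i in range(M)]
        beta, theta = fit_one_term(E, keep)
        D += sum((E[i][k] - beta[i] * theta[k]) ** 2
                 for i in range(M) for k in range(N) if not keep[i][k])
    S = sum(v * v for row in E for v in row)
    return D / S

# Residuals with one clear remaining product term: the ratio is far below one.
b = [1.0, 2.0, -1.0]
t = [1.0, -2.0, 3.0, -1.0, 2.0, -3.0, 1.0, -1.0]
ratio = cross_validation_ratio([[bi * tk for tk in t] for bi in b])
```

A ratio below one corresponds to the case in which the text continues with A = A+1.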
6.5 Information Contained in the Residuals.
The SIMCA method is based on the least squares framework. Thus the parameters α, β and θ in eq. (1) are calculated to minimize the sum of squared residuals ($\varepsilon_{ik}$). This estimation is consistent for most probable distributions of $\varepsilon_{ik}$, but in order for the methods below to be efficient, the residuals should be approximately normally distributed. Hence, this should be tested for. This is most easily done by making histograms of $\varepsilon_{ik}$ for each variable and class. If the residuals are highly non-normal, the data should be transformed; the reader is referred to Box and Cox (24) for a discussion of the problem. When the number of objects in each class is small (smaller than 20-30), however, any test of normality is highly uncertain and crude, and in this typical case one can only hope that normality is approximately fulfilled.

The size of the residuals after fitting data of a class reference set or data of individual objects to the various class models reveals interesting pieces of information. This will be discussed after the definition of suitable measures of "residual size". Besides equations (1) through (3) we shall need, in the treatment below, the following quantities.

The total residual standard deviation of class q ($\varepsilon_{ik}$ from eq. 1):

$$s_0^{(q)} = \left[ \sum_{i=1}^{M} \sum_{k=1}^{N_q} \varepsilon_{ik}^{(q)2} \Big/ \big( (N_q - A_q - 1)(M - A_q) \big) \right]^{1/2} \qquad (9)$$

The residual standar[d ...] over all the data i[...]:

$$s_i = \left[ \frac{1}{Q} \sum_{q=1}^{Q} \frac{M}{M - A_q} \sum_{k=1}^{N_q} \varepsilon_{ik}^{(q)2} \Big/ (N_q - A_q - 1) \right]^{1/2} \qquad (10)$$

The corresponding standard deviation of the training set data y:

$$s_{i,y} = \left[ \sum_{q=1}^{Q} \sum_{k=1}^{N_q} \left(y_{ik}^{(q)} - \bar{y}_i\right)^2 \Big/ \left( \Big(\sum_{q=1}^{Q} N_q\Big) - 1 \right) \right]^{1/2} \qquad (11)$$

The average of $y_{ik}$ over the training set:

$$\bar{y}_i = \sum_{q=1}^{Q} \sum_{k=1}^{N_q} y_{ik}^{(q)} \Big/ \sum_{q=1}^{Q} N_q \qquad (12)$$

The residual standard deviation of the objects in class r when fitted to the class model q ($e_{ip}$ from eq. 2; $y_{ip}$ then denotes data vectors belonging to class r):

$$s_r^{(q)} = \left[ \frac{1}{N_r} \sum_{p=1}^{N_r} \sum_{i=1}^{M} e_{ip}^{(q)2} \Big/ (M - A_q) \right]^{1/2} \qquad (13)$$

The corresponding standard deviation of variable i:

$$s_{i,r}^{(q)} = \left[ \frac{M}{N_r\,(M - A_q)} \sum_{p=1}^{N_r} e_{ip}^{(q)2} \right]^{1/2} \qquad (14)$$

6.5.1 Outliers in a Class Reference Set.
The size of the residuals for a particular object (denoted by p) can be compared with the "normal" size of the class by means of, say, an F-test. If this comparison shows that the residuals of the object are large, this indicates that the object is an outlier, either because it was incorrectly assigned to the class or because it shows an atypical behavior. Thus, the test is based on th[...]

$$F = s_p^{(q)2}\,\frac{N_q}{N_q - A_q - 1} \Big/ s_0^{(q)2} \qquad (15)$$

The standard deviations are defined in eq. (3) and (9). The correction $N_q/(N_q - A_q - 1)$ is needed because the data vector of object p has been included in the computation of the class parameters α and β. The F-value in eq. (15) is compared with critical F-values with $(M - A_q)$ and $(N_q - A_q - 1)(M - A_q)$ degrees of freedom.

If outliers are found in the training set, they should, as long as they are not too many (say not more than 10% of the training set), be deleted and the class parameters should be recalculated.
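Eq. (15) can be evaluated directly from the residual matrix of a class. The sketch below is illustrative only: the function name `outlier_f` and the M x N list layout (rows are variables, columns are training objects) are assumptions, and the critical F-value is left to tables or a statistics library.

```python
def outlier_f(resid, A, k):
    """F-ratio of eq. (15) for training object k, with its degrees of freedom.

    resid : M x N matrix of class residuals after A product terms
    A     : number of product terms in the class model
    k     : column index of the object being tested
    """
    M, N = len(resid), len(resid[0])
    # Squared residual standard deviation of the object, eq. (3).
    s_p2 = sum(resid[i][k] ** 2 for i in range(M)) / (M - A)
    # Squared total residual standard deviation of the class, eq. (9).
    s_02 = sum(v * v for row in resid for v in row) / ((N - A - 1) * (M - A))
    F = s_p2 * N / (N - A - 1) / s_02
    return F, (M - A, (N - A - 1) * (M - A))

F, dof = outlier_f([[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 2.0, -2.0]], A=0, k=2)
```

The returned pair of degrees of freedom is what one would look up in an F-table to decide whether the object should be treated as an outlier.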
6.5.2 The Modelling Power of a Variable.
In a method like SIMCA, where each class is described by a mathematical model, measures of how much of the variation in the original data is described by the models are of great interest. These measures are most simply obtained by comparing residual standard deviations with the corresponding data standard deviations. Thus we can calculate the modelling power of a variable i over all classes as:

$$\psi_i = 1 - s_i / s_{i,y} \qquad (16)$$

Here $s_i$ and $s_{i,y}$ are defined in eq. (10) and (11), respectively. A value of $\psi_i$ close to one indicates a high and close to zero a low modelling power.
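As a small illustration of eq. (16), the sketch below computes the data standard deviation of each variable over the pooled training set, following eqs. (11) and (12), and turns a list of residual standard deviations into modelling powers. The function names are hypothetical, and the residual standard deviations of eq. (10) are taken as given input rather than recomputed.

```python
import math

def data_sd_per_variable(classes):
    """s_{i,y} of eq. (11): classes is a list of M x N_q data matrices."""
    M = len(classes[0])
    n_total = sum(len(Y[0]) for Y in classes)
    sds = []
    for i in range(M):
        values = [v for Y in classes for v in Y[i]]
        mean = sum(values) / n_total                      # eq. (12)
        ss = sum((v - mean) ** 2 for v in values)
        sds.append(math.sqrt(ss / (n_total - 1)))
    return sds

def modelling_power(s_resid, s_data):
    """psi_i = 1 - s_i / s_{i,y}, eq. (16), for every variable i."""
    return [1.0 - sr / sd for sr, sd in zip(s_resid, s_data)]

classes = [[[1.0, 2.0, 3.0], [0.0, 0.0, 1.0]],
           [[4.0, 5.0], [1.0, 0.0]]]
s_data = data_sd_per_variable(classes)
# Hypothetical residual standard deviations: half, and all, of the data spread.
psi = modelling_power([s_data[0] / 2, s_data[1]], s_data)
```

In this made-up case the first variable has a modelling power of one half, while the second is not described by the models at all.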
6.5.3 The Data Relevance for a Class.

Similarly to the modelling power for a single variable (previous section) we can calculate how much of the variation in the data of a class q is described by the model:

$$\psi_0^{(q)} = 1 - s_0^{(q)} / s_y^{(q)} \qquad (17)$$

Here, $s_0^{(q)}$ is given in eq. (9) and $s_y^{(q)}$ is the standard deviation o[...] eq. 12):

$$s_y^{(q)} = \left[ \sum_{i=1}^{M} \sum_{k=1}^{N_q} \left(y_{ik}^{(q)} - \bar{y}_i\right)^2 \Big/ (M \cdot N_q) \right]^{1/2} \qquad (18)$$

The average of $\psi_0^{(q)}$ over all classes gives a measure of the total data relevance.

6.5.4 Discriminatory Power of a Variable.
By comparing the fit of objects in two given classes q and r to (a) their own class models and (b) the other class model (objects in class q to model r and vice versa) we can get an idea of the distance between the two classes over all variables (section 6.5.7) and also for each variable. Using the residual standard deviations in eq. (14) we can calculate the discriminatory power of variable i between the two classes q and r as:

$$\Phi_i^{(r,q)} = \left[ \frac{s_{i,r}^{(q)2} + s_{i,q}^{(r)2}}{s_{i,r}^{(r)2} + s_{i,q}^{(q)2}} \right]^{1/2} - 1 \qquad (19)$$

A value of $\Phi_i^{(r,q)}$ close to zero indicates a low discriminatory power, and a value much above one a good power. When more than two classes are present, an aggregate measure is obtained by summation over all class-pairs or just over all close class pairs ($n_{pair}$; according to the distances discussed in section 6.5.7):

$$\Phi_i = \sum_{r} \sum_{q} \Phi_i^{(r,q)} \Big/ n_{pair} \qquad (20)$$
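The discriminatory power can be sketched as follows. The function names are hypothetical, and two readings are assumed here rather than confirmed: that eq. (14) has the form s² = M/(N_r (M - A_q)) · Σ e², and that eq. (19) divides the cross-fit variances by the own-fit variances before taking the square root and subtracting one.

```python
import math

def variable_sd(resid, A):
    """Per-variable residual SD (assumed form of eq. 14).

    resid : M x N_r residuals of the objects of class r fitted to model q
    A     : number of product terms A_q of the model used
    """
    M, N = len(resid), len(resid[0])
    factor = M / (N * (M - A))
    return [math.sqrt(factor * sum(e * e for e in row)) for row in resid]

def discriminatory_power(s_r_on_q, s_q_on_r, s_r_on_r, s_q_on_q):
    """Eq. (19) for every variable; arguments are lists of per-variable SDs."""
    return [math.sqrt((a * a + b * b) / (c * c + d * d)) - 1.0
            for a, b, c, d in zip(s_r_on_q, s_q_on_r, s_r_on_r, s_q_on_q)]

def aggregate_power(per_pair_values):
    """Eq. (20) for one variable: average over the chosen class pairs."""
    return sum(per_pair_values) / len(per_pair_values)

# Cross-fit residuals twice as large as own-fit residuals give a power of 1.
phi = discriminatory_power([6.0], [8.0], [3.0], [4.0])
```

When the cross-fit and own-fit standard deviations are equal, the same function returns zero, matching the interpretation given in the text.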
6.5.5 Weighting of Variables.

It is customary to scale the variables so that they all have the same variance over the training set (auto-scaling) (9). This is to avoid an exaggerated influence of a variable showing large variation in the original data. After the first stage of the analysis, the measures of modelling power and discrimination power give additional information about the relevance of the variables for the problem. Hence, in later stages, the variables might, if so warranted, be weighted by a multiplication by their modelling power [...] Such a weightin[g ...] on the analysis, however [...] make a zero-one weighting as discussed in next section.

6.5.6 Selection of Variables from Many.

When a large number of variables have been included, a number of these often turn out to have a low modelling power ($\psi_i$ smaller than, say, 0.3 in eq. 16). Such variables are then best completely deleted and the computations redone for the reduced data set. We should note that variables must not be deleted solely on the basis of their discriminatory power, since this leads to a gross exaggeration of the differences between classes; the design of a method looking for differences must not be conditioned to maximize such differences. A reasonable compromise can be reached, however, by deleting only such variables which have both low modelling and low discrimination power.

6.5.7 Distance Between Two Classes.
In the same way as the discrimination power for single variables was calculated in section 6.5.4, we can calculate a distance between two classes using the residual standard deviations over all variables.

$$D^{(r,q)} = \left[ \frac{s_r^{(q)2} + s_q^{(r)2}}{s_0^{(q)2} + s_0^{(r)2}} \right]^{1/2} - 1 \qquad (21)$$

A distance $D^{(r,q)}$ close to zero indicates that the two classes r and q are virtually identical (with respect to the given data) and values larger than one indicate real differences. A quantitative test on the significance of $D^{(r,q)}$ is obtained by using the fact that $(D^{(r,q)} + 1)^2$ is approximately F-distributed, with $(N_r - A_q)(M - A_q) + (N_q - A_r)(M - A_r)$ and $(N_r - A_r - 1)(M - A_r) + (N_q - A_q - 1)(M - A_q)$ degrees of freedom.
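A minimal sketch of the class distance follows. The names are hypothetical, and the forms assumed for eq. (13) (mean squared residual per object and per degree of freedom) and eq. (21) (cross-fit over own-fit variances) are readings of the text rather than confirmed formulas.

```python
import math

def cross_fit_sd(resid, A):
    """Residual SD of N_r objects fitted to another class model (assumed eq. 13).

    resid : M x N_r residual matrix, A : number of terms of the model used
    """
    M, N = len(resid), len(resid[0])
    ss = sum(e * e for row in resid for e in row)
    return math.sqrt(ss / (N * (M - A)))

def class_distance(s_r_on_q, s_q_on_r, s0_q, s0_r):
    """Distance D of eq. (21) and the F-ratio (D + 1)^2 used to test it."""
    ratio = (s_r_on_q ** 2 + s_q_on_r ** 2) / (s0_q ** 2 + s0_r ** 2)
    D = math.sqrt(ratio) - 1.0
    return D, (D + 1.0) ** 2

s = cross_fit_sd([[1.0, 1.0], [1.0, 1.0], [0.0, 0.0]], A=1)
D, F = class_distance(6.0, 8.0, 3.0, 4.0)
```

In this made-up case the cross-fit residuals are twice the own-fit residuals, so D equals one, the value the text takes as the threshold for real differences.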
6.6 Classification of New Objects (Test Set).
In order to assign objects in the test set to the classes, each of these objects is fitted to each class model according to eq. (2). The residual standard deviation (eq. 3) corresponds to the orthogonal distance between object p and class q (Figure 5). However, when calculating th[...] class one also obtain[...] range of $\theta_{ak}$ [...] class. If an object p, when fitted to the class model (eq. 2), gets coefficients $t_a$ falling outside the normal range of $\theta_a$ for the class, one instead calculates the "distance" between object p and class q as $d_p^{(q)}$ (Figure 5):

Figure 5. When a point p gets coefficients t (eqn. 2) outside the "normal range" of a class q (eqn. 23, 24), the distance between the object p and the class is measured by $d_p^{(q)}$ (eqn. 22).
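$$d_p^{(q)} = \left[\;\ldots\;\right]^{1/2} \qquad (22)$$

Here the summation is made only over terms where $t_a$ is outside the normal range $\theta_{a,\mathrm{low}}$, $\theta_{a,\mathrm{high}}$. This range is defined by the minimal and maximal values of $\theta_{ak}$ plus or minus t*/2 standard deviations of $\theta_a$, where t* is the t distribution with $N_q$ degrees of freedom:

$$\theta_{a,\mathrm{high}}^{(q)} = \theta_{a,\max}^{(q)} + (t^*/2)\, s_{\theta,a}^{(q)} \qquad (23)$$

$$\theta_{a,\mathrm{low}}^{(q)} = \theta_{a,\min}^{(q)} - (t^*/2)\, s_{\theta,a}^{(q)} \qquad (24)$$

$$s_{\theta,a}^{(q)2} = \sum_{k=1}^{N_q} \theta_{ak}^{(q)2} \Big/ N_q \qquad (25)$$

The coefficient Φ in eq. (22) is introduced to make $s_p^{(q)2}$ and $(t_a - \theta_{a,\mathrm{lim}})^2$ comparable:

$$\Phi = [\ldots] \Big/ s_{\theta,a}^{(q)} \qquad (26)$$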
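The "normal range" of a component can be sketched as below. This assumes that eqs. (23) to (25) extend the observed minimum and maximum of the θ-values by t*/2 standard deviations, with the standard deviation taken about zero; the function names are hypothetical and the value `t_star` must be supplied by the user from a t-table.

```python
import math

def normal_range(theta, t_star):
    """Return (low, high) limits for one component of a class.

    theta  : list of the N_q values theta_ak of the training objects
    t_star : critical t-value chosen by the user (not computed here)
    """
    n = len(theta)
    s_theta = math.sqrt(sum(v * v for v in theta) / n)   # assumed eq. (25)
    half_width = 0.5 * t_star * s_theta
    return min(theta) - half_width, max(theta) + half_width

def excess(t, low, high):
    """How far a fitted coefficient t lies outside the range (0 if inside)."""
    if t < low:
        return low - t
    if t > high:
        return t - high
    return 0.0

low, high = normal_range([-1.0, 1.0, -1.0, 1.0], t_star=2.0)
outside = excess(3.0, low, high)
```

Only components with a non-zero excess would enter the summation mentioned in the text.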
The object p is then assigned to the class showing the best fit, i.e., the smallest $d_p^{(q)}$, provided that this $d_p^{(q)}$ is not much larger than the standard deviation of the class according to an F-test with $(M - A_q)$ and $(N_q - A_q - 1)(M - A_q)$ degrees of freedom:

$$F = d_p^{(q)2} \Big/ s_0^{(q)2} \qquad (27)$$

Furthermore, a unique assignment is obtained only when the ratio between the next smallest (here denoted by $d_p^{(r)2}$) and the smallest ($d_p^{(q)2}$) residual variances is larger than a critical F-value with $M - A_r$ and $M - A_q$ degrees of freedom:

$$F = d_p^{(r)2} \Big/ d_p^{(q)2} \qquad (28)$$
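The two F-ratios of eqs. (27) and (28) can be combined into a simple decision rule, sketched below. The function name, the dictionary layout and the return labels are hypothetical. Two simplifications are assumptions of this sketch: the critical values are supplied by the caller instead of being computed, and a single critical value is used for eq. (28) although its degrees of freedom depend on the pair of classes.

```python
def classify(d, s0, f_fit_crit, f_unique_crit):
    """Classify one object from its distances to each class.

    d             : dict class -> distance d_p of the object to that class
    s0            : dict class -> total residual SD of the class (eq. 9)
    f_fit_crit    : dict class -> critical F-value for eq. (27)
    f_unique_crit : critical F-value for eq. (28) (one value, a simplification)
    """
    # Classes the object fits within their typical standard deviation.
    fitting = [q for q in d if d[q] ** 2 <= f_fit_crit[q] * s0[q] ** 2]
    if not fitting:
        return "new", []
    fitting.sort(key=lambda q: d[q])
    best = fitting[0]
    # Rivals whose variance ratio to the best class is not significant.
    rivals = [r for r in fitting[1:]
              if d[r] ** 2 <= f_unique_crit * d[best] ** 2]
    if rivals:
        return "ambiguous", [best] + rivals
    return "unique", [best]

s0 = {"A": 1.0, "B": 1.0}
crit = {"A": 4.0, "B": 4.0}
result = classify({"A": 1.0, "B": 3.0}, s0, crit, 4.0)
```

Comparing only the classes that pass the fit test is a design choice of the sketch; the three labels correspond to a unique assignment, an ambiguous one, and an object that fits no class.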
To conclude: the classification of an object p in the test set can give several different results.

a. Object p is uniquely assigned to class q. It fits the closest class q within the typical standard deviation of that class (F in eq. (27) is smaller than the critical value). In addition, the distance to the next closest class is much larger according to the F-test in eq. (28).

b. Object p fits several classes $q_1, q_2, \ldots, q_n$. It fits each of these classes within the typical standard deviations of the classes (eq. 9), but the ratios between the distances to the classes $q_1, \ldots, q_n$ are not different according to eq. (28). The reason for this ambiguity can be two: either the data are not sufficient to distinguish between the different classes with respect to object p (more variables with better discrimination power should then be measured) or the object actuall[y ...] being either a g[...] properties of several classes, like a compound with several functional groups.

c. Object p fits no class within its typical standard deviation (eq. 9). We then conclude that the object is of a new, hitherto not seen, type, a member of a new class.

6.7 Missing Data.
The presentation has so far been implicitly based on the assumption that the data matrix of each class reference set has been complete, i.e., all elements $y_{ik}$ defined by measured values. In many chemical problems, however, the class data matrices are incomplete due either to the difficulty of measuring all values, or to the absence of a defined value for some variables and objects; e.g., a compound may have an ill-defined melting point or a certain peak might be missing in the IR spectrum of a compound.

The model (1) may still be fitted to an incomplete data matrix using an extended NIPALS method (23). All degrees of freedom in the expressions for the various residual standard deviations must then be changed to take missing observations into account, but that presents no problem: 1 is subtracted from the denominators for each observation missing in the calculation.

The determination of the number of product terms in the models (section 6.4) can hopefully also be modified to cope with missing data. Presently, however, with missing data one must use a standard value of $A_q$. We recommend using $A_q$ = 2 or 3 unless M is smaller than 10, when $A_q$ should be smaller than M/4 ($A_q$ = 1 is always admissible, however, even if M = 2 or 3).

6.8 Graphical Methods.
The model (1) has a simple representation in M-space, being a line (A = 1) or an A-dimensional hyperplane. In cases when A = 1 gives a good representation of the data, this can also be seen in linear projections of M-space down onto two-space. Such projections also project the line of the class model to a line in the two-dimensional space. This is the basis for the following graphical method, which can easily be applied with the help of graph paper and a calculator when the number of variables M is smaller than 5 or 6 (illustrated below with four variables).

a. Scale the variables by dividing them by their range over all data.
b. Divide the variables "randomly" into two groups, say group A containing variables 1 and 4 and group B containing variables 2 and 3.
c. Plot, for each object, the sum of the variables in group A against the sum of the variables in group B, i.e., $(y^*_{1k} + y^*_{4k})$ against $(y^*_{2k} + y^*_{3k})$, the asterisks denoting that the y-values are scaled.
d. Plot instead the difference between the variables in group A against the difference between the variables in group B, i.e., $(y^*_{1k} - y^*_{4k})$ against $(y^*_{2k} - y^*_{3k})$.
e. Redistribute the variables, creating two new groups A' and B', say A' = variables 1 and 3 and B' = variables 2 and 4.
f. Go through steps c and d for these two new groups.
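The steps a to f above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' program: it assumes NumPy, a data matrix with objects in rows and exactly four variables in columns, and the function names `range_scale`, `projection_pair` and `four_projections` are hypothetical.

```python
import numpy as np

def range_scale(Y):
    """Step a: divide each variable (column) by its range over all objects."""
    Y = np.asarray(Y, dtype=float)
    span = Y.max(axis=0) - Y.min(axis=0)
    return Y / span

def projection_pair(Ys, group_a, group_b):
    """Steps c and d: sum and difference coordinates for two groups of two variables."""
    a1, a2 = group_a
    b1, b2 = group_b
    sums = np.column_stack([Ys[:, a1] + Ys[:, a2], Ys[:, b1] + Ys[:, b2]])
    diffs = np.column_stack([Ys[:, a1] - Ys[:, a2], Ys[:, b1] - Ys[:, b2]])
    return sums, diffs

def four_projections(Y):
    """Steps a to f for four variables; column indices are zero-based."""
    Ys = range_scale(Y)
    # Step b: A = variables 1 and 4, B = variables 2 and 3.
    s1, d1 = projection_pair(Ys, (0, 3), (1, 2))
    # Step e: A' = variables 1 and 3, B' = variables 2 and 4.
    s2, d2 = projection_pair(Ys, (0, 2), (1, 3))
    return [s1, d1, s2, d2]

# Hypothetical toy data: five objects lying exactly on a line in 4-space.
t = np.arange(5.0)
Y_toy = np.array([0.0, 1.0, 0.0, 1.0]) + np.outer(t, [1.0, -2.0, 3.0, 0.5])
plots = four_projections(Y_toy)
```

Each element of `plots` is an array of two-dimensional plotting coordinates, one row per object. Because every projection is linear, objects that lie on a line in M-space stay on a line (or collapse to a point) in each plot, which is the property the method relies on.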
In this way one obtains four different plots of the data, and patterns that are consistent in two or three of the plots are indications of real regularities also in M-space. As with all methods based on projections, it is important to realize that a single plot might reveal apparent regularities which in fact are a result of the choice of projection. One must therefore always use at least two different projections as graphical illustrations of data and results. When a computer is used, a large number of different projections can be generated more or less automatically. The reader is referred to Kowalski (25) for an illuminating discussion.

6.9 The Norbornane Example.
The analysis of the norbornane data gave the following results, the chemical significance of which is discussed in the next section.

0. The data were first autoscaled by subtracting the variable means and dividing by 4.69 times the standard deviations (bottom of Table I). All the results refer to the scaled data.
1. The dimensionality of the data was found by applying cross-validation separately to the two class matrices defined by the data in Table I. This shows that each class is best described with model (1) having A = 1.
2. Fitting model (1) to the two class matrices gives the parameter values of the two classes shown in Table II for the three variables containing any information (see below). Variables with low discrimination and modeling power were deleted stepwise from further analysis, and the computations were redone for the reduced data matrices (see Table III).
3. The data in the learning set were fitted to the two class models (3 variables), giving the results in Table IV.
4. The data in the test set were fitted to the two class models (3 variables), giving the results in Table V.
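Steps 0 to 2 can be illustrated with a short Python sketch. Several things here are assumptions rather than the authors' procedure: model (1) is taken to have the usual principal-components form $y_{ik} = \alpha_i + \sum_a \beta_{ia}\theta_{ak} + \varepsilon_{ik}$, a singular value decomposition is swapped in for the NIPALS algorithm, the residual variance uses $(N - A - 1)(M - A)$ degrees of freedom, and the names `autoscale` and `fit_class_model` are hypothetical. The optional `factor` argument only mimics a multiple of the standard deviation such as the 4.69 quoted in step 0.

```python
import numpy as np

def autoscale(Y, factor=1.0):
    """Subtract column means and divide by factor * standard deviation."""
    Y = np.asarray(Y, dtype=float)
    mean = Y.mean(axis=0)
    std = Y.std(axis=0, ddof=1)
    return (Y - mean) / (factor * std), mean, std

def fit_class_model(Yc, A=1):
    """Fit an A-component PC model to one class matrix (objects x variables)."""
    Yc = np.asarray(Yc, dtype=float)
    N, M = Yc.shape
    alpha = Yc.mean(axis=0)                 # class mean of each variable
    X = Yc - alpha
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    beta = Vt[:A].T                         # M x A loadings, unit length
    theta = X @ beta                        # N x A object-specific scores
    E = X - theta @ beta.T                  # residuals
    # Assumed degrees of freedom for the class residual variance.
    s0_sq = (E ** 2).sum() / ((N - A - 1) * (M - A))
    return alpha, beta, theta, s0_sq

# Hypothetical toy class: six objects, three variables, exactly one component.
b_true = np.array([2.0, -1.0, 2.0]) / 3.0
t = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
Y_class = np.array([1.0, 0.5, -0.5]) + np.outer(t, b_true)
alpha, beta, theta, s0_sq = fit_class_model(Y_class, A=1)
```

In a real analysis the number of components A would first be chosen by cross-validation (step 1), and the fit would be repeated after deleting variables with low discrimination and modeling power (step 2); neither of these is shown in the sketch.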
6.9.1 Classification of Exo and Endo 2-Substituted Norbornanes.

The results from the SIMCA classification show that only the so-called γ-carbons (C4, C6 and C7) contain information on whether a compound is an endo or exo 2-substituted norbornane. Thus the shift differences observed for the α, β and δ carbons contain only
Table I. Relative $^{13}$C NMR shifts Δδ in ppm for the C1-C7 carbons in the norbornane framework (a). The Δδ values are obtained as shift differences between the actual structure and the unsubstituted norbornane (X = H).

Comp.  Substituent       Class   Δδ1k    Δδ2k    Δδ3k    Δδ4k    Δδ5k    Δδ6k    Δδ7k
  1    -CH3 (exo)          1      6.7     6.7    10.1     0.5     0.2    -1.1    -3.7
  2    -NH2 (exo)          1      8.9    25.3    12.4    -0.4    -1.2    -3.1    -4.4
  3    -OH (exo)           1      7.7    44.3    12.3    -1.0    -1.3    -5.2    -4.1
  4    -F (exo)            1
  5    -CN (exo)           1      5.5     1.0     6.3    -0.3    -1.5    -1.6    -1.3
  6    -COOH (exo)         1      4.6    16.7     4.4    -0.2    -0.3    -1.0    -1.8
  7    -CO2CH3 (exo)       1      5.1    16.4     4.2    -0.4    -1.1    -1.4    -2.1
  8    -CH2OH (exo)        1      1.8    15.1     4.4    -0.2     0.2    -0.7    -3.3
  9    -CH3 (endo)         2      5.4     4.5    10.6     1.4     0.5    -7.7     0.2
 10    -NH2 (endo)         2      6.8    23.3    10.5     1.2     0.6    -9.5     0.3
 11    -OH (endo)          2      6.3    42.4     9.5     0.9     0.2    -9.7    -0.9
 12    -CN (endo)          2      3.4     0.1     5.5     0.2    -0.7    -4.9     0.0
 13    -COOH (endo)        2      4.2    16.2     2.1     0.9    -0.6    -4.8     1.9
 14    -CO2CH3 (endo)      2      4.0    15.9     2.2     0.7    -0.7    -5.0     1.7
 15    -CH2OH (endo)       2      1.7    12.8     4.0     0.4     0.2    -7.2     1.4

mean value                        5.37   15.8     8.02    0.25   -0.36   -3.65   -1.22
standard dev.                     1.68   15.8     3.15    0.78    0.75    3.44    2.31

(a) Data from Grutzner, J. B., Jautelat, M., Dence, J. B., Smith, R. A., and Roberts, J. D.; J. Amer. Chem. Soc. (1970), 92, 7107.
Table II. Parameters for the exo and endo class models

                   i     α_i      β_i1
class 1 (q=1)      4    -0.201    0.775
                   6     0.058    0.613
                   7    -0.165    0.149
class 2 (q=2)      4     0.153    0.495
                   6    -0.206   -0.764
                   7     0.172   -0.415
Table III. Discrimination (Φ) and modelling (Ψ) power (see sections 6.5.3 and 6.5.5) for the norbornane carbons as a function of the number of variables fitted to eq. (9). Values are shown for seven and three variables.

                        i=1    i=2    i=3    i=4    i=5    i=6    i=7
seven variables   Φ_i   0.4    1.     1.     14.    5.     46.    20.
                  Ψ_i   0.23   0.25   0.45   0.36   0.50   0.60   0.55
three variables   Φ_i                        22.           69.    23.
                  Ψ_i                        0.72          0.74   0.50
Table IV. The result of the SIMCA classification of the compounds 1-15 in the learning set, with A = 1 in eq. (1). Residual variance: class 1, 0.012; class 2, 0.012. F-test according to eq. (15).

Compound   Known    Standard deviation   Standard deviation      F*
no.        class    (closest class)      (next closest class)    (closest class)
  1          1        0.10 (1)             0.37 (2)                1.1 (1)
  2          1        0.10 (1)             0.24 (2)                1.1 (1)
  3          1
  4          1        0.03 (1)             0.57 (2)                0.1 (1)
  5          1        0.11 (1)             0.25 (2)                1.3 (1)
  6          1        0.07 (1)             0.28 (2)                0.5 (1)
  7          1        0.07 (1)             0.31 (2)                0.5 (1)
  8          1        0.05 (1)             0.37 (2)                0.3 (1)
  9          2        0.08 (2)             0.44 (1)                0.7 (2)
 10          2        0.03 (2)             0.48 (1)                0.1 (2)
 11          2        0.07 (2)             0.43 (1)                0.6 (2)
 12          2        0.12 (2)             0.25 (1)                1.4 (2)
 13          2        0.09 (2)             0.38 (1)                0.9 (2)
 14          2        0.04 (2)             0.36 (1)                0.2 (2)
 15          2        0.09 (2)             0.39 (1)                0.9 (2)

* F_crit = 4.0
Table V. The result of the SIMCA classification of the compounds 16-43 in the test set. F-test according to eq. (27).

Compound   Known    Standard deviation   F*                Standard deviation      F* (right class
no.        class    (closest class)      (closest class)   (next closest class)    if not closest)
 16          1        0.13 (1)             1.3 (1)           0.42 (2)
 17          2        0.13 (2)             1.3 (2)           0.27 (1)
 18          1        0.09 (1)             0.6 (1)           0.40 (2)
 19          2        0.11 (2)             1.0 (2)           0.43 (1)
 20          1        0.08 (1)                 (1)                (2)
 21          2        0.07 (1)             0.5 (1)           0.26 (2)                5.6 (2)
 21'         2        0.10 (2)             0.8 (2)           0.25 (1)
 22          1        0.08 (1)             0.5 (1)           0.36 (2)
 23          2        0.20 (1)             3.2 (1)           0.33 (2)
 23'         2        0.10 (2)             0.8 (2)           0.33 (1)
 24          1        0.08 (1)             0.5 (1)           0.41 (2)
 25          2        0.09 (2)             0.6 (2)           0.45 (1)
 26          1        0.06 (1)             0.3 (1)           0.38 (2)
 27          2        0.09 (2)             0.6 (2)           0.30 (1)
 28          1        0.09 (1)             0.5 (1)           0.42 (2)
 29          2        0.19 (2)             3.0 (2)           0.35 (1)
 30          1        0.10 (1)             0.8 (1)           0.45 (2)
 31          2        0.30 (2)             7.7 (2)           0.39 (1)
 32          1        0.09 (1)             0.7 (1)           0.40 (2)
 33          2        0.04 (2)             0.2 (2)           0.29 (1)
 34          1        0.14 (1)             1.5 (1)           0.44 (2)
 35          2        0.30 (2)             7.8 (2)           0.37 (1)
 35'         2        0.15 (2)             2.0 (2)           0.48 (1)
 36          1        0.11 (1)             0.9 (1)           0.41 (2)
 37          2        0.07 (2)             0.5 (2)           0.45 (1)
 38          1        0.09 (1)             0.6 (1)           0.41 (2)
 39          2        0.05 (2)             0.2 (2)           0.36 (1)
 40          1        0.18 (2)                 (2)                (1)                    (1)
 41          2        0.32 (2)             8.7 (2)           0.50 (1)
 42          1        0.10 (2)             0.8 (2)           0.32 (1)                9.5 (1)
 43          2        0.40 (1)            13.1 (1)           0.44 (2)               16.1 (2)

* F_crit = 4.0
individual effects of the substituent but no specific information on whether the substituent is situated in the exo or endo position. The finding that models with dimensionality A = 1 describe the data well shows that the shift parameters consist of an individual substituent effect, $\theta_k$, in addition to the exo/endo effects, $\alpha_i$ and $\beta_i$. The classification can be visualized by the plotting method described earlier (see section 6.8). In this case two well-separated classes are obtained by plotting the difference between the relative shifts for C7 and C6 against the relative shift for C4. The variables were rescaled to 2.73 C4, 1.0 C6 and 1.43 C7, where the coefficients are obtained by dividing the experimental width for each variable by that of the variable with the minimum width observed (C6). (See Figures 4, 7 and 8.)

The parameters obtained from the learning set can be used to clarify whether other exo or endo 2-substituted compounds closely related to the norbornanes behave in a similar way. Therefore a number of such compounds were chosen to form a test set (compounds 16-43, see Figure 6). The autoscaled shift differences were fitted to the parameters obtained for the training sets (α and β are fixed to the values obtained in the analysis of the reference sets). The goodness of fit of the nonclassified objects to the class models can be estimated by an F-test, comparing the residual variances for these objects with the residual variances of the reference sets. (See Table V.) Most of the structures are well classified, i.e., they have the same variance as observed for the reference sets, a fact [...] assignments in most cases. (See Figure 7.) However, some exceptions can be noted (Figure 8).
(a) The tentative assignment given for the C4 and C2 carbons in structure 23 does not provide a correct classification. The reversed assignment for C4 and C2 classifies this structure well.
(b) For compound 21, two different assignments have been proposed for C5 and C7. One of these classifies 21 correctly.
(c) With the proposed assignment for 35, this compound fits neither class. However, after reversing the assignment for C2 and C4, the compound fits nicely to the endo class.
(d) A group of compounds, the terpenes 40-43, have been found not to be well classified with the parameters from the learning set, showing that the shifts for compounds with methyl groups on C7 do not behave in the same way as the shifts for the compounds in the learning set.
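The fitting of a new object to the class models, with α and β held fixed, can be sketched as follows. The α and β values are those of Table II and the class residual variance 0.012 is the one quoted with Table IV; everything else is an assumption made for illustration: the function names `fit_object` and `classify`, the simple variance ratio used in place of eqs. (15) and (27), which are not reproduced in this section, the threshold `f_crit = 4.0`, and the object `y_new`, which is hypothetical and not a measured compound. The divisor M − A follows the normalized distance $d/\sqrt{M-1}$ given for A = 1 in the caption of Figure 8.

```python
import numpy as np

def fit_object(y, alpha, beta):
    """Least-squares fit of one scaled object to a class model with fixed alpha, beta."""
    y = np.asarray(y, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float).reshape(len(alpha), -1)   # M x A
    theta, *_ = np.linalg.lstsq(beta, y - alpha, rcond=None)
    resid = y - alpha - beta @ theta
    M, A = beta.shape
    s_obj = np.sqrt((resid ** 2).sum() / (M - A))   # residual standard deviation
    return theta, s_obj

def classify(y, models, f_crit=4.0):
    """Return the closest class, whether the fit passes the assumed F threshold, and all fits."""
    results = {}
    for label, (alpha, beta, s0_sq) in models.items():
        _, s_obj = fit_object(y, alpha, beta)
        results[label] = (s_obj, s_obj ** 2 / s0_sq)   # (std. dev., variance ratio)
    best = min(results, key=lambda q: results[q][0])
    return best, results[best][1] <= f_crit, results

# Class models for carbons C4, C6, C7 (Table II) with residual variance 0.012.
models = {
    1: ([-0.201, 0.058, -0.165], [0.775, 0.613, 0.149], 0.012),
    2: ([0.153, -0.206, 0.172], [0.495, -0.764, -0.415], 0.012),
}

# Hypothetical object placed exactly on the class 1 line.
y_new = np.array(models[1][0]) + 0.3 * np.array(models[1][1])
best, fits, results = classify(y_new, models)
```

An object is assigned to the class giving the smallest residual standard deviation, and the variance ratio then indicates whether the object fits that class about as well as the reference compounds do. An object such as the terpenes in item (d) would show a large ratio for both classes.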
7. Applications

The SIMCA methodology is rather new, and the number of published applications is therefore small. Most of the early applications concern structure-reactivity relations of the Hammett-equation type, with the data analysis of a single class (12, 26-28). A single class
Figure 6. The test set of exo and endo norbornanes and related compounds. Even numbers refer to the exo compounds, the odd numbers to the endo compounds.
application in analytical chemistry addresses the comparison of analytical methods with respect to their performance on real data (29). The earliest use of SIMCA in pattern recognition with several classes was made by Duewer, Kowalski and Schatzki (30), who analyzed the 22-dimensional data of 40 classes of simulated oil spills. They showed that it is possible to determine the source of an oil spill on the basis of such data. This was followed by the test of SIMCA on the classical Iris data of Fisher (5) and the analysis of spectral data (IR and UV) of two classes of unsaturated carbonyl compounds (12). Sjöström and Edlund analyzed the $^{13}$C NMR data of norbornanes and chlorinated biphenyls (4). The norbornane analysis has been used in the present article as an illustration. Strouf and Wold have used SIMCA to try to find regularities [...] and unstable complexes [...] features such as the size and electronegativity of ligands and central atoms, a 75% prediction rate of the stability of the complexes was achieved, a rate which is significantly better than chance. The similarity model (1) has also been used in cluster analysis, in the grouping of liquid phases for gas chromatography on the basis of retention data of 10 standard compounds (32).

We will not discuss here in detail areas where SIMCA has possible applications. Chemistry is full of problems which can naturally be formulated in terms of similarity assessment and/or classification. Suffice it here to mention structural determination of unknown compounds on the basis of spectra, source determination of samples (oil spills, blood stains, paper samples), type assessment of biological individuals (chemotaxonomy), relations between structure and reactivity and, exciting but difficult, the efforts to find relations between chemical structure and biological activity (mutagenicity, toxicity, etc.).

8. Future Developments
Two main lines of development can be seen for the SIMCA methodology. The first is to extend the similarity models beyond the simple PC models of eq. (1) to cope with particular types of problems where the PC model is less efficient. One example is when one wishes to introduce causality in the model. Take, for instance, the norbornane application discussed above, and let us pose the problem of how the variation of $^{13}$C NMR spectra within each class depends on the structure
Figure 8. (a) Classification of the endo compounds 21, 23, and 35 with two different assignments of their $^{13}$C NMR spectra. The reversed spectra are denoted 21', 23', and 35'. Compare results in Table V. (b) The terpenes 40-43 and compound 31 are not well classified by the learning set. d (broken line) is the difference between an observation and the line equation for the corresponding class. The normalized distance $d/\sqrt{M-1}$ (M = number of variables) is significantly larger than the typical normalized distance (residual standard deviation for the class) for the observations in the reference set. Compare F-test, Table V.
of the compounds. If we describe each structure by means of a number of structural descriptors, say, the size, electronegativity and Taft σ-value of the 12 substituents in positions 1 through 7 (including hydrogen), we have 36 structural descriptors for each compound (block 1). These, together with the 7 NMR shifts (block 2), give 43 variables for each compound. However, we now wish to analyze the data in terms of cause and effect; we wish to "explain" the variation of the latter 7 variables (block 2) as the "effect" of the variation of the former 36 (block 1). This can be done by expanding the model (1) to two PC models, one for each block of variables, plus a multiple relation between the θ-vectors in block 1 and the θ-vectors in block 2. Such "path" models, which are currently being developed for use in econometrics, sociology and other social sciences (33, 34), will have many interesting applications to chemical data.

The second line of development involves the application of the SIMCA methodology to data where the grouping (classes) is not initially known. Data analysis aimed at finding the "natural" grouping of a number of objects with respect to multidimensional data measured on the objects is usually called cluster analysis. A SIMCA cluster analysis would thus try to group the objects so that within each of the resulting groups (clusters) the data would be well described by model (1) (or one of its path-model extensions). Mathematically and computationally, this is a much more difficult problem than the one where the classes are defined a priori, but the results are also more rewarding in terms of new knowledge. Possible chemical applications of cluster analysis are as many as those of classification. Any large collection of chemical systems, say solvents, reactions of a certain type, or compounds of a certain type, can be subjected to a cluster analysis and, if a natural grouping is found, the resulting clusters correspond to a subclassification of the chemical systems which can then be utilized practically and/or interpreted theoretically.

9. Discussion
The analysis and interpretation of complex relationships and the behavior of complex systems such as those investigated in chemistry is often done using the concepts of similarity and classes. In the typical case one has observed a relatively large amount of data on each system (spectra, thermodynamic data, GC and trace element "spectra") but has relatively little theoretical information about the fundamental processes generating the data. By dividing the systems into classes so that the systems display a regular behavior (similarity) within each class, the analysis is considerably simplified; the complexity of a large collection of individual systems is reduced to the moderate complexity of a small number of classes. The quantitative analysis of the data in each class by means of simple mathematical models such as eq. (1) makes it possible to extract most of the information contained in the data in a way which is easily interpreted. The linear structure of model (1) makes a graphical display of data and results straightforward.

What might be important is that SIMCA does not [...] each class. The cross-validatory estimation of the number of components (A) assures that only "real" regularities are described in the data. Thus, if there is no difference between two (or several) classes with respect to the data, this is found by the method and the chemist need not interpret artifacts introduced by the problem formulation.

It is philosophically interesting that the quantitative similarity model (1) seems to correspond closely to the intuitive meaning of similarity as the concept is used in chemistry. Thus, data observed on chemical systems which are said to be similar can usually be described by model (1) with one term (A = 1).

When comparing SIMCA with other methods of quantifying similarity, we note that SIMCA often performs better on real data sets than "optimal" methods such as the K-nearest neighbor method (5, 30). The comparisons made on real data are too few to allow any definite conclusions, however.
Theoretically, SIMCA has the advantage over distance-based methods of being only mildly dependent on the scaling of the individual variables. The independent handling of each class makes SIMCA insensitive to the number of classes, which is a severely limiting factor with linear discriminant analysis and similar methods such as the linear learning machine. Another severe limitation with most methods is that the number of variables must not approach the number of objects in one class. The number of available parameters then approaches the number of data, with disastrous consequences for the robustness
of the results. SIMCA extracts only A components (eq. 1) from the data in each class, and as long as A is small compared to the number of variables and objects (less than 1/4) in the classes, the results are robust even if the number of variables is very large. Missing data are easily handled with SIMCA, a necessary condition for any method to be used in routine application.

We believe that SIMCA, being specifically developed for chemical data analysis, has many features which make it useful for these purposes. So far, we have not detected any drawbacks with the method; it seems sufficiently flexible to adapt to most patterns revealed by real data. The experience with the method is still limited, however, and the application of the method [...] data sets and the continuing [...] methods will provide further information about advantages and drawbacks of SIMCA in particular and quantitative similarity models in general.

Finally, we wish to stress that mathematical methods for data analysis such as SIMCA are in no way a substitute for chemical skill or intuition. Rather, analogously to a good instrumental method which simplifies the collection of accurate and relevant measurements, the mathematical method facilitates the extraction of relevant information from a given set of data. Thus, the chemist is freed from much of the drudgery of making lengthy calculations and can instead address himself to the more rewarding tasks of formulating problems, designing crucial experiments producing relevant data, and interpreting the results of the data analysis.

Literature Cited
1. Theobald, D. W.; Chem. Soc. Rev. (1976), 5, 203.
2. Schoenfeld, P. S. and DeVoe, J. R.; Anal. Chem. (1976), 48, 403 R.
3. Bard, Y. and Lapidus, L.; Catalysis Rev. (1968), 2, 67.
4. Sjöström, M. and Edlund, U.; J. Magn. Resonance (1977), 25, 285.
5. Wold, S.; Pattern Recognition (1976), 1, 127.
6. Kanal, L.; IEEE Trans. Inform. Theory (1974), 20, 697.
7. Cacoullos, T. (Ed.); "Discriminant Analysis and Applications", Academic Press, New York, 1973.
8. Jurs, P., and Isenhour, T.; "Chemical Applications of Pattern Recognition", Wiley, New York, 1975.
9. Kowalski, B. R.; in "Computers in Chemical and Biomedical Research", Vol. 2 (Eds., C. E. Klopfenstein and C. L. Wilkins), Academic Press, New York, 1974.
10. Kowalski, B. R.; Anal. Chem. (1975), 47, 1152 A.
11. Chapman, N. B. and Shorter, J. (Eds.); "Advances in Linear Free Energy Relationships", Vol. 1, Plenum Press, London, 1972.
12. Wold, S., and Sjöström, M.; in "Advances in Linear Free Energy Relationships" (Eds., N. B. Chapman and J. Shorter), Vol. 2, Plenum Press, London, 1978.
13. Bell, R. P.; "The Proton in Chemistry", Methuen, London, 1959.
14. Hammett, L. P.; 2nd ed., McGraw-Hill.
15. Hammond, G. S.; J. Chem. Educ. (1974), 51, 558.
16. Anders, O. U.; Anal. Chem. (1972), 44, 1930.
17. Howery, D. G.; Int. Labor. (1976), March/April, 11.
18. Weiner, P. H. and Howery, D. G.; Anal. Chem. (1972), 44, 1189.
19. Malinowski, E. R., and Weiner, P. H.; "Factor Analysis in Chemistry", to be published.
20. Draper and Smith; "Applied Regression Analysis", Wiley, New York, 1966.
21. Wold, H.; in "Festschrift for Jerzy Neyman" (Ed., F. N. David), Wiley, New York, 1966.
22. Wold, H.; in "Multivariate Analysis" (Ed., P. R. Krishnaiah), Academic Press, New York, 1966.
23. Wold, S.; to be published.
24. Box, G. E. P. and Cox, D. R.; J. Roy. Statist. Soc. B (1974), 36, 11.
25. Kowalski, B. R.; J. Amer. Chem. Soc. (1973), 95, 686.
26. Wold, S., and Sjöström, M.; Chem. Scripta (1972), 2, 49.
27. Sjöström, M., and Wold, S.; Chem. Scripta (1974), 6, 114.
28. Sjöström, M., and Wold, S.; Chem. Scripta (1976), 9, 200.
29. Carey, R. N., Wold, S., and Westgard, J. O.; Anal. Chem. (1975), 47, 1824.
30. Duewer, D. L., Kowalski, B. R., and Schatzki, T. F.; Anal. Chem. (1975), 47, 1973.
31. Strouf, O., and Wold, S.; Acta Chem. Scand. A (1977), in press.
32. Wold, S.; J. Chromatogr. Sci. (1975), 13, 525.
33. Wold, H.; in "Perspectives in Probability and Statistics" (Ed., J. Gani), Academic Press, New York, 1975.
34. Wold, H.; in "Quantitative Sociology" (Ed., H. M. Blalock), Academic Press, New York, 1975.