DATA HANDLING IN SCIENCE AND TECHNOLOGY - VOLUME 14
The data analysis handbook
DATA HANDLING IN SCIENCE AND TECHNOLOGY
Advisory Editors: B.G.M. Vandeginste and S.C. Rutan

Other volumes in this series:
Volume 1   Microprocessor Programming and Applications for Scientists and Engineers by R.R. Smardzewski
Volume 2   Chemometrics: A Textbook by D.L. Massart, B.G.M. Vandeginste, S.N. Deming, Y. Michotte and L. Kaufman
Volume 3   Experimental Design: A Chemometric Approach by S.N. Deming and S.L. Morgan
Volume 4   Advanced Scientific Computing in BASIC with Applications in Chemistry, Biology and Pharmacology by P. Valko and S. Vajda
Volume 5   PCs for Chemists, edited by J. Zupan
Volume 6   Scientific Computing and Automation (Europe) 1990, Proceedings of the Scientific Computing and Automation (Europe) Conference, 12-15 June, 1990, Maastricht, The Netherlands, edited by E.J. Karjalainen
Volume 7   Receptor Modeling for Air Quality Management, edited by P.K. Hopke
Volume 8   Design and Optimization in Organic Synthesis by R. Carlson
Volume 9   Multivariate Pattern Recognition in Chemometrics, illustrated by case studies, edited by R.G. Brereton
Volume 10  Sampling of Heterogeneous and Dynamic Material Systems: theories of heterogeneity, sampling and homogenizing by P.M. Gy
Volume 11  Experimental Design: A Chemometric Approach (Second, Revised and Expanded Edition) by S.N. Deming and S.L. Morgan
Volume 12  Methods for Experimental Design: principles and applications for physicists and chemists by J.L. Goupy
Volume 13  Intelligent Software for Chemical Analysis, edited by L.M.C. Buydens and P.J. Schoenmakers
Volume 14  The Data Analysis Handbook, by I.E. Frank and R. Todeschini
DATA HANDLING IN SCIENCE AND TECHNOLOGY - VOLUME 14
Advisory Editors: B.G.M. Vandeginste and S.C.Rutan
The data analysis handbook
ILDIKÓ E. FRANK, Jerll, Inc., 790 Esplanada, Stanford, CA 94305, U.S.A.
and
ROBERTO TODESCHINI Department of Environmental Sciences, University of Milan, 20133 Milan, Italy
ELSEVIER
Amsterdam - Lausanne - New York - Oxford - Shannon - Tokyo
1994
ELSEVIER SCIENCE B.V. Sara Burgerhartstraat 25 P.O. Box 211,1000 AE Amsterdam, The Netherlands
ISBN 0-444-81659-3 © 1994 Elsevier Science B.V. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior written permission of the publisher, Elsevier Science B.V., Copyright & Permissions Department, P.O. Box 521, 1000 AM Amsterdam, The Netherlands. Special regulations for readers in the USA - This publication has been registered with the Copyright Clearance Center Inc. (CCC), Salem, Massachusetts. Information can be obtained from the CCC about conditions under which photocopies of parts of this publication may be made in the USA. All other copyright questions, including photocopying outside of the USA, should be referred to the publisher. No responsibility is assumed by the publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. This book is printed on acid-free paper. Printed in The Netherlands
Introduction Organizing knowledge is another way to contribute to its development. The value of such an attempt is in its capability for training, education and providing deepening insights. Separating the organization from the production of knowledge is arbitrary. Both are essential to the advancement of a field. How many times have you looked for a short and accurate description of an unknown or vaguely familiar term encountered in a paper, book, lecture or discussion? How often have you tried to figure out whether two different terminologies in fact refer to the same method, or whether they are related to different techniques? How can you get a comprehensive, yet concise introduction to a topic in data analysis? It is our hope that this book will be helpful in these as well as in many other contexts. This handbook can be used in several ways at different levels. It can serve as a quick reference guide to a rapidly growing field where even an expert might encounter new concepts or methods. It can be a handy companion to text books for undergraduate or graduate students and researchers who are involved in statistical data analysis. It provides a brief and highly organized introduction to many of the most useful techniques. This handbook is written in a dictionary format; it contains almost 1700 entries in alphabetical order. However, unlike an ordinary dictionary, in which each entry is a separate, self contained unit, rarely connected to others by cross-reference, this book is highly structured. Our goal is to give not only definitions and short descriptions of technical terms, but also to describe the similarities, differences, and hierarchies that exist among them through extensive cross-referencing. We are grateful to many of our colleagues, who have contributed their expertise to this book. At the risk of doing injustice to others who are not named individually, we would like to thank Kim Esbensen, Michele Forina, Jerome Friedman, Bruce Kowalski, Marina Lasagni, Luc Massart, Barbara Ryan, Luis Sarabia, Bernard Vandeginste, Svante Wold for their valuable help.
Contents

Introduction ................................................. V
User's Guide ................................................ IX
Acronyms .................................................. XIII
Notation .................................................. XVII
The Data Analysis Handbook ................................... 1
References ................................................. 353
User's Guide

This handbook consists of definitions of technical terms in alphabetical order. Each entry belongs to one of the following twenty topics as indicated by their abbreviations:

[ALGE]   linear algebra
[ANOVA]  analysis of variance
[CLAS]   classification
[CLUS]   cluster analysis
[DESC]   descriptive statistics
[ESTIM]  estimation
[EXDE]   experimental design
[FACT]   factor analysis
[GEOM]   geometrical concepts
[GRAPH]  graphical data analysis
[MISC]   miscellaneous
[MODEL]  modeling
[MULT]   multivariate analysis
[OPTIM]  optimization
[PREP]   preprocessing
[PROB]   probability theory
[QUAL]   quality control
[REGR]   regression analysis
[TEST]   hypothesis testing
[TIME]   time series
Each topic is organized in a hierarchical fashion. By following the cross-references (underlined words) one can easily find all the entries pertaining to a topic even if they are not located together. Starting from the topic name itself, one is referred to more and more specific entries in a top-down manner. We have intentionally tried to collect related terms together under a single entry. This organization helps to reveal similarities and differences among them. Such "mega" entries are for example: classification, control chart, data, design, dispersion, distance, distribution, estimator, factor rotation, goodness-of-fit, hierarchical clustering, among others.

Each entry line contains the entry name, followed by the abbreviation of the topic to which it belongs, and finally by its (: Synonyms), if any. For example:

▸ artificial intelligence (AI) [MISC]
▸ class [CLAS] (: category)
There are three different kinds of entries: regular, synonym, and referenced. A regular entry has its definition immediately after the entry line. For example:
▸ association matrix [DESC]
Square matrix of order n defined as the product of a data matrix X(n, p) with its transpose:

A = X Xᵀ

A synonym entry is defined only by its synonym word indicated by the symbol : and typeset in italics. To find the definition of a synonym entry, one goes to the text under the entry line of the synonym word typeset in italics. For example:

▸ least squares regression (LS) [REGR] : ordinary least squares regression
A referenced entry has its definition in the text of another entry, indicated by the symbol → and typeset in italics. To find the definition of a referenced entry, one references the text under the entry line of the word in italics. For example:

▸ agglomerative clustering [CLUS] → hierarchical clustering
The text of a regular entry may include the definition of one or more referenced entries highlighted in bold face letters. When there are many referenced entries collected under one regular entry, called a "mega" entry, they are organized in a hierarchical fashion. There may be two levels of subentries in a mega entry, the first indicated by its own marker and the second by the symbol ○. For example, the subentries in the mega entry "hierarchical clustering" are:

▸ hierarchical clustering [CLUS]
  agglomerative clustering
    ○ average linkage
    ○ centroid linkage
    ...
    ○ weighted average linkage
    ○ weighted centroid linkage
  divisive clustering
    ○ association analysis
    ...
    ○ Williams-Lambert clustering
In the text of a regular entry one is referred to other relevant terms by underlined words. We highly recommend reading also the definitions of these underlined words in conjunction with the original entry. We have made a special effort to keep mathematical notation simple and uniform. A collection of the most often appearing symbols is found on page XVII. There are several figures and tables throughout the book to enhance the comprehension of the definitions. A list of acronyms helps decipher and locate the full terminologies given in the book. Finally, we have included a list of references. Although far from complete, this bibliography reflects our personal preferences and suggestions for further reading. Books and important papers, each flagged by its own symbol, are organized according to the same topical scheme as the entries of the handbook.
Acronyms

ACE      Alternating Conditional Expectations
AFA      Abstract Factor Analysis
AI       Artificial Intelligence
AIC      Akaike's Information Criterion
AID      Automatic Interaction Detection
ANCOVA   ANalysis of COVAriance
ANN      Artificial Neural Network
ANOVA    ANalysis Of VAriance
AOQ      Average Outgoing Quality
AOQL     Average Outgoing Quality Limit
AQL      Acceptable Quality Level
AR       AutoRegressive model
ARIMA    AutoRegressive Integrated Moving Average model
ARL      Average Run Length
ARMA     AutoRegressive Moving Average model
ASN      Average Sample Number
BBM      Branch and Bound Method
BIBD     Balanced Incomplete Block Design
BLUE     Best Linear Unbiased Estimator
BR       Bayes' Rule
CARSO    Computer Aided Response Surface Optimization
CART     Classification And Regression Tree
CCA      Canonical Correlation Analysis
CCD      Central Composite Design
cdf      cumulative distribution function
CFA      Correspondence Factor Analysis
CFD      Complete Factorial Design
CLASSY   CLassification by Alloc and Simca Synergism
CTM      Classification Tree Method
CV       cross-validation
DA       Discriminant Analysis
DASCO    Discriminant Analysis with Shrunken COvariances
EDA      Exploratory Data Analysis
EFA      Evolving Factor Analysis
EMS      Expected Mean Squares in ANOVA
ER       Error Rate
ESE      Expected Squared Error
EVOP     Evolutionary Operation
FD       Factorial Design
FFD      Fractional Factorial Design
GA       Genetic Algorithm
gcv      generalized cross-validation
GLM      Generalized Linear Model
GLS      Generalized Least Squares regression
GOF      Goodness Of Fit
GOLPE    Generating Optimal Linear Pls Estimation
GOP      Goodness Of Prediction
GSA      Generalized Simulated Annealing
GSAM     Generalized Standard Addition Method
IC       Influence Curve
IE       Imbedded Error
IKSFA    Iterative Key Set Factor Analysis
ILS      Intermediate Least Squares regression
IRWLS    Iteratively Reweighted Least Squares regression
KNN      K Nearest Neighbors method
KSFA     Key Set Factor Analysis
LCL      Lower Control Limit
LDA      Linear Discriminant Analysis
LDCT     Linear Discriminant Classification Tree
LDF      Linear Discriminant Function
LDHC     Linear Discriminant Hierarchical Clustering
LLM      Linear Learning Machine
LMS      Least Median Square regression
LOO      Leave-One-Out cross-validation
LOWESS   Locally WEighted Scatter plot Smoother
LRR      Latent Root Regression
LS       Least Squares regression
LTPD     Lot Tolerance Percent Defective
LTS      Least Trimmed Squares regression
MA       Moving Average model
MAD      Mean Absolute Deviation
MADM     Median Absolute Deviation around the Median
MANOVA   Multivariate ANalysis Of VAriance
MARS     Multivariate Adaptive Regression Splines
MCDM     MultiCriteria Decision Making
MDS      MultiDimensional Scaling
MIA      Multivariate Image Analysis
MIF      Malinowski's Indicator Function
MIL-STD  MILitary STanDard table
ML       Maximum Likelihood
MLS      Multivariate Least Squares regression
MR       Misclassification Risk
MS       Mean Squares in ANOVA
MSE      Mean Square Error
MSS      Model Sum of Squares
MST      Minimal Spanning Tree
NER      Non-Error Rate
NIPALS   Nonlinear Iterative PArtial Least Squares
NLM      NonLinear Mapping
NLPLS    NonLinear Partial Least Squares regression
NMC      Nearest Means Classification
NMDS     Non-metric MultiDimensional Scaling
NN       Neural Network
OCC      Operating Characteristic Curve
OLS      Ordinary Least Squares regression
OR       Operations Research
OVAT     One-Variable-At-a-Time
PARC     PAttern ReCognition
PC       Principal Component
PCA      Principal Component Analysis
PCP      Principal Component Projection
PCR      Principal Component Regression
pdf      probability density function
PFA      Principal Factor Analysis
PLS      Partial Least Squares regression
PP       Projection Pursuit
PPR      Projection Pursuit Regression
PRESS    Predictive Residual Sum of Squares
PRIM     Pattern Recognition by Independent Multicategory Analysis
PSE      Predictive Squared Error
QC       Quality Control
QDA      Quadratic Discriminant Analysis
QPLS     Quadratic Partial Least Squares regression
RDA      Regularized Discriminant Analysis
RE       Real Error
RMS      Residual Mean Square
RMSD     Root Mean Square Deviation
RMSE     Root Mean Square Error
RR       Ridge Regression
RSD      Residual Standard Deviation
RSE      Residual Standard Error
RSM      Response Surface Methodology
RSS      Residual Sum of Squares
SA       Simulated Annealing
SAM      Standard Addition Method
SC       Sensitivity Curve
SDEC     Standard Deviation of Error of Calculation
SDEP     Standard Deviation of Error of Prediction
SEM      Standard Error of the Mean
SIMCA    Soft Independent Modeling of Class Analogy
SLC      Standardized Linear Combination
SMA      Spectral Map Analysis
SMART    Smooth Multiple Additive Regression Technique
SPC      Statistical Process Control
SPLS     Spline Partial Least Squares regression
SS       Sum of Squares in ANOVA
SVD      Singular Value Decomposition
SWLDA    StepWise Linear Discriminant Analysis
SWR      StepWise Regression
TDIDT    Top-Down Induction Decision Tree
TSA      Time Series Analysis
TSS      Total Sum of Squares
TTFA     Target Transformation Factor Analysis
UCL      Upper Control Limit
UNEQ     UNEQual covariance matrix classification
VIF      Variance Inflation Factor
VSGSA    Variable Step size Generalized Simulated Annealing
WLS      Weighted Least Squares regression
WNMC     Weighted Nearest Means Classification
XE       eXtracted Error
Notation

                                   TOTAL    INDEX
n. objects                           n      i, s, t
n. variables                         p      j
n. responses                         r      k
n. components, factors               M      m
n. groups, classes, clusters         G      g

                                   MATRIX   ELEMENT
variable                             X      xij
response                             Y      yik
error                                E      eik
coefficient                          B      bkj
component                            T      tim
factor                               F      fim
loading                              L      ljm
eigenvector                          U      ujm
eigenvalue                           Λ      λm
communality
variance/covariance                  S      sjk
correlation                          R      rjk
distance/dissimilarity               D      dst
similarity                           S      sst
association                          A      ast
hat matrix                           H      hii
total scatter matrix                 T
within scatter matrix                W
between scatter matrix               B
identity matrix                      I

For matrix X:
- determinant   |X|
- inverse       X⁻¹
- trace         tr(X)
- transpose     Xᵀ

For random variable x:
- expected value
- estimate
- variance
- bias
- probability density function
- cumulative distribution function
- probability
- kernel
- quantile

For variable xj:
- mean
- standard deviation
- lower value
- upper value

For group g:
- number of objects
- prior probability
- density function
- centroid
- covariance matrix
A

▸ α error [TEST] → hypothesis testing
▸ A-optimal design [EXDE] → design (○ optimal design)
▸ Abelson-Tukey's test [TEST] → hypothesis test
▸ absolute error [ESTIM] → error
▸ absolute moment [PROB] → moment
▸ abstract factor [FACT] : common factor
▸ abstract factor analysis (AFA) [FACT] : factor analysis
▸ acceptable quality level (AQL) [QUAL] → producer's risk
▸ acceptable reliability level [QUAL] → producer's risk
▸ acceptance control chart [QUAL] → control chart (○ modified control chart)
▸ acceptance error [TEST] → hypothesis testing
▸ acceptance line [QUAL] → lot
▸ acceptance number [QUAL] → lot
▸ acceptance sampling [QUAL]
Procedure for providing information for judging a lot on the basis of inspecting only a (usually) small subset of the lot. Its purpose is to sentence (accept or reject) lots, not to estimate the lot quality. Acceptance sampling is used instead of inspecting the whole lot when testing is destructive, or when 100% inspection is very expensive or not feasible. Although sampling reduces cost, damage, and inspection error, it increases type I and type II errors and requires planning and documentation. The samples should be selected randomly and the items in the sample should be representative of all of the items in the lot. Stratification of the lot is a commonly applied technique. The specification of the sample size and of the acceptance criterion is called sampling plan. The two most popular tables of standards are the Dodge-Romig tables and the military standard tables. Acceptance sampling plans can be classified according to the quality characteristics or the number of samples taken.
attribute sampling
Sampling in which the lot sentencing is based on the number of defective items found in the sample. The criterion for accepting a lot is defined by the acceptance number.
chain sampling
Alternative to single sampling when the testing is very expensive and destructive, the acceptance number is zero, and the OC curve is convex, i.e. the lot acceptance drops rapidly as the defective lot fraction becomes greater than zero. This sampling makes use of cumulative results of several preceding lots. The lot is accepted if the sample has zero defective items, and rejected if the sample has more than one defective item. If the sample has one defective item the lot is accepted only if there were no defective items in a predefined number of previous lots. Chain sampling makes the shape of the OC curve more desirable near its origin.
continuous sampling
Sampling for continuous production, when no lots are formed. It alternates 100% inspection with acceptance sampling inspection. The process starts with 100% inspection and switches to sampling inspection once a prespecified number of conforming items are found. Sampling inspection continues until a certain number of defective items are reached.
double sampling
Sampling in which the lot sentencing is based on two samples. First an initial sample is taken and, on the basis of the information from that sample, the lot is either accepted or rejected, or a second sample is taken. If a second sample is taken the final decision is based on the combined information from the two samples. A double sampling plan for attributes is defined by four parameters: n1, size of the first
sample; n2, size of the second sample; a1, acceptance number of the first sample; a2, acceptance number of the second sample. First a random sample of n1 items is inspected. If the number of defective items d1 is less than or equal to a1 the lot is accepted on the first sample. If d1 is greater than a2, the lot is rejected on the first sample. If d1 is between a1 and a2, a second random sample of size n2 is inspected resulting in d2 defective items. If d1 + d2 is less than or equal to a2, then the lot is accepted, otherwise rejected.
lot-plot method
Variable sampling plan that uses the frequency distribution estimated from the sample to sentence the lot. It can also be used for nonnormally distributed quality characteristics. Ten random samples, each of five items, are usually used to construct the frequency distribution and to establish upper and lower lot limits. The lot-plot diagram, which is the basis of lot sentencing, is very similar to an average chart.
multiple sampling
Extension of the double sampling in which the lot sentencing is based on the combined information from several samples. A multiple sampling plan for attributes is defined by the number of samples, their size, and an acceptance number and a rejection number for each sample. If the number of defective items dj in any of the samples j is less than or equal to the acceptance number aj of that sample, the lot is accepted. If dj equals or exceeds the rejection number rj of that sample, the lot is rejected; otherwise the next sample is taken.
sequential sampling
Multiple sampling in which the number of samples is determined by the results of the sampling itself. Samples are taken one after another and inspected. On the basis of the result of the inspection a decision is made on whether the lot is accepted, rejected or another sample must be drawn. The sample size is often one.
single sampling
Lot sentencing on the basis of one single sample. A single sample plan for attributes is defined by the sample size n and the acceptance number a. From a lot n items are selected at random and inspected. If the number of defective items d is less than or equal to a the lot is accepted, otherwise rejected. The distribution of d is binomial with parameters n and Pd (defective fraction in the lot).
skip-lot sampling
Sampling when only some fraction of the lots is inspected. It is used only when the quality is known to be good. It can be viewed as a continuous sampling applied to lots instead of individual items. When a certain number of lots are accepted, the inspection switches to skipping, i.e. only a fraction of the lots are inspected. When a lot is rejected the inspection returns to normal lot-by-lot inspection.
variable sampling
Sampling in which the lot sentencing is based on a measurable quality characteristic, usually on their sample average and their sample standard deviation. Variable sampling generally requires a smaller sample size than attribute sampling for the same protection level. Numerical measurements of the quality characteristics provide more information than attribute data. Most standard plans require a normal distribution of the quality characteristics. A separate sampling plan must be employed for each quality characteristic inspected. There are two types of variable sampling: sampling that controls the defective lot fraction and sampling that controls a lot parameter (usually the mean).
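The double sampling decision rule described above can be written as a short program. The following is a minimal sketch, not part of the original handbook; the plan parameters (n1, n2, a1, a2) and the defect counts are hypothetical examples.

```python
def double_sampling_sentence(d1, d2_fn, a1, a2):
    """Sentence a lot under a double sampling plan for attributes.

    d1     -- defective items found in the first sample of size n1
    d2_fn  -- callable returning defectives in a second sample (drawn only if needed)
    a1, a2 -- acceptance numbers of the first and second sample (a1 <= a2)
    """
    if d1 <= a1:
        return "accept"            # accepted on the first sample
    if d1 > a2:
        return "reject"            # rejected on the first sample
    d2 = d2_fn()                   # inspect a second sample of size n2
    return "accept" if d1 + d2 <= a2 else "reject"

# Hypothetical plan: n1 = 50, n2 = 100, a1 = 1, a2 = 3
print(double_sampling_sentence(d1=2, d2_fn=lambda: 1, a1=1, a2=3))  # accept, since 2 + 1 <= 3
```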
▸ action limits [QUAL] → control chart
▸ adaptive kernel density estimator [ESTIM] → estimator (○ density estimator)
▸ adaptive smoother [REGR] → smoother
▸ adaptive spline [REGR] → spline
▸ addition of matrices [ALGE] → matrix operation
▸ additive design [EXDE] → design
▸ additive distance [GEOM] → distance
▸ additive inequality [GEOM] → distance
▸ additive model [MODEL] → model
▸ adequate subset [MODEL] → model
▸ adjusted R² [MODEL] → goodness of fit
▸ adjusted residual [REGR] → residual
▸ admissibility properties [CLUS] → assessment of clustering
▸ admissible estimator [ESTIM] → estimator
▸ affinity [PROB] → population
▸ agglomerative clustering [CLUS] → hierarchical clustering
▸ agreement coefficient [DESC] → correlation
▸ Ajne's test [TEST] → hypothesis test
▸ Akaike's information criterion (AIC) [MODEL] → goodness of prediction
▸ algebra [ALGE] : linear algebra
▸ algorithm [OPTIM]
A set of rules, instructions, or formulas for performing a (usually numerical) calculation or for solving a problem. A description of calculation steps in a form suitable for computer implementation. It does not provide theoretical background, motivation or justification.
▸ alias [EXDE] → confounding
▸ alias structure [EXDE] → confounding
▸ alienation coefficient [MODEL] → goodness of fit
▸ all possible subsets regression [REGR] → variable subset selection
▸ ALLOC [CLAS] → potential function classifier
▸ alternating conditional expectations (ACE) [REGR]
Nonparametric nonlinear regression model of the form

g(y) = t1(x1) + t2(x2) + ... + tp(xp) + e
The functions tj and g, called transformation functions, are smooth but otherwise unrestricted functions of the predictors and response variables. They replace the regression coefficients in a linear regression model. The ACE functions are obtained by smoothers which are estimated by least squares using an iterative algorithm. The variance of function tj indicates the importance of the variable j in the model: the higher the variance the more important is the variable. In contrast to parametric models, the ACE model is defined as a set of point pairs [xij, tj(xij), j = 1, p] and [yi, g(yi)], not in closed (or analytical) form. The nonlinear transformations are analyzed and interpreted by plotting each variable against its transformation function. If the goal is to calculate a predictive model, then the response function should be restricted to linearity. The prediction in ACE is a two step procedure. First the transformation function values are looked up in the function table, usually by calculating a linear interpolation between two values. In the second step the function values are summed to obtain a predicted response value.
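The two-step ACE prediction just described (table lookup with linear interpolation, then summation) can be sketched as follows. This is an illustrative sketch, not the handbook's own code; the transformation tables are assumed to have been produced beforehand by an ACE fit.

```python
import numpy as np

def ace_predict(x_new, transform_tables):
    """Predict the transformed response g(y) for one new object.

    x_new            -- 1D array of predictor values, length p
    transform_tables -- list of (xj_grid, tj_values) pairs, one per predictor,
                        i.e. the point pairs [x_ij, t_j(x_ij)] stored by ACE
    """
    return sum(np.interp(xj, grid, tj)           # linear interpolation in the table
               for xj, (grid, tj) in zip(x_new, transform_tables))

# Hypothetical tables for p = 2 predictors
tables = [(np.array([0.0, 1.0, 2.0]), np.array([-0.5, 0.0, 0.8])),
          (np.array([10.0, 20.0, 30.0]), np.array([1.2, 0.4, -0.1]))]
print(ace_predict(np.array([1.5, 25.0]), tables))
```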
▸ alternative hypothesis [TEST] → hypothesis testing
▸ analysis of covariance (ANCOVA) [ANOVA]
An extension to the analysis of variance, where the responses are influenced not only by the levels of the various effects (introduced in the design) but also by other measurable quantities, called covariates, or concomitant variables. The covariates cannot usually be controlled, but can be observed along with the response. The analysis of covariance involves adjusting the observed response for the linear effect of the covariates. This procedure is a combination of analysis of variance and regression. The total variance is decomposed into variances due to effects, to interactions, to covariates and a random component. The simplest ANCOVA model of n observations contains the grand mean ȳ, one effect A with I levels, one covariate x with mean x̄ and an error term e:

y_ik = ȳ + A_i + b(x_ik − x̄) + e_ik        i = 1, I    k = 1, K    n = IK

where b is a linear regression coefficient indicating the dependence of the response y on the covariate x; i is the index of levels of A, k is the index of observations per level. It is assumed that the error terms per level are normally distributed with common variance σ², and that b is identical for all levels. The null hypothesis H0: b = 0 can
be tested by an F statistic that compares the sum of squares due to the covariate with the error mean square MSE, calculated as:

MSE = SSE / [I(K − 1) − 1]        SSE = Σ_i Σ_k (y_ik − ŷ_ik)²

The numerator has one degree of freedom and the denominator has I(K − 1) − 1 degrees of freedom.
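A compact numerical sketch of this one-effect, one-covariate ANCOVA follows (illustrative only, with made-up data; it is not code from the handbook). The common slope b is estimated from the within-level sums of products and the F ratio for H0: b = 0 is formed from the covariate sum of squares and MSE.

```python
import numpy as np

def ancova_one_way(y, x, level):
    """One-effect ANCOVA with a single covariate and a common slope b."""
    levels = np.unique(level)
    I, n = len(levels), len(y)
    K = n // I                                     # observations per level (balanced case)
    Exx = Exy = 0.0
    for g in levels:                               # within-level sums of products
        yg, xg = y[level == g], x[level == g]
        Exx += np.sum((xg - xg.mean()) ** 2)
        Exy += np.sum((xg - xg.mean()) * (yg - yg.mean()))
    b = Exy / Exx                                  # common regression coefficient
    sse = 0.0
    for g in levels:                               # residuals about adjusted level means
        yg, xg = y[level == g], x[level == g]
        sse += np.sum((yg - yg.mean() - b * (xg - xg.mean())) ** 2)
    df_err = I * (K - 1) - 1
    mse = sse / df_err
    F = (b ** 2 * Exx) / mse                       # 1 and I(K-1)-1 degrees of freedom
    return b, F, df_err

rng = np.random.default_rng(0)
level = np.repeat([0, 1, 2], 10)
x = rng.normal(size=30)
y = 1.0 + 0.5 * level + 2.0 * x + rng.normal(scale=0.3, size=30)
print(ancova_one_way(y, x, level))
```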
▸ analysis of experiment [EXDE] → experimental design
▸ analysis of variance (ANOVA) [ANOVA] (: variance analysis)
Statistical technique for analyzing observations that depend on the simultaneous operation of a number of effects. The total variance, expressed as the sum of squares of deviations from the grand mean, is partitioned into components corresponding to the various sources of variation present in the data. The goal is to estimate the magnitude of the effects and their interactions, and to decide which ones are significant. The data matrix contains one continuous response variable and one or more categorical predictor variables representing the various effects and their interactions. For hypothesis testing it is assumed that the model error is normally and independently distributed with mean zero and variance σ², i.e. it is constant throughout the level combinations. The analysis of variance model is a linear equation in which the response is modeled as a function of the grand mean, the effects and the interactions, called terms in ANOVA. The model also contains an error term. It is customary to write the model with indices indicating the levels of effects and indices of observations made under the same level combination. The results of the analysis are collected in the analysis of variance table. The simplest ANOVA model, called one-way analysis of variance model, contains only one effect A of I levels, each level having K observations:

y_ik = ȳ + A_i + e_ik        i = 1, I    k = 1, K    n = IK
The simplest randomized block design has one effect A of I levels and one blocking variable B of J levels:

y_ij = ȳ + A_i + B_j + e_ij        i = 1, I    j = 1, J    n = IJ

The crossed two-way analysis of variance model contains two effects, A of I levels and B of J levels, and their interaction AB of IJ levels. Each level combination contains K observations:

y_ijk = ȳ + A_i + B_j + AB_ij + e_ijk        i = 1, I    j = 1, J    k = 1, K    n = IJK

The two-stage nested analysis of variance model contains effect A of I levels, effect B of J levels nested under each level of A. There is no interaction between A and B:

y_ijk = ȳ + A_i + B_j(i) + e_ijk        i = 1, I    j = 1, J    k = 1, K    n = IJK

The model in which every level of an effect occurs in association with every level of another effect is called a crossed model. The model that contains all main effect terms and all the interaction terms that compose the interaction term of the highest order in the model is called a hierarchical model. The model in which no level of a nested effect occurs with more than one level of the effect in which it is nested is called a nested model. Nested models are further characterized by their number of effects, e.g. a two-stage nested ANOVA model. The mixed effect model, in which the random crossed effects terms have to sum to zero over subscripts corresponding to fixed effects terms, is called a restricted model. An ANOVA model that does not contain all possible crossed effect terms that can be constructed from the main effect terms is called a reduced model. Analysis of covariance and multivariate analysis of variance are extensions of ANOVA.
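The sketch below illustrates the one-way model above: the total sum of squares is split into a between-level (effect) part and a within-level (error) part, and the F ratio tests whether effect A is significant. It is an illustrative example with made-up data, not code from the handbook.

```python
import numpy as np

def one_way_anova(groups):
    """groups -- list of 1D arrays, one per level of effect A."""
    all_y = np.concatenate(groups)
    grand_mean = all_y.mean()
    n, I = all_y.size, len(groups)
    ss_a = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)   # effect sum of squares
    ss_e = sum(((g - g.mean()) ** 2).sum() for g in groups)             # error sum of squares
    ms_a, ms_e = ss_a / (I - 1), ss_e / (n - I)
    return ms_a / ms_e, (I - 1, n - I)        # F ratio and its degrees of freedom

rng = np.random.default_rng(1)
levels = [rng.normal(mu, 1.0, size=8) for mu in (0.0, 0.5, 2.0)]
print(one_way_anova(levels))
```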
▸ analysis of variance table [ANOVA]
Summary table of analysis of variance in which the columns contain the following ANOVA results: degrees of freedom, sum of squares, mean square, F ratio or expected mean square. The rows of the table correspond to the terms in ANOVA, the last two rows usually correspond to the error term and the total. For example, the table of a random two-way ANOVA model with n observations is:

Term     df               SS      Mean Square                     F Ratio
A        I - 1            SSA     MSA = SSA/(I - 1)               MSA/MSAB
B        J - 1            SSB     MSB = SSB/(J - 1)               MSB/MSAB
AB       (I - 1)(J - 1)   SSAB    MSAB = SSAB/[(I - 1)(J - 1)]    MSAB/MSE
Error    IJ(K - 1)        SSE     MSE = SSE/[IJ(K - 1)]
Total    n - 1            SST

Another example is the ANOVA table of a linear regression model with four predictor variables and 30 observations:

Term     df    SS     Mean Square     F Ratio
Model     4    SSM    MSM = SSM/4     MSM/MSE
Error    25    SSE    MSE = SSE/25
Total    29    SST

where SSM is the model sum of squares, SSE is the error (residual) sum of squares and SST is the total sum of squares.
▸ analytical rotation [FACT] → factor rotation
▸ Anderson's classification function [CLAS] → discriminant analysis
▸ Andrews' Fourier-type plot [GRAPH] (: harmonic curves plot)
Graphical representation of multivariate data. Each p-dimensional object i (i.e. each row of the data matrix) is represented by a curve of the form:

f_i(t) = x_i1/√2 + x_i2 sin(t) + x_i3 cos(t) + x_i4 sin(2t) + ...
This plot is a powerful tool for visual outlier detection.
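A minimal sketch of such a curve for one object follows (illustrative only; the object values are hypothetical).

```python
import numpy as np

def andrews_curve(x, t):
    """Andrews' Fourier-type curve f_i(t) for one object x = (x1, ..., xp)."""
    f = np.full_like(t, x[0] / np.sqrt(2.0))
    for j, xj in enumerate(x[1:], start=1):
        k = (j + 1) // 2                        # harmonic: sin(t), cos(t), sin(2t), cos(2t), ...
        f += xj * (np.sin(k * t) if j % 2 == 1 else np.cos(k * t))
    return f

t = np.linspace(-np.pi, np.pi, 200)
curve = andrews_curve(np.array([1.0, -0.3, 0.7, 2.1]), t)   # one 4-dimensional object
```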
▸ Andrews-Pregibon statistic [REGR] → influence analysis
▸ angular-linear transformation [PREP] → transformation
▸ angular variable [PREP] → variable
▸ animation [GRAPH] → interactive computer graphics
▸ Ansari-Bradley's test [TEST] → hypothesis test
▸ Anscombe residual [REGR] → residual
▸ AP statistic [REGR] → influence analysis
▸ appraisal cost [QUAL] → quality cost
▸ arithmetic mean [DESC] → location
▸ arithmetic scale [PREP] → scale
▸ Armitage's test [TEST] → hypothesis test
artificial intelligence (AI)[MISC] The study of mental faculties through the use of computational models. The term artificial intelligence is used to underline the capability of a machine to formally simulate some aspects of human intelligence, such as visual perception, problem solving, learning, inference, speech recognition, language understanding, game playing, or theorem proving. Because of the intractability of such problems, artificial intelligence systems must include some heuristic knowledge to make the problems computationally feasible. Many programming environments (e.g. PROLOG, LISP) were developed to facilitate writing software for solving such problems, in order to simulate humans (e.g. computational psychology) or to have technological applications (e.g. expert systems, machine learning, robotics). Machine learning is a subdiscipline ofAI. It is a collection of techniques devoted to make a machine learn by simulating the inductive processes used by the human brain in learning strategies; e.g. learning from examples, from analogies, from instructions. These techniques also try to simulate all the processes for extracting regularities and rules from a large amount of data. b
▸ artificial neural network (ANN) [MISC] : neural network
▸ assessment of clustering [CLUS]
While there are numerous criteria for the assessment and comparison of regression and classification models, there are few measures for comparing clustering procedures and evaluating the resulting partitions. Clustering results depend strongly on the standardization of the variables and on the distance or similarity measure chosen. The goal of clustering is uncovering rather than imposing structure on the data. The type I error made by clustering is to miss a structure, while the type II error is to find clusters that are not there.
admissibility properties
Various desirable properties of clustering are defined and procedures are characterized as satisfying those properties or not. A clustering procedure is called A-admissible if it satisfies property A. As an example, four agglomerative hierarchical linkages are evaluated as admissible or nonadmissible in the following list of admissibility properties.
○ cluster omission admissibility
Removing all objects from one of the clusters obtained and repeating the clustering results in the same partition obtained previously minus the empty cluster. Single, complete, Ward and average linkages are cluster omission admissible.
○ convex admissibility
The convex hulls of the clusters do not intersect. Ward linkage is convex admissible; single, complete and average linkages are not.
○ monotone admissibility
Applying a monotone transformation to the similarity matrix does not change the resulting partition. Single and complete linkages are monotone admissible; Ward and average linkages are not.
○ point proportion admissibility
Duplicating one or more objects any number of times and repeating the clustering results in the same cluster boundaries as those obtained on the original data set. Single and complete linkages are point proportion admissible; Ward and average linkages are not.
○ well-structured admissibility
All within-group distances are smaller than all between-group dissimilarities. Single, complete and average linkages are well-structured admissible; Ward linkage is not.
measure of distortion Regarding clustering as a transformation of the observed between-object distances (or similarities), various measures are defined to describe the relationship between the observed distances and the transformed distances. These measures indicate how well the data fit the structure resulting from the cluster analysis. They can be used for testing a hypothesis of clustering tendency. stability of clustering Desirable property of a clustering procedure, being only little affected by small changes in the data and by addition of some new objects or variables.
▸ association [DESC]
Interdependence between two variables; in a stricter sense, between two binary variables, in contrast to correlation which measures interdependence between quantitative variables. The association coefficient is a distance measure among binary variables.
▸ association analysis [CLUS] → hierarchical clustering (○ divisive clustering)
▸ association coefficient [GEOM] → distance (○ binary data)
▸ association matrix [DESC]
Square matrix of order n defined as the product of a data matrix X(n, p) with its transpose:

A = X Xᵀ
asymmetric classification [CLAS]
+ binary classification b asymmetric design [EXDE] + design
asymmetric distribution [PROB] + random variable b
b
asymmetric matrix [ALGE]
+ matrix b
-+
asymmetric test [TEST] hypothesis testing
asymmetry [DESC] + skewness b
b asymptotic distribution [PROB] + random variable
asymptotic efficiency [ESTIM] + eficiency b
b
asymptotic Mahalanobis distance [GEOM]
+ distance (n ranked data)
autocovariancefunction [TIME]
b
asymptotic normal distribution [PROB] random variable (0 asymptotic distribution)
asymptotically efficient estimator [ESTIM] + estimator b
asymptotically unbiased estimator [ESTIM] + estimator b
b
attenuation [DESC]
+ correlation b
attribute [PREP]
: variable
attribute control chart [QUALI + controlchart b
b attribute sampling [QUALI + acceptance sampling b autocorrelation coefficient [TIME] + autocovariance function
autocorrelation function [TIME] + autocovariance function b
b autocorrelogram [TIME] + autocovariance function
b autocovariance coefficient [TIME] + autocovariance function b autocovariance function [TIME] Function of a stochastic process x (t) defined as:
v(t, s) = cov [x
[(m- P@))(x(d - P W ) I
where CL (t) = E [x(t)1 is the mean function. The autocorrelationfunction is defined as:
13
14
automatic interaction detection
(AID) [CLAS]
Together with the autoregressive model, the autocorrelation function of a stationary time series provides information necessary to control the statistical timedependent fluctuations of a process. If the process is stationary, y ( f , s , does not depend on f , only on f - s = t. In this case p(f) = p, the value y ( 0 ) is the vanance of x(t), and the autocorrelation function is defined as:
The t t h autocovariance coefficient estimated from a sample of size 11. assuming stationarity, is: n
[x(f) - LL) (x(f -
Y (t)=
- p ) ]/ n
t=r+l
and the t t h sample autocorrelation coefficient is:
The plot of p ( t ) against t is called a correlogram or autocornlogram. This plot is used to check whether there is evidence of any serial dependence in an observed time series. In the case of unequally spaced time series an empirical variogram can be constructed. This is the plot of
for all 0.5 n(n − 1) distinct pairs of observations.
▸ automatic interaction detection (AID) [CLAS]
One of the earliest classification tree methods for analyzing large data sets (n > 1000). It divides the objects into disjoint exhaustive subsets in order to achieve optimal classification on the basis of a given set of categorized predictors. AID, similar to CART, calculates the classification rule in the form of a binary decision tree. However, AID does not provide proper pruning steps to grow an "honest" tree. Its stopping rule is based on measures calculated by resubstitution, and so the estimated error rate is overly optimistic.
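The sample autocovariance and autocorrelation coefficients defined in the autocovariance function entry above can be computed as in the following sketch (illustrative only, using a simulated series; this is not code from the handbook).

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocovariance gamma(tau) and autocorrelation rho(tau) of a series x."""
    x = np.asarray(x, dtype=float)
    n, mu = len(x), x.mean()
    gamma = np.array([np.sum((x[tau:] - mu) * (x[:n - tau] - mu)) / n
                      for tau in range(max_lag + 1)])
    return gamma, gamma / gamma[0]              # gamma[0] is the sample variance

rng = np.random.default_rng(2)
e = rng.normal(size=500)
x = np.empty(500)
x[0] = e[0]
for t in range(1, 500):                         # simulated AR(1) process with coefficient 0.7
    x[t] = 0.7 * x[t - 1] + e[t]
gamma, rho = sample_acf(x, max_lag=10)          # rho[1] should be close to 0.7
```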
b
autoregressive integrated moving average model (ARIMA) [TIME] time series model
+
autoregressive model (AR) [TIME] + time series model b
b
-+
autoregressive moving average model (ARMA) “l‘IME1 time series model
axialpoint [EXDE] b autoscaled variable [PREP] + variable
autoscaling [PREP] + standardization b
average absolute deviation [DESC] + dispersion b
b
average chart [QUAL] (0 variable control chart)
+ control chart b
average eigenvalue criterion [FACT]
+ rankanalysis
average Euclidean distance [GEOM] + distance (0 quantitative data) b
average linkage [CLUS] + hierarchical clustering (0 agglomerative clustering) b
b average Manhattan distance [GEOM] + distance (0 quantitative data)
b
average outgoing quality (AOQ) [QUAL]
+ lot b
average outgoing quality limit (AOQL) [QUAL]
+ lot b
average probability for proper class [CLAS]
+ classification b average run length (ARL)[QUAL] + control chart b average sample number (ASN) [QUAL] + lot
axial design [EXDE] + design
b
b
axial point [EXDE] (0 composite design)
+ design
15
16 axis[EXDE] b
axis [EXDE]
+ design
(0
axial design)
B p error [TEST] + hypothesis testing b
b
backpropagation network [MISC]
+ neural network b back-substi tution [ALGE] + Gaussian elimination
backward elimination [REGR] -+ variable subset selection b
b balanced factorial design [EXDE] + design b balanced incomplete block design (BIBD) [EXDE] + design (o randomized block design)
b
Ball and Hall clustering [CLUS] ( 0 optimization clustering)
+ nonhierarchical clustering bar chart [GRAPH] : barplot
b
b bar plot [GRAPH] (.- bar chart) Graphical display of the frequency distribution of a categorical variable. It is a set of adjoining vertical bars of varying height, each reflecting the frequency of occurrence of the corresponding value of the categorical variable. The width of the bars is equal but arbitrary. The histogram is a bar plot, applied to a discretized continuous variable, in which each bar corresponds to a range of values, hence their order is fixed. b Barnard’s test [TEST] + hypothesis test
Bayes’ theorem [PROB] 17 b Bartlett’s interaction test [TEST] + hypothesis test
-
b
Bartlett’s sphericity test [TEST] hypothesis test
Bartlett’s test [TEST] + hypothesis test b
b barycenter [DESC] + location
b
basis [ALGE] orthonoimal bash
b
basis function [REGR]
+ spline
batch [QUALI : lot
b
b
Bayes’ optimal error rate [CLAS]
+ Bayes’ rule b
Bayes’ rule (BR) [CLAS]
Classification rule that assigns an object to class g with the highest posterior class probability P(g | x), calculated as:

P(g | x) = Pg fg(x) / Σ_k Pk fk(x)
where Pg is the prior class probability and fg(x) is the class density function of class g. The associated probability of misclassification is known as the Bayes' optimal error rate. It is the minimum error rate that can be achieved in a given problem using any classification method.
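A minimal sketch of this rule is shown below, using Gaussian class density functions as an example. The densities and priors are illustrative assumptions; the handbook does not prescribe a particular density model here.

```python
import numpy as np

def bayes_classify(x, priors, density_fns):
    """Assign x to the class g that maximizes the posterior P_g * f_g(x)."""
    scores = np.array([p * f(x) for p, f in zip(priors, density_fns)])
    posteriors = scores / scores.sum()
    return int(np.argmax(posteriors)), posteriors

def gaussian(mu, sigma):
    return lambda x: np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Two hypothetical classes with prior probabilities 0.3 and 0.7
label, post = bayes_classify(1.2, priors=[0.3, 0.7],
                             density_fns=[gaussian(0.0, 1.0), gaussian(2.0, 1.0)])
print(label, post)
```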
Bayes’ theorem [PROB]
Relates the posterior probability of a parameter θ given observed values x to its prior probability. When x has probability density function f(x | θ) and P(θ) is the prior probability density function of θ, the posterior probability density function of θ is

P(θ | x) = f(x | θ) P(θ) / ∫ f(x | θ) P(θ) dθ
18
Bayesian estimator [ESTIM]
This posterior probability is called Bayesian probability. The Bayes’ theorem, also called Baves’ rule, is the basis of the Bayesian estimators. For example, it is used in several classification methods. b
Bayesian estimator [ESTIM]
-+ estimator b
Bayesian forecasting [TIME]
: state-space model
b
Bayesian probability [PROB] Bayes’ theorem
b
Behrens-Fisher’s test [TEST] hypothesis test
-+
b
bell-shaped distribution [PROB] random variable
--3
Beran’s test [TEST] + hypothesis test b
b
Bernoulli distribution [PROB] distribution
b Bernoulli process [TIME] + stochastic process (0 counting process) b
best linear unbiased estimator (BLUE) [ESTIM] estimator
-+
b
best subset regression [REGR]
: variable subset selection b
beta distribution [PROB] distribution
-+
between-centroid covariance matrix [DESC] + covariance matrix b
between-group covariance matrix [DESC] + covariance matrix b
bidiagonalization [ALGE]
19
b Bhattacharyya distance [GEOM] + distance (a ranked data) b bias [ESTIM] (: systematic distortion) Source of error that cannot be reduced by increasing the sample size. It is systematic as opposed to random error. There are three sources of bias. Statistical bias is due to the specific algebraic form of an estimator. Selection bias occurs when the sample is not properly drawn from the population. Measurement bias is the constant portion of the experimental error. Bias is the difference between the expected estimated value and the true value of a parameter:
B = E[θ̂] − θ
Bias squared is one of the two components of the mean square error. Unbiased estimators calculate estimates with zero bias, e.g. ordinary least squares gives an unbiased estimate of the regression coefficients. Biased estimators result in nonzero bias of the estimate, but also (usually) in smaller variance.
biased estimator [ESTIM]
+ estimator b
biased model [MODEL]
+ model b biased regression [REGR] Regression method (usually linear) that attempts to decrease the mean square error of the model with respect to the least squares model by reducing the complexity (number of parameters) of the regression model on the basis of outside information. As the complexity is reduced the variance component of the mean square error also decreases but the bias component may increase. The key point of all biased regression methods is to find the optimal complexity, i.e. the optimal trade-off between bias and variance to achieve the minimum mean square error. These regression methods are especially appropriate in case of strong collinearity among predictors and low observatiodvariable ratio. The most popular methods are variable subset selection, ridge regression, principal component regression, partial least squares regression, latent root regression.
biased test [TEST] + hypothesis testing b
b
bidiagonal matrix [ALGEI
-+ matrix b bidiagonalization [ALGEI + matrix decomposition
20
bimodal distribution [PROB]
bimodal distribution [PROB] + random variable b
▸ binary classification [CLAS] (: dichotomous classification)
Classification that can separate only two classes. Examples are linear learning machine and Fisher's discriminant analysis. Also, each nonterminal node in a classification tree method represents a binary classification. Asymmetric classification is a special case of the two class problem. Only one of the two classes (class A) can be modeled; the other class is the whole complementary measurement space, denoted as not A. This classification poses the question of whether an object belongs to class A or not.
[Figure: scatter of objects in which the modeled class A is enclosed by a boundary, surrounded by objects of the complementary class "not A".]
An example is the asymmetric SIMCA method, in which objects are classified based
on their distance from the principal component model of class A. b
binary decision tree [CLAS]
+ classijication tree method
binary tree searching [CLAS] + classijication tree method b
b
binary variable [PREP]
-+ variable b
binomial distribution [PROB]
+ distribution
b
binormamin rotation [FACT] factor rotation
biplot [GRAPH] + scatter plot b
block generator [EXDE]
21
b biquartimax rotation [FACT] -+ factor rotation b biquartimin rotation [FACT] + factor rotation
bivariate data [PREP] -+ data b
▸ biweight [ESTIM]
M estimator that calculates a robust estimate using a weighting scheme. For example, a robust location estimator is calculated as

x̄_w = Σ_i w_i(u_i) x_i / Σ_i w_i(u_i)

where each observation x_i is weighted by

w_i(u_i) = (1 − u_i²)²    if |u_i| ≤ 1
w_i(u_i) = 0              if |u_i| > 1

The weights depend on

u_i = (x_i − x̄) / (c s)        c = 6

where s is a robust scale estimate, for example the quartile deviation. Because x̄ depends on the weights and the weights depend on x̄, the procedure is iterative, starting with the mean as initial estimate.
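An iterative sketch of this biweight location estimate is given below (illustrative, not the handbook's code). As in the entry, the quartile deviation is used as the robust scale s and c = 6.

```python
import numpy as np

def biweight_location(x, c=6.0, n_iter=20):
    """Iteratively reweighted biweight estimate of location."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    s = (q3 - q1) / 2.0                       # quartile deviation as robust scale
    center = x.mean()                         # start from the ordinary mean
    for _ in range(n_iter):
        u = (x - center) / (c * s)
        w = np.where(np.abs(u) <= 1.0, (1.0 - u ** 2) ** 2, 0.0)
        center = np.sum(w * x) / np.sum(w)
    return center

data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 25.0])    # one gross outlier
print(biweight_location(data))                          # close to 10, unlike the mean
```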
biweight kernel [ESTIM] -+ kernel b
b block [EXDE] + blocking
block clustering [CLUS] + cluster analysis b
b
-+
block design [EXDE] design
b block generator [EXDE] + blocking
22 b
block size [EXDE]
block size [EXDE]
→ blocking
▸ blocking [EXDE]
Assignment of runs into blocks. A block is a portion of the experimental material which is expected to be more homogeneous than the entire set. The block size is the number of runs per block. If the block size is equal to the number of treatments, the design has complete blocks; if the block size is less than the number of treatments, the design is composed of incomplete blocks. In the latter case some interaction effects are indistinguishable from the blocking effect. Blocking of runs is obtained with the use of block generators. These are interaction columns in the design matrix, such that combinations of their levels determine the blocking of runs. These interaction columns are confounded with the blocking variable. For example, one interaction column can generate two blocks: runs having the − sign in the column belong to the first block and runs having the + sign to the second block. Two interaction columns are needed to generate four blocks: level combination −− indicates the first block, −+ the second block, +− the third block, and ++ the fourth block. In general 2^N blocks can be generated according to the level combination of N interaction columns. Blocking is a technique for dealing with inhomogeneity of runs and for increasing the precision of an experiment. By confining treatment comparisons within blocks, greater precision in estimating effects can often be obtained. Blocking is done to control and eliminate the effect of a nuisance factor described by the blocking variable. The goal is to separate the effect of blocks from the effects of factors. The assumption is that blocks and factors are additive, i.e. there are no interactions among them. When blocking is used the experimental error is estimated from within-block comparisons rather than from comparisons among all the experimental units. Blocking is the best design strategy only if the within-block variability is smaller than the between-block variability.
+
b
blocking variable [PREP] variable
-F
b Bonferroni index [DESC] -+ skewness
b
bootstrap [MODEL]
+ model validation b
boundary design [EXDE] design
--j
++
+
Box-Draper design [EXDE]
23
b bounded influence regression [REGR] + robust regression b box plot [GRAPH] (.- box-and-whiskersplot) Graphical summary of a univariate distribution. The bulk of the data is represented as a rectangle with the lower and the upper quartiles being the bottom and the top of the rectangle, respectively, and the median is portrayed by a horizontal line within the rectangle. The width of the box has no meaning. Dashed lines, called whiskers, extend from the ends of the box to the adjacent values. The upper adjacent value is equal to the upper quartile plus 1.5 times the inter-quartile range. The lower adjacent value is defined as the lower quartile minus 1.5 times the inter-quartile range. Outliers, i.e. values outside the adjacent values, are plotted as individual points above and below the adjacent value line segments.
[Figure: box plots of three variables x1, x2, x3, with the median, lower and upper quartiles, adjacent values and outliers labelled.]
The box plot easily reveals asymmetry, outliers, and heavy tails of a distribution. Displaying several box-plots side by side gives a graphical comparison of the corresponding distributions. To emphasize the relative locations, box plots can be drawn with notches in their sides, called notched box plot. The formula for calculating notch lengths is based on a formal hypothesis test of equal location. b
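The quantities drawn in a box plot can be computed as in the following sketch, using the adjacent-value definition given in the entry (an illustrative example with hypothetical data, not code from the handbook).

```python
import numpy as np

def box_plot_stats(x):
    """Median, quartiles, adjacent values and outliers of one variable."""
    x = np.asarray(x, dtype=float)
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    upper_adj = q3 + 1.5 * iqr                 # adjacent values as defined in the entry
    lower_adj = q1 - 1.5 * iqr
    outliers = x[(x > upper_adj) | (x < lower_adj)]
    return {"median": med, "q1": q1, "q3": q3,
            "lower_adjacent": lower_adj, "upper_adjacent": upper_adj,
            "outliers": outliers}

print(box_plot_stats([2.1, 2.4, 2.2, 2.8, 2.5, 2.6, 9.0]))
```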
box-and-whiskers plot [GRAPH]
: box plot
Box-Behnken design [EXDE] + design b
b
-+
Box-Draper design [EXDE] design
24
Box-Jenkins model [TIME]
b Box-Jenkins model [TIME] + time series model
b
BOX’Stest [TEST]
+ hypothesis test b branch and bound method (BBM) [OPTIM] Algorithm proposed for solving a general linear problem in which some of the variables are continuous and some may take only on integer values. BBM is a partial enumeration technique in which the set of solutions to a problem is examined by dividing this set into smaller subsets. By using some user-defined decision criterion it can be shown that some of these subsets do not contain the optimal solution and can thus be eliminated. A solution subset can be defined by calculating its bound. The bound of a subset is a value that is less than or equal to the value of all the solutions in the subset. It is sometimes possible to eliminate a solution subset from further consideration by comparing its bound with a solution which has already been found. The solution subset contains no better solution if the bound is larger than the value of the known best solution. In searching a solution subset, if one of its solutions is better than the best one already known, this newly found solution replaces the best one previously found. If a solution subset cannot be eliminated, it must be split into smaller subsets using a branching criterion. This criterion attempts to divide the solution subset into two subsets in such a way that one of the subsets will almost certainly not contain the optimal solution. b
Bray-Curtis coefficient [GEOM] (0 quantitative data)
+ distance
b breakdown point [ESTIM] Characteristic of an estimator that measures its robustness. It is calculated as the percentage of outliers or contamination that can cause the estimator to take arbitrary large values, i.e. to break down. In other words, the breakdown point is the distance from the assumed distribution beyond which the statistic becomes totally unreliable and uninformative. Ideally the breakdown point is 50%, as in the case of the median, i.e. the majority of the observations can overrule the minority. In contrast, the breakdown point of the mean is O%, which indicates total nonrobustness. In regression the breakdown point of the least squares estimator is 0%, indicating extreme sensitivity to outliers, while the least median squares regression or the least trimmed squares regression have an optimal breakdown point of 50%. b
Brier score [CLAS]
+ classification
calibration [REGR] b
25
Brownian motion process [TIME] ( 0 Wiener-Levy process)
+ stochastic process b
Brunk’s test [TEST]
+ hypothesis test b brushing [GRAPH] + interactive computer graphics
(0
connecting plots)
C b c-chart [QUAL] + control chart (o attribute control chart)
b
calibration [REGR]
Special regression problem in which the goal is to predict a value of a fixed variable x from an observed value of a random variable y on the basis of the regression model
y=f(x)+f By contrast, in the regular regression problem the above regression model is used to predict y. For example, one of the most important problems in analytical chemistry is to predict chemical concentration x from measured analytical signal y on the basis of the calibration function. If there is only one x variable and one y variable the calibration is called a univariate calibration. The case when there is a vector of x to be predicted from a vector of y is called a multivariate calibration: y = f (x) + i
The most common calibration model is linear and additive:
where each row contains a calibration sample, a row of U contains the r responses from multiple sensors, while a row of X contains the p component concentrations; r 2 p. S, the regression coefficient matrix, is called the calibration matrix and contains the partial sensitivities or pure spectral values. IE is the error term. The solution for S is:
s = (XT X)-1
XT Y
26
Calkoun distance [GEOM]
and the estimated concentrations i are obtained by measuring the corresponding response vector and solving: ..I
y = STx
and
i = (S ST)-' S y
This case, in which all concentrations x are known, is called direct calibration (or total calibration). It is a two-step procedure: first S is calculated from the calibration data set in which all component concentrations are known, then the concentrations are predicted in the unknown mixture. If not all component concentrations are known the inverse calibration (also called indirect calibration) is used, where the concentration of one component is taken as a function of the responses. x = f (y) +2
The linear model is:
x=Ys+e The regression coefficients s can be estimated from calibration data set in which only the concentration of the component of interest is known. The solution for the inverse calibration is:
i=(YTYppX
&yTi
As in a regular regression problem, the parameters can be estimated by ordinary least squares or by a biased estimator, e.g. PCR,or PLS.
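The direct and inverse calibration estimates described in this entry can be written compactly with ordinary least squares, as in the sketch below. This is an illustrative sketch with simulated data; the variable names are ours, and biased estimators such as PCR or PLS could replace the plain least squares steps.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, r = 20, 2, 4                               # calibration samples, components, responses
X = rng.uniform(0.1, 1.0, size=(n, p))           # known component concentrations
S_true = rng.uniform(0.5, 2.0, size=(p, r))      # "pure" sensitivities
Y = X @ S_true + 0.01 * rng.normal(size=(n, r))  # measured responses

# Direct (total) calibration: estimate S, then solve for the unknown concentrations
S = np.linalg.lstsq(X, Y, rcond=None)[0]         # S = (X'X)^-1 X'Y
y_new = np.array([0.4, 0.7]) @ S_true            # response vector of an "unknown" mixture
x_hat = np.linalg.lstsq(S.T, y_new, rcond=None)[0]

# Inverse (indirect) calibration for one component of interest
s_inv = np.linalg.lstsq(Y, X[:, 0], rcond=None)[0]   # s = (Y'Y)^-1 Y'x
x1_hat = y_new @ s_inv

print(x_hat, x1_hat)
```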
Calkoun distance [GEOM] + distance (0 ranked data) b
b
Canberra distance [GEOM]
+ distance (o quantitative data) b canonical analysis [ALGE] + quadratic form b canonical correlation [MULT] + canonical correlation analysis
b canonical correlation analysis (CCA) [MULT] (: correlation analysis) Analysis of interrelationship between two sets of variables x and y. CCA examines the inter-correlation between two sets of variables, as opposed to factor analysis which is concerned with the intra-correlation in one single set. In contrast to regression analysis, CCA does not assume causal relationship between the two sets and treats them symmetrically. Geometrically, CCA investigates the extent to which
category [CLAS] 27
objects have the same relative position in the two measurement spaces x and y; i.e. to what extent do the two sets of variables describe the objects the same way? The procedure starts with calculating an overall correlation matrix R that is composed of four submatrices: Rxx is the correlation matrix of the x variables, Ryy is the correlation matrix of the y variables, and Rxy = Ryxᵀ contains the cross-correlations between elements of x and y.
R = [ Rxx  Rxy ; Ryx  Ryy ]
► canonical form [ALGE] → quadratic form
► canonical root [MULT] → canonical correlation analysis
► canonical variate [MULT] → canonical correlation analysis
► Capon’s test [TEST] → hypothesis test
► categorical classification [CLAS] → probabilistic classification
► categorical data [PREP] → data
► categorical variable [PREP] → variable
► category [CLAS] : class
► Cauchy distribution [PROB] → distribution
► causal model [MODEL] → model
► cause variable [PREP] → variable
► cause-effect diagram [GRAPH] (: fishbone diagram, Ishikawa diagram)
Versatile tool used in manufacturing and service industries to reveal the various sources of nonconformity of a product and to analyze their interrelationships.

[Figure: fishbone diagram; main causes (e.g. methods, parameters, samples, solvents) and their sub-causes branch from a horizontal main line that ends in the effect.]

The following steps are taken to construct such a diagram:
- choose the effect (product) to be studied and write it at the end of a horizontal main line;
- list all the main causes that influence the effect under study and join them to the main line;
- arrange and rank other specific causes and sub-causes, joining them to the lines of the main causes;
- check the final diagram to make sure that all known causes of variation are included.
► cell [EXDE] → factor
► censored data [PREP] → data
► center [DESC] → location
► center point [EXDE] → design (composite design)
► centering [PREP] → standardization
► central composite design (CCD) [EXDE] → design (composite design)
► central limit theorem [PROB]
Theorem in statistics that points out the importance of the normal distribution. According to the theorem, the distribution of the sum of j = 1, p variables with means μ_j and finite variances σ_j^2 tends to the normal distribution with mean Σ_j μ_j and variance Σ_j σ_j^2 as p goes to infinity. For example, the distribution of the sample mean approaches the normal distribution as the sample size increases, regardless of the population distribution.
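A small simulation illustrating the theorem, assuming NumPy (the uniform distribution and the sample sizes are illustrative choices): means of n uniform draws approach a normal distribution with mean 0.5 and standard deviation √(1/(12 n)).

    import numpy as np

    rng = np.random.default_rng(2)
    # Means of n uniform(0, 1) draws: strongly non-normal for n = 1,
    # approximately normal for large n
    for n in (1, 2, 10, 100):
        means = rng.uniform(0, 1, size=(50_000, n)).mean(axis=1)
        print(n, means.mean(), means.std(), (1 / (12 * n)) ** 0.5)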
► central moment [PROB] → moment
► central tendency [DESC] : location
► centroid [DESC] → location

► centroid classification [CLAS] (: nearest means classification)
Parametric classification method in which the classification rule is based simply on the Euclidean distances from the class centroids calculated from the training set. If a class is represented by only one training object, the method is called prototype classification. The method assumes equal spherical covariance matrices in each class. When the class weights are inversely proportional to the average class variance of the variables, the method is known as weighted nearest means classification (WNMC).
Pattern recognition by independent multicategory analysis (PRIMA) is a variation of centroid classification in which the coordinates in the Euclidean distance are weighted in inverse proportion to the variance of the corresponding variable in the corresponding class.
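A minimal sketch of nearest-means classification with an optional inverse-variance weighting in the spirit of WNMC/PRIMA (assuming NumPy; the weighting shown is illustrative rather than the exact published algorithms):

    import numpy as np

    def fit_centroids(X, y):
        classes = np.unique(y)
        cents = {g: X[y == g].mean(axis=0) for g in classes}
        vars_ = {g: X[y == g].var(axis=0, ddof=1) for g in classes}   # per-class variances
        return cents, vars_

    def predict(X, cents, vars_=None):
        labels = list(cents)
        dists = []
        for g in labels:
            diff = X - cents[g]
            w = 1.0 / vars_[g] if vars_ is not None else 1.0          # inverse-variance weights
            dists.append(np.sqrt((w * diff ** 2).sum(axis=1)))
        return np.array(labels)[np.argmin(np.array(dists), axis=0)]

    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
    y = np.repeat([0, 1], 30)
    cents, vars_ = fit_centroids(X, y)
    print((predict(X, cents) == y).mean())          # unweighted (plain nearest means)
    print((predict(X, cents, vars_) == y).mean())   # variance-weighted variant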
► centroid distance [GEOM] → distance
► centroid linkage [CLUS] → hierarchical clustering (agglomerative clustering)
► centrotype [CLUS] → cluster
► chain sampling [QUAL] → acceptance sampling
► chance correlation [DESC] → correlation
► characteristic [PREP] : variable
► characteristic polynomial [ALGE] → eigenanalysis
► characteristic root [ALGE] → eigenanalysis
► characteristic vector [ALGE] → eigenanalysis
► Chebyshev distance [GEOM] → distance (quantitative data)
► Chernoff face [GRAPH] → graphical symbol
► chi squared distance [GEOM] → distance (quantitative data)
► chi squared distribution [PROB] → distribution
► chi squared test [TEST] → hypothesis test
► Cholesky factorization [ALGE] → matrix decomposition
► chromosome [MISC] → genetic algorithm
► circular data [PREP] → data
► circular histogram [GRAPH] → histogram
► circular variable [PREP] → variable
► city-block distance [GEOM] → distance (quantitative data)
► class [CLAS] (: category)
Distinct subspace of the whole measurement space, defined by a set of objects in the training set. The objects of a class have one or more characteristics in common, indicated by the same value of a categorical variable or by the same range of a continuous variable. The term class is mainly used in classification; however, it also appears, incorrectly, in cluster analysis as a synonym for cluster or group. Classes are often assumed to be mutually exclusive (not to overlap) and exhaustive (the sum of the class subspaces covers the whole measurement space).
► class boundary [CLAS] (: decision boundary, decision surface)
Multidimensional surface (hypersurface) that separates the classes in the measurement space. Some classification methods assume linear boundaries (e.g. LDA), others assume quadratic boundaries (e.g. QDA), or allow even more complex boundaries (e.g. CART, KNN, ALLOC). In a classification rule, class boundaries can be defined explicitly by equations (e.g. LDA) or implicitly by the training set (e.g. KNN).

► class box [CLAS] → class modeling method
► class density function [CLAS]
Density function describing the distribution of a single class population in a classification problem. In discriminant analysis the class density functions are
assumed to be normal. In histogram classification the class density functions are estimated by univariate histograms, assuming uncorrelated variables. In potential function classifiers the class density functions are calculated as the average of the object kernel densities. The classification rule assigns an object to the class with the largest density function at that point.

► class modeling method [CLAS]
Classification method that defines closed class boundaries for each class in the training set. The subspace separated by class boundaries is called the class box. The density function of a class outside its class box is assumed to be zero, so objects that lie outside all of the class boxes are classified as unknown. The class density function can be uniform inside its class box (SIMCA), normal (QDA), a sum of individual kernels (ALLOC, CLASSY), etc. The shape of a class box can be a hypersphere centered around the class centroid (PRIMA), a hyperellipsoid (UNEQ), a hyperbox (SIMCA), etc.
► class prediction [CLAS] → classification
► class recognition [CLAS] → classification

► classification [CLAS] (: discrimination)
Assignment of objects to one of several classes based on a classification rule. Classification is also called supervised pattern recognition, as opposed to unsupervised pattern recognition, which refers to cluster analysis. The classes are defined a priori by groups of objects in the training set. Separation of only two classes is called binary classification. The goal is to calculate a classification rule and class boundaries based on the training set objects of known classes and to apply this rule to objects of unknown classes. An object is usually assigned to the class with the largest class density function at that point. Based on how they estimate the class density functions and the classification rule, the methods are called parametric classification (DASCO, LDA, NMC, PRIMA, RDA, QDA, SIMCA, UNEQ) or nonparametric classification. The latter group contains classification tree methods (AID, CART, LDCT), potential function classifiers (ALLOC, CLASSY), histogram classification and KNN. Methods that define closed class boundaries are called class modeling methods. Classification that assumes uncorrelated variables, e.g. histogram classification, is called independence classification. There are several measures for evaluating the performance of a classification method. They are based on assessing the misclassification, i.e. the assignment of objects to other than their true class. The loss matrix and the prior class probability are often incorporated in such measures. Some of the measures are calculated for probabilistic classification, others can also be applied for categorical classification.
One should distinguish between measures calculated from the training set with resubstitution, measures calculated from the training set with cross-validation, and measures calculated from the test set. The first group assesses the class recognition of a method, i.e. how well the objects of known classes in the training set are classified. The other two groups estimate the class prediction of a method, i.e. how well objects of unknown classes will be classified. In the following equations the notation is:

i = 1, n       object index
g, g' = 1, G   class index
n_g            number of objects in class g
c_gg           number of correctly classified objects in class g
c_gg'          number of objects of class g incorrectly classified into class g'
p_ig           probability of object i belonging to its proper class g
p_ig'          probability of object i belonging to another class g'

average probability for proper class
Measure for probabilistic classification:

Q1 = Σ_i p_ig / n
Brier score (: quadratic score)
Measure for probabilistic classification:

B = Σ_i Σ_g' (p_ig' - δ_ig')^2 / n

where δ_ig' = 1 if g' is the true class of object i and δ_ig' = 0 otherwise. A modified version of the Brier score is also used.
confusion matrix (: misclassification matrix)
Matrix containing categorical classification results. The rows of the matrix correspond to true classes and the columns to predicted classes.

                    predicted class
                    1st     2nd     ...    Gth
  1st true class    c_11    c_12    ...    c_1G
  2nd true class    c_21    c_22    ...    c_2G
  ...               ...     ...     ...    ...
  Gth true class    c_G1    c_G2    ...    c_GG
The number of correctly classified objects in each class appears on the main diagonal, while the numbers of misclassified objects are the off-diagonal elements.
The off-diagonal element c_gg' is the number of objects that belong to class g but are classified into class g'. In the case of perfect classification the off-diagonal elements c_gg' are all zero and the diagonal elements are c_gg = n_g.
error rate (ER)
Measure for categorical classification, expressed as the fraction of incorrectly classified objects:

ER% = Σ_g Σ_{g'≠g} c_gg' / n        0 ≤ ER% ≤ 1

The no-model error rate is calculated for the assignment of all objects to the largest class, containing n_M objects (n_M ≥ n_g), without using any classification model. It serves as a common denominator in comparing classification methods.

NOMER% = (n - n_M) / n        0 ≤ NOMER% ≤ 1

The complementary quantity, called the non-error rate (NER), is the fraction of correctly classified objects:

NER% = Σ_g c_gg / n = 1 - ER%        0 ≤ NER% ≤ 1

Depending on how the error rate is calculated, one should distinguish between the true error rate, the conditional error rate and the unconditional error rate. The true error rate, which comes from classification rules with known (not estimated) parameters, serves for the theoretical comparison of classification methods. The conditional error rate is calculated from, and is relevant to, only one particular training set, while the unconditional error rate is the expected value of the conditional error rate over all possible training sets. The conditional error rate is of interest in practical problems.
misclassification matrix : confusion matrix

misclassification risk (MR)
Measure for categorical classification:

MR% = Σ_g [ Σ_g' c_gg' L_gg' / n_g ] P_g

where P_g is the prior class probability and L_gg' is an element of the loss matrix.

quadratic score : Brier score
reliability score
Measure for probabilistic classification:

Q3 = Q1 - Q2        -1 ≤ Q3 ≤ (G - 1)/(4G)

where Q1 and Q2 are the average probability for proper class and the sharpness of classification, respectively.

sensitivity of classification
Measure for categorical classification, the non-error rate for class g:

SN%_g = c_gg / n_g        0 ≤ SN%_g ≤ 1

Sensitivity equals 1 if all objects of true class g are classified into class g.

sharpness of classification
Measure for probabilistic classification:

Q2 = Σ_i Σ_g' p_ig'^2 / n        1/G ≤ Q2 ≤ 1

specificity of classification
Measure for categorical classification indicating the purity of class g:

SP%_g = c_gg / Σ_g' c_g'g        0 ≤ SP%_g ≤ 1

Specificity equals 1 if only objects of true class g are assigned to class g.
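A small sketch computing the categorical measures defined above (confusion matrix, error rate, non-error rate, no-model error rate, sensitivity, specificity) from true and predicted labels, assuming NumPy and illustrative label vectors:

    import numpy as np

    def classification_measures(y_true, y_pred, G):
        C = np.zeros((G, G), dtype=int)           # confusion matrix: rows = true, cols = predicted
        for t, p in zip(y_true, y_pred):
            C[t, p] += 1
        n = C.sum()
        ner = np.trace(C) / n                     # non-error rate
        er = 1 - ner                              # error rate
        nomer = 1 - C.sum(axis=1).max() / n       # no-model error rate
        sensitivity = np.diag(C) / C.sum(axis=1)  # c_gg / n_g
        specificity = np.diag(C) / C.sum(axis=0)  # purity: c_gg / (objects assigned to g)
        return C, er, ner, nomer, sensitivity, specificity

    y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
    y_pred = np.array([0, 0, 1, 1, 1, 2, 0, 2])
    print(classification_measures(y_true, y_pred, 3))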
► classification and regression trees (CART) [CLAS]
Classification tree method that constructs a binary decision tree as a classification rule. Each node t of the tree is characterized by a single predictor variable j(t) and by a threshold value s_j(t) for that variable, and represents a question of the form: x_j(t) ≤ s_j(t)? If the answer is yes, the next question is asked at the left son node, otherwise at the right son node. Starting at the root node with an object vector, the tree is traversed sequentially in this manner until a terminal node is reached. Associated with each terminal node is a class, to which the object is assigned. Given the training set, CART constructs the binary decision tree in a forward-backward stepwise manner. The size of the tree is determined on the basis of cross-validation, which ensures a good class prediction. The CART classification rule has a simple form that is easy to interpret, yet it takes into account that different relationships may hold among variables in different parts of the data. CART is scale invariant, extremely robust with respect to outliers, and performs automatic stepwise variable selection.
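CART itself is a published algorithm with dedicated software; as a hedged illustration, scikit-learn's DecisionTreeClassifier builds a closely related CART-style binary tree, with the tree size chosen here by cross-validation (assuming scikit-learn is available; the data set and parameters are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = load_iris(return_X_y=True)
    # choose the tree depth by 5-fold cross-validation
    scores = {d: cross_val_score(DecisionTreeClassifier(max_depth=d, random_state=0),
                                 X, y, cv=5).mean() for d in (1, 2, 3, 4, 5)}
    best_depth = max(scores, key=scores.get)
    tree = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X, y)
    print(export_text(tree))   # each node is a "feature <= threshold" question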
► classification by ALLOC and SIMCA synergism (CLASSY) [CLAS]
Classification method that combines the potential function classifier (ALLOC) and SIMCA. First, as in SIMCA, principal component models for each class are calculated and the optimal number of components is determined. A normal distribution is assumed outside the class box, just as in SIMCA. Inside the class box, however, kernel density estimators of the principal components are used to calculate the class density functions. The optimal parameter, which determines the smoothness of the class density function, is estimated with a leave-one-out modification of the maximum likelihood procedure.
► classification power [CLAS] → classification rule

► classification rule [CLAS] (: discriminant rule)
Rule calculated from the training set to determine the class to which an object is assigned. The classification rule may be expressed as a mathematical equation with few parameters, as in discriminant analysis, or may consist of only a set of threshold values of the variables, as in CART, or it may include all objects of the training set, as in KNN. Many classification rules are based on the Bayes' rule. The importance of a variable in a classification rule is called classification power or discrimination power. For example, in linear discriminant analysis the discriminant weight measures the classification power of the corresponding variable.

► classification tree method (CTM) [CLAS] (: top-down induction decision tree method)
Nonparametric classification method that generates the classification rule in the form of a binary decision tree.
In such a tree each nonterminal node is a binary classifier. Traversing the tree from the root to the leaves and classifying objects is called binary tree searching. Examples are: AID, CART, LDCT. Many expert systems are also based on binary decision trees, e.g. EX-TRAN.
► closed data [PREP] → data
► CLUPOT clustering [CLUS] → non-hierarchical clustering (density clustering)
► cluster [CLUS]
Distinct group of objects that are more similar to each other than to objects outside the group. Once a cluster is found, one can usually find a unifying property characterizing all the objects in the cluster. A cluster is represented by its seed point; this can be a representative object of the cluster, called a centrotype, or a calculated center point, called a centroid. Two desirable properties of a cluster are internal cohesion and external isolation; a good cluster scores high on both. The internal cohesion of a cluster g can be measured by the cluster diameter, defined as the maximum distance between any two of its objects, {max[d_st], s, t ∈ g}, or by the cluster radius, defined as the maximum distance between an object and its centroid or centrotype, {max[d(x_i, c_g)], i ∈ g}. Cluster connectedness is another measure of internal cohesion. The external isolation of a cluster is quantified in hierarchical clustering by its moat.
[Figure: a well isolated cluster A, and a cluster B nested in cluster C.]
Geometrically, a cluster is a continuous region of a high-dimensional space containing a relatively high density of points (objects), separated from other clusters
by regions containing a relatively low density of points. For example, clusters can be well isolated spherical clusters (cluster A) or may be nested within each other (clusters B and C). Clusters are easy to detect visually in two or three dimensions. In higher dimensions, however, cluster analysis must be applied.

► cluster analysis [CLUS] (: numerical taxonomy)
Set of multivariate exploratory methods which try to find clusters in high-dimensional space, based on some similarity criterion among the objects (or variables). Sometimes, incorrectly, cluster analysis is referred to as classification. Classification, also called supervised pattern recognition, means the assignment of objects to predefined groups, called classes. In contrast, the goal of cluster analysis, also called unsupervised pattern recognition, is to define the groups, called clusters, such that the within-group similarity is higher than the between-group similarity. Cluster analysis often precedes classification; the former explores the groups, the latter confirms them. Frequently a partition of objects obtained from cluster analysis satisfies the following rules: each object belongs to one and only one cluster, and each cluster contains at least one object. Fuzzy clustering results in partitions which are not restricted by the above rules. Cluster analysis problems are usually not well defined, and often have no unique solution. The resulting partition greatly depends on the method used, on the standardization of the variables and on the measure of similarity chosen. It is important to evaluate the partition obtained, i.e. to perform assessment of clustering, and to verify clustering results with supervised techniques. The clustering tendency may be calculated by the Hopkins’ statistic. Clustering objects is called Q-analysis, while clustering variables is called R-analysis. The former often starts with a distance matrix calculated between objects, the latter with the correlation matrix of variables. Block clustering (also called two-way clustering) groups both objects and variables simultaneously, resulting in rectangular blocks with similar elements in the data matrix. The partition is obtained by minimizing the within-block variance.
CLUSTER ANALYSIS
  Hierarchical Clustering
    - agglomerative clustering
    - divisive clustering
  Non-hierarchical Clustering
    - density clustering
    - graph theoretical clustering
    - optimization clustering
Clustering methods are divided into two major groups: hierarchical clustering and non-hierarchical clustering methods. Each group is further divided into subgroups. Clustering based on several variables is called polythetic clustering, as opposed to monothetic clustering, which considers only one variable at a time.
► cluster connectedness [CLUS]
Measure of internal cohesion of a cluster, defined for hierarchical agglomerative clustering. It is a range-scaled count of the edges e present in the graph theoretical representation of the obtained cluster g containing n_g objects:

c = [ e - (n_g - 1) ] / [ 0.5 n_g (n_g - 1) - (n_g - 1) ]

By definition the connectedness of a cluster obtained by complete linkage is 1, while the connectedness of a cluster resulting from single linkage can be as low as 0. The number of edges e in a cluster of n_g objects lies between n_g - 1 and 0.5 n_g (n_g - 1).

► cluster diameter [CLUS] → cluster
► cluster omission admissibility [CLUS] → assessment of clustering (admissibility properties)
► cluster radius [CLUS] → cluster
► cluster sampling [PROB] → sampling
► coded data [PREP] → data
► coefficient of determination [MODEL] → goodness of fit
► coefficient of nondetermination [MODEL] → goodness of fit
► coefficient of skewness [DESC] → skewness
► coefficient of variation [DESC] → dispersion
► Cochran’s Q test [TEST] → hypothesis test
► Cochran’s test [TEST] → hypothesis test
► collinearity [MULT] (: multicollinearity)
Approximate linear dependence among variables. In the case of perfect collinearity a set of coefficients b (not all zero) can be found such that

X b = 0

Collinearity causes high variance in least squares estimates (e.g. covariance matrix, regression coefficients), resulting in instability of the estimated values, even wrong signs. Collinearity badly affects the modeling power in regression and classification but does not necessarily reduce the goodness of prediction of the model. Biased estimators mitigate the negative effects of collinearity. Collinearity is indicated by:
- values close to one in the off-diagonal elements of the correlation matrix;
- zero or close to zero eigenvalues calculated from the correlation matrix;
- a large condition number, i.e. a large ratio between the largest and the smallest eigenvalues calculated from the correlation matrix;
- a multiple correlation coefficient close to one, calculated by regressing one variable on all the others.
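A short sketch of the diagnostics listed above (correlation matrix, its eigenvalues and the condition number), assuming NumPy and illustrative, deliberately collinear data:

    import numpy as np

    rng = np.random.default_rng(4)
    x1 = rng.standard_normal(100)
    x2 = x1 + 0.01 * rng.standard_normal(100)   # nearly collinear with x1
    x3 = rng.standard_normal(100)
    X = np.column_stack([x1, x2, x3])

    R = np.corrcoef(X, rowvar=False)            # off-diagonal values near 1 flag collinearity
    eigvals = np.linalg.eigvalsh(R)             # near-zero eigenvalues flag collinearity
    condition_number = eigvals.max() / eigvals.min()
    print(R.round(3), eigvals.round(4), condition_number)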
► column vector [ALGE] → vector

► common factor [FACT] (: abstract factor, factor)
Underlying, non-observed, non-measured, hypothetical variable that contributes to the variance of at least two measured variables. In factor analysis the measured variables are linear combinations of the common factors plus a unique factor. In principal component analysis the unique factor is assumed to be zero, and the common factors, called principal components (PC) or components, are also linear combinations of the variables. The common factors are uncorrelated with the unique factors, and usually it is assumed that the common factors are uncorrelated with each other. Estimated common factors, called factor scores or scores, can be calculated as linear combinations of the variables using the factor score coefficients. These are linear coefficients calculated as the product of the factor loadings and the inverse of the covariance matrix or correlation matrix.
The complement quantity, called uniqueness, is the sum of squared unique factors. It measures the amount of variance which remains unaccounted for by the common factors. In principal component analysis, where the unique factors are assumed to be zero, the squared communalities are equal to the variance of the corresponding variable if M = p. In factor extraction techniques, in which an initial estimate of the communalities is needed, it is usually taken to be the squared multiple correlation coefficient: ..
where rJJ is the jth diagonal element of the inverse correlation matrix. A factor analysis solution that leads to the communality of some (scaled) variable being greater than one is called a Heywood case. This may arise because of sampling errors or because the factor model is inappropriate. b
► comparative experiment [EXDE] → experimental design
► complementary design [EXDE] → design
► complete block [EXDE] → blocking
► complete block design [EXDE] → design (randomized block design)
► complete factorial design (CFD) [EXDE] → design
► complete graph [MISC] → graph theory
► complete linkage [CLUS] → hierarchical clustering (agglomerative clustering)
► complete mixture [EXDE] → design (mixture design)
► complete pivoting [ALGE] → Gaussian elimination
► completely randomized design [EXDE] → design
► component [FACT] → common factor
► component analysis [FACT] : principal component analysis
► component of variance [ANOVA] → term in ANOVA
► composite design [EXDE] → design
► computer aided response surface optimization (CARSO) [REGR] → partial least squares regression

► computer intensive method [MODEL]
Statistical method that calculates models without making assumptions about the distribution of the underlying population. Such a method replaces theoretical analysis with a massive amount of calculation, feasible only on computers. Instead of focusing on parameters of the data that have a concise analytical form (e.g. mean, correlation, etc.), the properties are explored numerically, thus offering a wider array of statistical tools. A computer intensive method offers freedom from the constraints of traditional parametric theory, with its overreliance on a small set of standard models for which theoretical solutions are available. An example is the bootstrap technique, which can be used to estimate the bias and variance of an estimator (a small sketch is given below).

► concomitant variable [PREP] → variable
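A minimal bootstrap sketch, as mentioned in the computer intensive method entry above, estimating the bias and standard error of a sample statistic (assuming NumPy; the data and the choice of the median as statistic are illustrative):

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.exponential(scale=2.0, size=50)    # a skewed sample
    theta_hat = np.median(x)                   # statistic of interest

    B = 2000
    boot = np.array([np.median(rng.choice(x, size=x.size, replace=True)) for _ in range(B)])
    bias = boot.mean() - theta_hat             # bootstrap estimate of the bias
    std_error = boot.std(ddof=1)               # bootstrap estimate of the standard error
    print(theta_hat, bias, std_error)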
► condition number [ALGE] → matrix condition
► condition of a matrix [ALGE] : matrix condition
► conditional distribution [PROB] → random variable
► conditional error rate [CLAS] → classification (error rate)
► conditional frequency [DESC] → frequency
► conditional probability [PROB] → probability
► conditionally present variable [PREP] → variable
► confidence coefficient [ESTIM] → confidence interval
► confidence interval [ESTIM]
Interval between values t1 and t2 (called the lower and upper confidence limits), calculated as two statistics of a parameter θ given a sample, such that:

P(t1 ≤ θ ≤ t2) = α

The parameter α, called the confidence coefficient or confidence level, is the probability that the interval [t1, t2] contains the parameter θ. Assuming a normal distribution, confidence limits can be calculated from Student's t distribution.
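A minimal sketch of a confidence interval for a mean based on Student's t distribution (assuming NumPy and SciPy; the data and the confidence coefficient are illustrative):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(6)
    x = rng.normal(loc=10.0, scale=2.0, size=25)
    alpha = 0.95                                   # confidence coefficient
    n, mean, s = x.size, x.mean(), x.std(ddof=1)
    t_crit = stats.t.ppf(0.5 + alpha / 2, df=n - 1)
    half_width = t_crit * s / np.sqrt(n)
    print(mean - half_width, mean + half_width)    # lower and upper confidence limits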
► confidence level [ESTIM] → confidence interval
► confidence limit [ESTIM] → confidence interval
► configuration [GEOM] → geometrical concept
► confirmatory data analysis [DESC] → data analysis
► confirmatory factor analysis [FACT] → factor analysis
► confounding [EXDE]
Generating a design in which an interaction column or a block generator column is identical to another column. Such columns are called aliases and their effects cannot be estimated separately. It is a technique for the creation of blocking and fractional factorial designs. The rule is that factors, blocking variables and important interactions should be confounded only with unimportant interactions. A design generator is a relation that determines the levels of a factor in a fractional factorial design as an alias of an interaction. The levels of such a factor are calculated as the product of the levels of the factors in the interaction. For example, a 2^3 complete factorial design with 8 runs can be extended to a 2^(4-1) fractional factorial design by generating the fourth factor as D = ABC, i.e. calculating the levels of D in each run as the product of the levels of A, B and C (a short illustrative sketch is given at the end of this entry). Partial confounding means that different interactions are confounded in each replication. In contrast to total confounding, where the confounded effects cannot be estimated separately, here partial information is obtained. A confounding pattern, also called an alias structure, is a set of relations that indicates which interactions and factors are confounded, i.e. which are aliases in a design matrix. For example, the 2^(4-1) design with the design generator D = ABC has the following confounding pattern:

A = BCD    B = ACD    C = ABD    D = ABC
AB = CD    AC = BD    AD = BC

The confounding pattern of a design is determined by the defining relation. The defining relation contains the design generators and all the interactions whose levels are all +; these interactions can be generated as products of the design generators. The resolution of a two-level fractional factorial design is the length of the shortest word in the defining relation. Resolution minus one is the lowest order interaction that is confounded with a factor in the design. The resolution is conventionally denoted by a Roman numeral appended as a subscript, e.g. the design 2_V^(5-1) has resolution five. For example, the defining relation of the 2_III^(7-4) design with design generators

D = AB    E = AC    F = BC    G = ABC

is

I = ABD = ACE = BCF = ABCG = BCDE = ACDF = CDG = ABEF = BEG = AFG = DEF = ADEG = BDFG = CEFG = ABCDEFG

and it is a resolution III design, i.e. second-order interactions are confounded with factors.
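The 2^(4-1) example above can be sketched as follows (assuming NumPy; purely illustrative):

    import numpy as np
    from itertools import product

    # Full 2^3 design in factors A, B, C (levels -1/+1), extended by the generator D = ABC
    base = np.array(list(product([-1, 1], repeat=3)))   # 8 runs x 3 factors
    A, B, C = base.T
    D = A * B * C
    design = np.column_stack([A, B, C, D])
    print(design)

    # Alias check: the AB interaction column is identical to the CD column,
    # as predicted by the confounding pattern AB = CD
    print(np.array_equal(A * B, C * D))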
► confounding pattern [EXDE] → confounding
► confusion matrix [CLAS] → classification

► conjugate gradient optimization [OPTIM]
Gradient optimization that is closely related to variable metric optimization. It minimizes a quadratic function of the parameter vector p of the form:

f(p) = a + p^T b + 0.5 p^T X p
where X is a symmetric, positive definite matrix. Linearly independent descent directions d_i are calculated that are mutually conjugate with respect to X:

d_j^T X d_k = 0    for j ≠ k

The step taken in the ith iteration is:

p_i+1 = p_i + s_i d_i

If at each step the linear search that calculates the step size s_i is exact, then the minimum of f(p) is found in at most p steps. Mutually conjugate directions can be found according to the Fletcher-Reeves formula:

d_i+1 = -g_i+1 + [ (g_i+1^T g_i+1) / (g_i^T g_i) ] d_i

where g_i is the gradient direction.
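A minimal conjugate gradient sketch for a quadratic function, using exact line searches and the Fletcher-Reeves update (assuming NumPy; the matrix and vector are illustrative):

    import numpy as np

    def conjugate_gradient(Xmat, b, tol=1e-10):
        """Minimize f(p) = a + p'b + 0.5 p'Xp (the constant a does not affect the minimizer)."""
        p = np.zeros_like(b)
        g = b + Xmat @ p                        # gradient of f at p
        d = -g
        for _ in range(len(b)):                 # exact line searches: at most p steps
            s = -(g @ d) / (d @ Xmat @ d)       # exact step size for a quadratic
            p = p + s * d
            g_new = b + Xmat @ p
            if np.linalg.norm(g_new) < tol:
                break
            beta = (g_new @ g_new) / (g @ g)    # Fletcher-Reeves coefficient
            d = -g_new + beta * d
            g = g_new
        return p

    Xmat = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite
    b = np.array([-1.0, -2.0])
    print(conjugate_gradient(Xmat, b))          # compare with -np.linalg.solve(Xmat, b)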
► conjugate vectors [ALGE] → vector
► connected graph [MISC] → graph theory
► connecting plots [GRAPH] → interactive computer graphics
► consistent estimator [ESTIM] → estimator
► constrained optimization [OPTIM] → optimization
► consumer’s risk [QUAL] → producer’s risk
► contaminated distribution [PROB] → random variable
► contingency [DESC] → frequency
► contingency table [DESC] → frequency
► continuous distribution [PROB] → random variable
► continuous sampling [QUAL] → acceptance sampling
► continuous variable [PREP] → variable
► contour plot [GRAPH] → scatter plot
► contrast [ANOVA]
Linear combination of treatment totals in which the coefficients add up to zero. For example, in the one-way ANOVA model:

c = Σ_i c_i y_i        i = 1, I    with    Σ_i c_i = 0

Two contrasts with coefficients c_i and d_i are orthogonal if

Σ_i c_i d_i = 0

Contrasts are used in hypothesis testing to compare treatment means. A contrast can be tested by comparing its sum of squares to the error mean square. The resulting F ratio has one degree of freedom for the numerator and n - I degrees of freedom for the denominator, where I is the number of levels of the effect tested. The statistical method used to determine which treatment means differ from each other is called multiple comparison. The special case of making pairwise comparisons of means can be performed, for example, by the multiple t-test, Fisher’s least significant difference test, the Student-Newman-Keuls multiple range test, Duncan’s modified multiple range test, the Waller-Duncan Bayesian test, or Dunnett’s test. Scheffé’s test or Tukey’s test can be used to judge all the contrasts simultaneously. The linear dendrogram is a graphical tool for representing multiple comparisons.

► control chart [QUAL] (: quality control chart)
Graphical display of a quality characteristic, measured or computed from a sample, against the sample number or time. The control chart provides information about recently produced items and about the process, and helps to establish specifications or the inspection procedure. The use of control charts for monitoring the characteristics of a process in order to detect deviations from target values is called statistical process control (SPC). A control chart usually contains three horizontal lines: the center line represents the average value of the quality characteristic. The other two lines are called control limits. The uppermost line, called the upper control limit (UCL), and the lowermost line, called the lower control limit (LCL), indicate the region within
which nearly all the sample points fall when the process is in control. When a sample point falls outside this region, a corrective action is required. Sometimes two sets of control limits are used: the inner pair is called the warning limits and the outer pair the action limits. Warning limits increase the sensitivity of the control chart; however, they may be confusing to the operator. It is customary to connect the sample points for easier visualization of the sample sequence over time. The average number of data points plotted on a control chart before a change in the process is detected is called the average run length (ARL).
[Figure: Shewhart control chart; sample points plotted against time, with a center line, upper and lower warning limits, and upper and lower control limits; points within the limits indicate a process in control, a point beyond the control limits a process out of control.]
A control chart tests the hypothesis that the process is in control. Choosing the control limits is equivalent to setting up critical regions. If the control limits are defined from a chosen type I error probability, they are called probability limits. A point lying within the control limits is equivalent to accepting, while a point situated outside the control limits is equivalent to rejecting, the hypothesis of statistical control. A general model for a control chart can be given as:

UCL          = μ_w + k σ_w
center line  = μ_w
LCL          = μ_w - k σ_w

where w is a sample statistic for some quality characteristic, μ_w is its mean, σ_w its standard deviation, and k is a constant often chosen to be 3. This type of control chart is called a Shewhart chart. The list of the most important control charts follows.
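A minimal numerical sketch of the general Shewhart model above, applied to sample means (assuming NumPy; the simulated data and the way σ_w is estimated are illustrative):

    import numpy as np

    rng = np.random.default_rng(7)
    samples = rng.normal(loc=50.0, scale=2.0, size=(25, 5))   # 25 samples of n = 5
    w = samples.mean(axis=1)                                   # monitored statistic: sample mean
    # crude overall estimate of sigma_w; R-bar- or S-bar-based estimates are used in practice
    mu_w, sigma_w, k = w.mean(), samples.std(ddof=1) / np.sqrt(samples.shape[1]), 3
    UCL, LCL = mu_w + k * sigma_w, mu_w - k * sigma_w
    out_of_control = np.where((w > UCL) | (w < LCL))[0]
    print(LCL, mu_w, UCL, out_of_control)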
attribute control chart
Chart for controlling characteristics described by binary variables.
c-chart (: chart for nonconformities)
Displays the number of defects (nonconformities) per item (i.e. the sample size is 1), which is assumed to occur according to the Poisson distribution. An item is conforming or nonconforming depending on the number of nonconformities in it. The parameter of the distribution is c, with

μ = c    and    σ = √c

The model of this chart is:

UCL          = c + 3 √c
center line  = c
LCL          = c - 3 √c

Parameter c is estimated by c̄, the observed average number of nonconformities in a preliminary sample.

np-chart (: chart for number nonconforming)
Displays the number of defective (nonconforming) items D = np, which has a binomial distribution with parameter p, where n is the sample size and p the proportion of defective items in the sample. The three horizontal lines are calculated as:

UCL          = np + 3 √( np(1 - p) )
center line  = np
LCL          = np - 3 √( np(1 - p) )

Similar to the p-chart, p is estimated by p̄.

p-chart (: chart for fraction nonconforming)
Displays the fraction of defective (nonconforming) items p (p = D/n, where D is the number of nonconforming items and n the sample size). The fraction of nonconforming items has

μ = p    and    σ = √( p(1 - p)/n )

The model of this chart is:

UCL          = p + 3 √( p(1 - p)/n )
center line  = p
LCL          = p - 3 √( p(1 - p)/n )

Parameter p is estimated by p̄, the average of a few (usually 20-25) sample fractions of nonconforming items. This chart is often used in the case of variable sample size.
u-chart (: chart for nonconformities per unit)
Displays the average number of defects (nonconformities) in a sample unit of size larger than one item. The average number of nonconformities is u = d/n, where d is the total number of nonconformities in the sample. It has a Poisson distribution with

μ = u    and    σ = √(u/n)

The model of this chart is:

UCL          = u + 3 √(u/n)
center line  = u
LCL          = u - 3 √(u/n)

Similar to the c-chart, u is estimated by ū. This chart is often used in the case of variable sample size.

cusum chart (: cumulative sum chart)
Alternative to the Shewhart chart, used to control the process on the basis not of a single data point but of an entire sequence of data points. It plots the cumulative sums of the deviations of the sample values from a target value. For example, the quantity

S_m = Σ_i (x̄_i - μ)        i = 1, m

can be plotted against the sample number m, where x̄_i is the average of the ith sample and μ is the target mean. When the process is in control, the data points lie on an approximately horizontal line. However, when the process mean shifts upward or downward from the target value μ, a positive or negative drift develops, respectively. A V mask is used to assess whether the process is out of control or not. This is a V-shaped tool that is placed on the chart with its axis horizontal and its apex a distance d to the right of the last plotted point. A point outside the two arms of the V indicates that the process is out of control. The performance of the chart is determined by the parameters of the V mask: the distance d and the angle between the two arms.

modified control chart
Control chart for a process with variability much smaller than the spread between the specification limits. In this case the specification limits are far from the control limits. One is interested only in detecting whether the true process mean μ is located such that the fraction nonconforming is in excess of a specified value. When the sample size, the maximum fraction nonconforming and the type I error are specified, the chart is called an acceptance control chart.

moving average chart
Chart to correct for the Shewhart chart’s insensitivity to small shifts of the mean.
The model is:

UCL          = x̿ + 3 σ/√(nw)
center line  = x̿
LCL          = x̿ - 3 σ/√(nw)

where x̿ is the grand mean of the sample averages and w is the time span over which the sample means are averaged. The sample means can be averaged according to various weighting schemes.

variable control chart
Chart for controlling quality characteristics described by continuous or discrete variables. Depending on the chart, it controls either the mean or the variance of the characteristic. It is assumed that the distribution of the quality characteristic is normal. Estimates for μ and σ are usually calculated from 20-25 samples, each of n observations.

average chart : x̄ chart

R chart (: range chart)
Chart for controlling the process variance:

UCL          = R̄ + 3 σ_R
center line  = R̄
LCL          = R̄ - 3 σ_R

The range R is estimated by R̄, the average sample range. σ_R is estimated by s R̄ / m, where m is the mean and s is the standard deviation of the relative range R' = R/σ. Both parameters depend on the sample size.

range chart : R chart

S chart (: standard deviation chart)
Chart for controlling the process variance:

UCL          = S̄ + 3 σ_S
center line  = S̄
LCL          = S̄ - 3 σ_S

The standard deviation S is estimated by S̄, the average sample standard deviation. σ_S is estimated by S̄ √(1 - c^2) / c, where c is a constant depending on the sample size.
standard deviation chart : S chart

x̄ chart (: average chart)
Chart for controlling the central tendency of the process:

UCL          = μ + 3 σ/√n
center line  = μ
LCL          = μ - 3 σ/√n

μ is estimated by x̿, the mean of the sample averages; σ is estimated either by R̄/m from the sample range, with m being the mean of R' = R/σ, or by S̄/c from the sample standard deviation, with c being a constant that depends on the sample size.

► control limits [QUAL] → control chart
► control variable [PREP] → variable
► controlled experiment [EXDE] → factor

► convergence [OPTIM]
The end of an iterative algorithm, reached when the change in the estimated parameter values or in the value of the function being optimized from one iteration step to the next is smaller than a prespecified limit value. In minimizing a function f(p) with respect to a vector of parameters p, the following convergence criteria are most common:

- the change in function value is small:
| f(p_i+1) - f(p_i) | < ε

- the change in parameters is small:

|| p_i+1 - p_i || < ε

- the gradient vector is small:

|| g_i+1 || < ε    or    max_j [ jth component of g_i+1 ] < ε

where i is the index of iterations and || . || indicates the vector norm.
► convex admissibility [CLUS] → assessment of clustering (admissibility properties)
► Cook’s influence statistic [REGR] → influence analysis
► Coomans’ plot [GRAPH] → scatter plot
► Coomans-Massart clustering [CLUS] → non-hierarchical clustering (density clustering)

► correlation [DESC]
Interdependence between two variables x and y. The correlation coefficient r_xy is a measure of this interdependence. It ranges between -1 and +1. A zero value indicates absence of correlation, while -1 or +1 indicate perfect negative or positive correlation. The correlation matrix R is a symmetric matrix of the pairwise correlation coefficients of several variables. Correlation calculated on ranks is called rank correlation. Underestimation of the correlation between two variables x and y due to the error in the observations is called attenuation. A corrected correlation estimate is:

r*_xy = r_xy / √( r_xx r_yy )

where r_xy is the geometric mean of the correlations between independent determinations of x and y, and r_xx and r_yy are the means of the correlations between independent determinations of x and of y, respectively. If correlation is found even though the observations come from uncorrelated variables, such a quantity is called chance correlation or spurious correlation. Partial correlation is the correlation between two variables after the effect of some other variable, on which both depend, has been removed. It is most often encountered in variable subset selection. For example, the correlation coefficient between the residuals from the models y = f(x) and z = f(x) is denoted r_yz.x, measuring the strength of the relation between y and z after their dependence on x has been removed from both of them. Multiple correlation is a correlation between two variables where one of the variables is a linear combination of several other variables. For example, the goodness-of-fit of a linear regression model, measured by the squared multiple correlation coefficient, is the correlation between the response and the linear combination of the predictors. The most common correlation coefficients are:
agreement coefficient
Generalization of Kendall’s τ coefficient, defined as:

u = 8S / [ k(k - 1) n(n - 1) ] - 1

where k is the number of observers providing paired comparisons of n objects and S is the sum of the number of agreements between pairs of observers. This coefficient is equal to Kendall’s τ coefficient when k = 2.
Kendall’s τ coefficient
Rank correlation defined as:

r_τ = Σ_i S_i / [ 0.5 n(n - 1) ]

where S_i indicates the number of inversions in rank(y) compared to rank(x). It is calculated as follows:
- write the ranks of the observations on y in increasing order of the ranks on x;
- calculate S_i as: S_i = +1 if rank(y_i) < rank(y_i+1), S_i = -1 otherwise.

Pearson’s correlation coefficient (: product moment correlation coefficient)
Measure of linear association between x and y:

r_xy = s_xy / (s_x s_y)

with x̄ = Σ_i x_i / n and ȳ = Σ_i y_i / n; s_x, s_y and s_xy are the standard deviations of x and y and the covariance between x and y, respectively.

product moment correlation coefficient : Pearson’s correlation coefficient

Spearman’s ρ coefficient
Rank correlation defined as:

r_ρ = 1 - 6 Σ_i [ rank(x_i) - rank(y_i) ]^2 / [ n(n^2 - 1) ]
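The three coefficients above can be computed, for example, with SciPy (assuming SciPy is available; the data are illustrative):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(8)
    x = rng.standard_normal(30)
    y = 2 * x + rng.standard_normal(30)

    print(stats.pearsonr(x, y)[0])    # product moment correlation
    print(stats.spearmanr(x, y)[0])   # rank correlation (Spearman's rho)
    print(stats.kendalltau(x, y)[0])  # rank correlation (Kendall's tau)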
► correlation analysis [MULT] : canonical correlation analysis
► correlation coefficient [DESC] → correlation
► correlation matrix [DESC] → correlation
► correlogram [TIME] → autocovariance function
► correspondence analysis [FACT] : correspondence factor analysis
► correspondence factor analysis (CFA) [FACT] (: correspondence analysis)
Multivariate method, very popular in the French school of data analysis, that finds the best simultaneous representation of the rows and columns of a data matrix. It is especially appropriate for the qualitative data of contingency tables, although its applicability can be extended to any data matrix of positive numbers. Similar to principal component analysis, correspondence analysis is a data reduction technique that displays the data structure projected onto a two- (or three-) dimensional factor space. This plot is an exploratory tool for studying the relationship between the rows and columns of a data matrix, i.e. the object-variable interaction. The method differs from principal component analysis in treating the rows and columns of the data matrix in a symmetric fashion via special scaling, i.e. calculating column and row profiles. The similarities among the row profiles and among the column profiles are measured by the chi squared distance. The latter differs from the Euclidean distance only by a weighting factor, i.e. each term is weighted by the inverse of the total of the profiles corresponding to the term. The squared distance between rows s and t is:

d_st^2 = Σ_j (f_sj - f_tj)^2 / f_.j

where f_sj and f_tj are the profiles and f_.j is the sum of the profiles over the rows in column j. In symmetric fashion, the squared distance between columns p and q is:

d_pq^2 = Σ_i (f_ip - f_iq)^2 / f_i.

Correspondence analysis calculates the eigenvalues λ and eigenvectors v of a matrix P whose elements are derived from these profiles.
The goal is to project both the rows and the columns of the data matrix onto the same axis v, calculated as

P v = λ v

approximating the following ideal situation:
- each column is the barycenter of the rows, weighted by the corresponding profile values;
- each row is the barycenter of the columns, weighted by the corresponding profile values.
As on a biplot, it is legitimate to interpret the distances between rows and between columns. It is also legitimate to interpret the relative position of a column with respect to all the rows. However, the proximity of a row and a column cannot usually be directly interpreted. The center of gravity located at the origin of the axes corresponds to the average profiles of rows and columns. Additional (test) rows and columns can be added to the plot by projecting their profiles onto the principal component axes.

► cosine coefficient [GEOM] → distance (quantitative data)
► count data [PREP] → data
► counting process [TIME] → stochastic process

► covariance [DESC]
The first product moment of two variables x and y about their means. For a population with means μ_x and μ_y it is defined as:

σ_xy = Σ_i (x_i - μ_x)(y_i - μ_y) / n

In a sample the population means are estimated by x̄ and ȳ and the covariance is estimated as:

s_xy = Σ_i (x_i - x̄)(y_i - ȳ) / n

s_xx is the variance of the variable x. To obtain an unbiased estimate, n is replaced by n - 1 in the denominator. The covariance is scale dependent, taking values between -∞ and +∞. In the case of multivariate data, the pairwise covariance values are arranged in a covariance matrix.
► covariance matrix [DESC] (: variance-covariance matrix)
Symmetric matrix S(p,p) of pairwise covariances. It is a measure of the dispersion of multivariate data. The diagonal elements are the variances of the corresponding variables. It is the scatter matrix divided by the degrees of freedom. If the variables are autoscaled the covariance matrix equals the correlation matrix. Depending on the mean estimates and on the observations included, the following covariance matrices are of practical importance:

between-centroid covariance matrix
Measure of dispersion of the group centroids c_g, g = 1, G around the overall barycenter c. Its elements are calculated from B, the between-group scatter matrix.

between-group covariance matrix (: inter-group covariance matrix)
Measure of dispersion of the groups around the overall barycenter c. When all G groups have the same number of objects, i.e. n_g = n/G, this matrix equals the between-centroid covariance matrix.

group covariance matrix : within-group covariance matrix

covariance matrix about the origin
Measure of dispersion of the variables around the origin; its elements are sums of cross products of the uncentered variables, Σ_i x_ij x_ik, divided by the degrees of freedom.

inter-group covariance matrix : between-group covariance matrix

intra-group covariance matrix : within-group covariance matrix

mean group covariance matrix
The average of the within-group covariance matrices:

S̄ = Σ_g S_g / G
pooled covariance matrix
The mean dispersion of the groups:

S = Σ_g (n_g - 1) S_g / (n - G)

When all G groups have the same number of objects, i.e. n_g = n/G, the pooled covariance matrix equals the mean group covariance matrix.

within-group covariance matrix (: group covariance matrix, intra-group covariance matrix)
The dispersion of the n_g objects in group g around its centroid c_g, denoted S_g. Its elements are:

s_jkg = Σ_i (x_ijg - c_jg)(x_ikg - c_kg) / (n_g - 1) = w_jkg / (n_g - 1)

where W is the within-group scatter matrix.

► covariance matrix about the origin [DESC] → covariance matrix
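A small sketch computing within-group, mean group, pooled and between-centroid covariance matrices (assuming NumPy; the data are illustrative, and the divisor G - 1 used for the between-centroid matrix is one common convention rather than a definition taken from the handbook):

    import numpy as np

    rng = np.random.default_rng(9)
    groups = [rng.normal(loc=m, scale=1.0, size=(n_g, 3)) for m, n_g in [(0, 20), (2, 30), (5, 25)]]

    within = [np.cov(Xg, rowvar=False) for Xg in groups]       # within-group covariance matrices
    mean_group = sum(within) / len(within)                      # mean group covariance matrix
    n = sum(Xg.shape[0] for Xg in groups)
    pooled = sum((Xg.shape[0] - 1) * Sg for Xg, Sg in zip(groups, within)) / (n - len(groups))
    centroids = np.array([Xg.mean(axis=0) for Xg in groups])
    between_centroid = np.cov(centroids, rowvar=False)          # dispersion of the centroids
    print(pooled.round(2))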
► covariate [PREP] → variable
► covarimin rotation [FACT] → factor rotation
► covering clustering [CLUS] → non-hierarchical clustering (optimization clustering)
► COVRATIO [REGR] → influence analysis
► Cox-Stuart’s test [TEST] → hypothesis test
► Cramer-von Mises’s test [TEST] → hypothesis test
► critical region [TEST] → hypothesis testing
► critical value [TEST] → hypothesis testing
cross-correlationfunction [TIME]
b cross-correlation function [TIME] + cross-covariancefunction b cross-correlogram [TIME] + cross-covariancefunction b cross-covariance function [TIME] Autocovariance style function defined for bivariate time series [x(fi), y(ti), i = 1, n]:
Y x y R 4 = cov [x
, Y ( S ) I = E [(x - Px (0)( Y W - Py(S))]
The cross-correlation function is:
If the process is stationary, the above functions depend only on t = t - s, therefore:
The plot of p x y ( t )against t is called a cross-correlogram.
► crossed effect term [ANOVA] → term in ANOVA
► crossed model [ANOVA] → analysis of variance
► cross-over design [EXDE] → design
► cross-validated R2 [MODEL] → goodness of prediction
► cross-validated residual [REGR] → residual
► cross-validation (cv) [MODEL] → model validation
► cube point [EXDE] → design (composite design)
► cumulative distribution function (cdf) [PROB] → random variable
► cumulative frequency [DESC] → frequency
► curve fitting [MODEL] → model fitting
► cusum chart [QUAL] → control chart
► cycle [MISC] → graph theory
► cyclical design [EXDE] → design
► cyclical variable [PREP] → variable
► D-optimal design [EXDE] → design (optimal design)
► D’Agostino’s test [TEST] → hypothesis test
► Daniel’s test [TEST] → hypothesis test
► data [PREP]
Statistical information recorded as one or more measurements or observations on an object. Multivariate data constitute a row of the data matrix.

bivariate data
Data composed of two measurements or observations recorded on an object.

categorical data (: non-metric data, qualitative data)
Data that consist of observations of qualitative characteristics, defined by categories.
censored data
Data where certain values are not measurable or not observable, although they exist. For such data only bounds of the true value are available. When only a lower bound is given, the data are called right-censored, while data specified by an upper bound are called left-censored.

circular data (: spherical data)
Data ordered according to a variable which cyclically repeats itself in time or space. Data of this type have no natural starting point and can be represented on a circumference, hence the other name: spherical data. The mean can be defined in several ways: for instance, by finding the interval for which the sum of the deviations is zero. The mean trigonometric deviation is a measure of dispersion.

closed data
Multivariate data that add up to the same constant in each object, i.e. the p measurements are fully determined by only p - 1 of them. If this is not true, the data are called open data. Closed data where the measurements add up to one are called percent data. For example, the amounts of components in a mixture are percent data. Closed data, defined as the ratio between a specific value and a reference total value, are called proportional data. Several multivariate methods give different results depending on whether they are applied to closed data or to open data. To avoid the closure problem, one of the variables should be deleted or the variables should be transformed to logarithmic ratios.

coded data
Data obtained by multiplying (dividing) by and/or adding (subtracting) a constant, in order to convert the original measurements into more convenient values.

count data
Data of integer numbers counting the occurrences of an event. This type of data often follows a binomial distribution; therefore, if normality is assumed, a special transformation is required.

grouped data
Data reported in the form of frequencies, i.e. counts of occurrences within cells defined by a set of cell boundaries. Grouped data are often collected in contingency tables.

metric data : numerical data

mixed data
Multivariate data including both numerical and categorical data.
multivariate data
Data composed of more than one measurement or observation recorded on an object.

non-metric data : categorical data

numerical data (: quantitative data, metric data)
Data that consist of numerical measurements or counts.

qualitative data : categorical data

quantitative data : numerical data

spherical data : circular data

univariate data
A single datum, i.e. a single measurement or observation recorded on an object.

► data analysis [DESC]
Obtaining information from measured or observed data. Exploratory data analysis (EDA) is a collection of techniques that reveal (or search for) structure in a data set before calculating any probabilistic model. Its purpose is to obtain information about the data distribution (univariate or multivariate) and about the presence of outliers and clusters, and to disclose relationships and correlations between objects and/or variables. Examples are principal component analysis, cluster analysis and projection pursuit. EDA is usually enhanced with graphical analysis. Confirmatory data analysis tests hypotheses, calculates probabilistic models, and makes statistical inference, providing statements of significance and confidence.
► data matrix [PREP] → data set
► data point [PREP] : object

► data reduction [MULT]
Procedure that results in a smaller number of variables describing each object. There are two types of data reduction. If the new variables are simply a subset of the original variables, the procedure is called feature reduction or variable reduction. Such data reduction often forms part of classification or regression analysis, e.g. variable subset selection regression methods, stepwise linear discriminant analysis. The other type of data reduction, called dimensionality reduction, involves calculating a new, smaller number of variables as functions of the original ones. If the
procedure is linear, the new variables are linear combinations of the original variables. Principal component analysis, factor analysis, correspondence factor analysis and spectral map analysis belong to this group. In contrast, multidimensional scaling, nonlinear mapping and projection pursuit give a low-dimensional representation of the objects via nonlinear dimensionality reduction. Dimensionality reduction techniques are called display methods when they are accompanied by graphical analysis. In that case the goal is to represent multidimensional objects in lower dimensions (usually two or three), retaining as much as possible of the high-dimensional structure, such that the data configuration can be visually examined (e.g. on a scatter plot). There are two basic groups: the projection method calculates the new coordinates as linear combinations of the original variables, as opposed to mapping, where the two or three new coordinates are nonlinearly related to the original ones.

► data set [PREP]
Set of objects described by one or more variables. A data set is often considered as a sample from a population and the sample parameters calculated from the data set are taken as estimates of the population parameters. A data set can be split into two parts. The training set, also called the learning set, is a set of objects used for modeling. The set of objects used to check the goodness of prediction of a model calculated from the training set is called the evaluation set. The test set is a new set of data, a new sample from the population. Predictions for the test objects (e.g. class assignment, response value) are obtained using the model calculated on the training set. A data set is usually presented in the form of a data matrix. Each object corresponds to a row of the matrix and each variable constitutes a column.
► David-Barton’s test [TEST] → hypothesis test
► Davidon-Fletcher-Powell optimization (DFP) [OPTIM] → variable metric optimization
► David’s test [TEST] → hypothesis test
► decile [DESC] → quantile
► decision boundary [CLAS] : class boundary
► decision function [MISC] → decision theory
► decision rule [MISC]
→ decision theory
► decision surface [CLAS]
: class boundary
► decision theory [MISC]
The theory of finding a decision which is an optimum with respect to the consequence of the decision. Three spaces are considered in decision theory. The elements of the parameter space θ describe the state of the system. The elements of the sample space x provide information about the true value of θ. Finally, the elements of the decision space d are the possible actions to take. Decisions are determined by the decision function δ, also called the decision rule, based on the sample:

d = δ(x)

The consequence of a decision is formalized by the loss function L(θ, d) or L(θ, δ(x)); it is a real, nonnegative function valued on the parameter and decision spaces. In the case of discrete parameter and decision spaces (e.g. a classification problem) the loss function is a matrix, called the loss matrix; otherwise the loss function is a surface. The expected value of the loss function is called the risk function:

R_δ(θ) = E_x[ L(θ, δ(x)) ]

Given the conditional distribution F(x | θ), the risk function is:

R_δ(θ) = ∫ L(θ, δ(x)) dF(x | θ)

The optimal decision function δ* is found by minimizing R_δ(θ). If δ* minimizes R_δ(θ) for every value of θ, it is called the uniformly best decision function. When such a function does not exist, the Bayes or minimax rules can be employed. The former defines the δ* that minimizes the average risk:

δ* = min_δ E_θ[ E_x[ L(θ, δ(x)) ] ]

while the latter defines the δ* that minimizes the maximum risk:

δ* = min_δ max_θ [ E_x[ L(θ, δ(x)) ] ]

► decision tree [GRAPH]
Graphical display of decision rules. Decision rules usually generate a tree of successive dichotomous cuts. The most common one is the binary decision tree.
► defect [QUAL]
→ lot
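A small numerical sketch of the Bayes and minimax rules for a discrete problem follows; the loss matrix and the prior over the states are invented for illustration only, and the decision is treated as a constant action (the dependence on a sample x is omitted for simplicity).

import numpy as np

# rows: states of the system (parameter space), columns: possible actions (decision space)
L = np.array([[0.0, 4.0],
              [2.0, 0.0]])            # illustrative loss matrix
prior = np.array([0.7, 0.3])          # illustrative prior over the states

risk = prior @ L                      # average (Bayes) risk of each action
bayes_action = int(np.argmin(risk))   # action minimizing the average risk

worst = L.max(axis=0)                 # maximum risk of each action over the states
minimax_action = int(np.argmin(worst))  # action minimizing the maximum risk

print(bayes_action, minimax_action)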
► defective item [QUAL]
→ lot
► defective unit [QUAL]
→ lot
► defining relation [EXDE]
→ confounding
► degrees of freedom (df) [ESTIM]
The number of independent pieces of information necessary to calculate a single statistic or a model with several parameters. For example, the mean has one degree of freedom; an OLS model with p parameters has p degrees of freedom. For further details see degrees of freedom in ANOVA and model degrees of freedom.
► degrees of freedom in ANOVA (df) [ANOVA]
Column in the analysis of variance table containing the number of independent pieces of information needed to estimate the corresponding term. The number of degrees of freedom associated with a main effect term is one less than the number of levels the term can assume. The number of degrees of freedom associated with a crossed effect term is the product of the numbers of degrees of freedom associated with the main effects in the term. The number of degrees of freedom associated with a nested effect term is one less than the number of levels of the nested effect, multiplied by the number of levels of the other (outer) effect. The total sum of squares around the grand mean has a number of degrees of freedom which is one less than the total number of observations. For example, in a crossed two-way ANOVA model with n observations and factors A (I levels) and B (J levels) the numbers of degrees of freedom are partitioned among the terms as:

Term    A       B       AB              Error    Total
df      I - 1   J - 1   (I - 1)(J - 1)  n - IJ   n - 1

In a two-stage nested ANOVA model with n observations the numbers of degrees of freedom are:

Term    A       B(A)       Error    Total
df      I - 1   I(J - 1)   n - IJ   n - 1

► deletion residual [REGR]
→ residual
► demerit system [QUAL]
Weighting system of defects that takes their severity into consideration. Because defects are not of equal importance, it is not only the number of defects but also their severity that influences whether an item is judged as conforming or nonconforming. For example: defects are classified into four categories A, B, C, D; weights are assigned as 100 for A (the most severe), 50 for B (less severe), 10 for C and 1 for D (the least severe); the value of demerits is then defined as:

D = 100 d_A + 50 d_B + 10 d_C + d_D

where d_A, d_B, d_C and d_D are the numbers of defects of the various degrees of severity.
► dendrogram [GRAPH]
Graphical representation of a hierarchy of clusters resulting from a hierarchical cluster analysis, in which the edges of the tree are associated with numerical values (e.g. distance or similarity measures). In the accompanying figure the two most similar objects are (B, C). At the 0.5 similarity level (indicated by a dotted line), for example, five clusters are formed: CA (A), CB (B, C, D, E, F), CG (G, H), CI (I, J, K), CL (L, M, N). CA, called a singleton, is the most distinct object. Among these five clusters, CG and CI are the most similar, i.e. CG and CI will be connected next. At the 0.2 similarity level all objects are combined into one single cluster. In a dendrogram resulting from cluster analysis, each node represents a cluster and each terminal node, called a leaf, represents an object (or variable). The graph-theoretical distances between the root of the dendrogram (single cluster) and the leaves (objects) define the quality of the tree, i.e. the quality of the clustering result. The maximum distance is called the height of the tree. A dendrogram with a height close to the largest distance is a chained tree, while a dendrogram with minimal height is a balanced tree. A linear dendrogram is a graphical representation of multiple comparison results.
[Figure: linear dendrogram of multiple comparison results for the values 1.23, 1.45, 2.18, 2.44, 2.50 and 3.78.]
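A brief Python sketch of producing a dendrogram from a hierarchical cluster analysis and cutting it at a given level is shown below; the synthetic two-group data, the group average linkage and the cutting level are illustrative choices only.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (5, 2)),
               rng.normal(3.0, 0.3, (5, 2))])      # synthetic data: two groups of objects

Z = linkage(X, method='average')                   # group average linkage on Euclidean distances
labels = fcluster(Z, t=1.0, criterion='distance')  # cut the tree at distance 1.0
print(labels)                                      # cluster membership of each object
# dendrogram(Z)  # draws the tree when a matplotlib figure is available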
► density clustering [CLUS]
→ non-hierarchical clustering
► density estimator [ESTIM]
→ estimator
► density function [PROB]
→ random variable
► density map [GRAPH]
→ scatter plot
► dependent variable [PREP]
→ variable
► descriptive statistics [DESC]
→ statistics
► descriptor [PREP]
: variable
► design [EXDE]
Location of experimental points in the predictor space. A design consists of two basic structures: the treatment structure and the design structure. The treatment structure is defined by the factors and their level combinations, called treatments. The design structure is the arrangement of experimental units into blocks. A design is specified by the random assignment of treatments from the treatment structure to the experimental units from the design structure. Hence design means the choice of the two structures and the method of random assignment. In regression terms it corresponds to choosing the predictor variables and fixing their values in the matrix of observations. A design is collected in a design matrix. The most important designs are listed below:
additive design
Design in which there are no interactions of any type, i.e. the factors influencing the response have an additive effect. In such a design a change from one level of a factor to another causes the same change in the outcome of the experiment, independent of the levels of the other factors and blocking variables.
asymmetric design
Design in which the number of levels is not the same for all the factors.
axial design
Design in which, in contrast to the boundary design, most design points are positioned on the factor axes. The most common axial design is for complete mixtures, in which most of the design points are inside the simplex, positioned on the component axes. The simplest axial design has points equidistant from the centroid towards each of the vertices. In a design with q components the axis of component j is the imaginary line extending from the base point

x_j = 0,  x_k = 1/(q - 1)  for all k ≠ j

to the vertex

x_j = 1,  x_k = 0  for all k ≠ j

For example, a three-component axial design has design points which are equidistant from the centroid on the three axes.
balanced factorial design
Factorial design in which each treatment is run an equal number of times. In a balanced design the effects of factors and interactions are orthogonal.
block design : randomized block design
boundary design
Design in which the design points are on the boundaries (vertices, edges, faces) of the cube or the simplex.
Box-Behnken design
Design for fitting a second-order model, generated by combining a two-level factorial design with a balanced incomplete block design (BIBD). For example, a three-factor, three-block BIBD, in which each factor appears twice in the design, always paired with a different factor, combined with a 2² complete factorial design, gives the following design matrix:

Run   A    B    C
1     +    +    0
2     +    -    0
3     -    +    0
4     -    -    0
5     +    0    +
6     +    0    -
7     -    0    +
8     -    0    -
9     0    +    +
10    0    +    -
11    0    -    +
12    0    -    -

Box-Draper design
Saturated design of K factors and N = (K + 1)(K + 2)/2 runs for fitting a second-order model. For example, if K = 2, the six runs are:

(-1, -1)   (+1, -1)   (-1, +1)   (-d, -d)   (+1, 3d)   (3d, +1)

where d = 0.1315. The optimal position of the design points is found by maximizing the determinant |XᵀX| (equivalently, by minimizing |(XᵀX)⁻¹|).
complementary design
Fractional factorial design in which the design generators contain minus signs. For example, the 2⁴⁻¹ design has two half fractions, the first generated by D = ABC and the second by D = -ABC. The second design is a complementary design to the first one.
complete factorial design (CFD) (: full factorial design)
Factorial design in which there is at least one run for each combination of factor levels. For example, a 2³ design with 8 runs studies the effect of three factors, each with two levels. This design allows one to study the effect of all three factors and the effect of all interactions. The model is:

y_ijk = μ + A_i + B_j + C_k + (AB)_ij + (AC)_ik + (BC)_jk + (ABC)_ijk + e_ijk

where i, j, k = 1, 2.
completely randomized design (: randomized design)
Design in which the runs are assigned randomly to treatments. The runs are viewed as a random sample from a normal distribution. There are no restrictions on randomization due to blocking.
composite design
Two-level factorial design augmented with further runs. This permits the fitting of a second-order model and the estimation of the quadratic effects. If the optimum is assumed to be near the center of the cube, the added runs are center points and star points and the design is called a central composite design (CCD).
[Figure: a central composite design shown as the superposition of cube, star and center points.]
If the optimum is assumed to be close to a vertex of the cube, the added runs should be near that vertex and the design is called a noncentral composite design. The design matrix of a central composite design is composed of cube points, center points and axial points (or star points). Cube points are the runs of the original factorial design with factor levels +1 and -1. The factor levels of a center point are denoted by 0. Given K factors, there are 2K axial points added to the design, with level ±a in one factor and 0 in the others. For example, the 2³ design can be augmented with two center points (0, 0, 0) and (0, 0, 0) and six axial points (±a, 0, 0), (0, ±a, 0) and (0, 0, ±a). Adding only center points to the cube points allows a test to be made for curvature in the response surface.
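A short Python sketch of assembling a central composite design matrix for K = 3 factors (cube, center and axial points) is given here; the axial level a = 1.682 and the number of center points are illustrative values, not prescriptions.

import numpy as np
from itertools import product

K = 3
a = 1.682                                              # illustrative axial level
cube = np.array(list(product([-1.0, 1.0], repeat=K)))  # 2^K cube points
center = np.zeros((2, K))                              # two center points
axial = np.vstack([sign * a * np.eye(K)[i]
                   for i in range(K) for sign in (+1.0, -1.0)])  # 2K axial (star) points
ccd = np.vstack([cube, center, axial])
print(ccd.shape)                                       # (16, 3): 8 cube + 2 center + 6 axial runs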
cross-over design
Design in which time periods constitute factors. If there are N treatments to be tested in K time periods, then N x K experimental units are needed.
cyclical design
Balanced factorial design in which blocks are generated by cyclical permutation of the treatments.
distance-based optimal design
Optimal design in which a new set of points is selected so as to maximize the minimum distance from the previously selected points. This design results in a uniform distribution of the points finally selected, and it does not require any assumption about the regression model used.
Doehlert design (: uniform shell design)
Design for fitting a second-order model that consists of points uniformly spaced on concentric spheres. Such points are generated from a regular simplex by taking differences among its vertices. The number of factors is K = 2, ..., 10; the number of runs is N = K² + K plus a center point.
equiradial design
Design that consists of two or more sets of design points such that the points in each set are equidistant from the origin. These point sets are called equiradial sets.
factorial design (FD)
Design in which each level of each factor is combined with each level of every other factor under consideration. If, for example, three factors are investigated, and there are L1 levels of the first factor, L2 levels of the second factor and L3 levels of the third factor, then L1 x L2 x L3 treatments must be run. Factorial designs are specified by the number of factors, e.g. two-factor factorial design, three-factor factorial design, etc., and by the number of levels of the factors, e.g. two-level factorial design, three-level factorial design, etc. (assuming an equal number of levels in all factors). The most important one is the two-level factorial design. There are two kinds of factorial designs: complete factorial design and fractional factorial design. The advantage of factorial design is that the effects of several factors can be investigated simultaneously. Its disadvantage is that the number of treatments increases rapidly with increasing number of factors or levels.
first-order design
Design for fitting a model in which the response is a linear function of the factors, i.e. the response surface can be approximated by a first-order polynomial. The regression coefficients of the first-order polynomial dictate the direction to be taken to optimize the response. For example, in a two-factor (A and B) first-order design the model is:

y = b0 + b1 A + b2 B

The response surface is a tilted plane with intercept b0 and slopes b1 and b2. The contour lines of such a plane are equally spaced parallel straight lines.
folded design (: fold-over design, reflected design)
Two-level factorial design of N runs in which the second N/2 runs are mirror images of the first N/2 runs, i.e. with opposite levels in each factor. Furthermore, the N/2 runs can be folded only on specified factors, i.e. only the specified factors have opposite levels; the rest of the factors are repeated with the same levels.
fold-over design : folded design
fractional factorial design (FFD) (: partial factorial design)
Factorial design that consists of only a fraction of the total number of runs. When the fraction of runs is selected appropriately, the effects of factors and even of some interactions can be estimated with fewer runs. The assumption is that the effect of the majority of interactions is negligible. Conventionally a fractional factorial design is referred to as an L^(K-J) design, where L denotes the number of levels of the factors, K denotes the total number of factors and J denotes the number of factors generated by confounding. The number of runs in this design is L^(K-J). These designs are generated by first writing down a complete factorial design in Yates order with the maximum number of factors possible (K - J), then calculating the levels of the remaining factors according to the design generators.
full factorial design : complete factorial design
Graeco-Latin square design
Design for studying the effect of one factor and three blocking variables. The blocking variables and the factor must have equal numbers of levels. The assumption is that the three blocking variables and the factor are additive, i.e. there is no interaction among them. This design can eliminate three extraneous sources of variation. The design matrix can be generated from a Latin square design by adding a third blocking variable such that each run with the same factor level receives a different level of the third blocking variable. An example is a four-level Graeco-Latin square design in which A, B, C are the three blocking variables and D is the factor.
Hartley design
Composite design in which the cube portion is of resolution III, i.e. with the restriction that two-factor interactions must not be aliased with other two-factor interactions. This design permits much smaller cubes compared to other composite designs with higher resolution cubes.
hierarchical design : nested design
Hoke design
Design containing three or more factors with three levels for fitting a second-order model. It is generated from partially balanced saturated fractions of the 3^K design. This design compares favorably with the Box-Behnken or the Hartley designs.
hybrid design : Roquemore design
hyper-Graeco-Latin square design
Design for studying the effect of one factor and more than three blocking variables. The design matrix can be generated from a Graeco-Latin square design by adding blocking variables to it.
Knut-Vik square design
Special Latin square design in which the permutation of the factor levels is generated by moving three levels forward instead of one. The pattern of the factor levels resembles the knight's move in chess. An example is a five-level Knut-Vik square design in which A and B are two blocking variables and C is the factor.
Latin square design
Design for studying the effect of one factor and two blocking variables. The blocking variables and the factor must have equal numbers of levels. The assumption is that the two blocking variables and the factor are additive, i.e. there is no interaction among them. The levels of the factor are assigned randomly to the combinations of blocking variable levels such that each level of a blocking variable receives a different level of the factor. This design eliminates two extraneous sources of variation. An example is a four-level Latin square design in which A and B are the two blocking variables and C is the factor.
mixture design
Design for the mixture problem, i.e. when the response depends only on the proportions of the mixture components and not on their total amount. In such a design the sum of the amounts of the components is constant over the design points. Due to this constraint, a simplex coordinate system is used for the geometric description of the factor space. Usually the amounts of the components are standardized, i.e. their sum is 1. A q-component mixture design has points on or inside the boundaries of a (q - 1)-dimensional simplex with edges of unit length. The points inside the simplex represent complete mixtures, i.e. mixtures in which all components are present, while points on the boundaries of the simplex represent mixtures in which only a subset of components is present. The proportion of the components is often restricted by a lower or upper bound, or by both. In that case the design points can be positioned only on a subregion of the simplex, usually on the extreme vertices. The most commonly used mixture designs are the simplex centroid design and the simplex lattice design.
nested design (: hierarchical design)
Design in which the levels of factors are not completely crossed but are combined in a hierarchical manner.
optimal design
Design aimed at minimizing the variance of the estimates of the effects. This variance depends on the size and sphericity of the confidence region. There are various criteria for defining design optimality, which measure some combination of these two characteristics.
o A-optimal design minimizes the mean square error of the coefficients:

min [A] = min [ tr((XᵀX)⁻¹) ] = min [ Σ_j λ_j ]

o D-optimal design minimizes the generalized variance, i.e. the volume of the confidence ellipsoid of the coefficients:

min [D] = min [ |(XᵀX)⁻¹| ] = min [ Π_j λ_j ]

where X denotes the design matrix and λ_j the jth eigenvalue of (XᵀX)⁻¹.
o E-optimal design minimizes the largest eigenvalue:

min [E] = min [ λ_1 ]

The desirability of a design increases as the determinant |XᵀX| increases, i.e. as the A, D and E criteria decrease.
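The three criteria can be evaluated numerically for any candidate design matrix. The Python sketch below uses a 2² factorial design with an intercept column purely as an illustration; the matrix is an assumption of the example, not a recommendation.

import numpy as np

X = np.array([[1, -1, -1],
              [1,  1, -1],
              [1, -1,  1],
              [1,  1,  1]], dtype=float)     # illustrative design matrix with an intercept column

M_inv = np.linalg.inv(X.T @ X)               # (X'X)^-1
eigenvalues = np.linalg.eigvalsh(M_inv)      # eigenvalues of (X'X)^-1

A = eigenvalues.sum()                        # A-criterion: trace of (X'X)^-1
D = eigenvalues.prod()                       # D-criterion: determinant of (X'X)^-1
E = eigenvalues.max()                        # E-criterion: largest eigenvalue of (X'X)^-1
print(A, D, E)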
orthogonal design
Design in which the factors are pairwise orthogonal. Two factors are orthogonal if the sum over the runs of the product of their levels is zero. Orthogonal designs are desirable because they offer greater precision in the estimates of the parameters, since these estimates are uncorrelated. All two-level factorial designs are orthogonal designs.
partial factorial design : fractional factorial design
Plackett-Burman design
Saturated, orthogonal, resolution III, two-level fractional factorial design for up to K = N - 1 factors, in which N, the number of runs, must be a multiple of four. The size of the design varies between K = 3, N = 4 and K = 99, N = 100. This design can be generated by shifting the first row cyclically one place K - 1 times. The last row has a minus sign in all factors.
randomized block design (: block design)
Design in which the runs are grouped into blocks such that the runs are assumed to be homogeneous within blocks. Runs are assigned randomly to treatments within blocks. The effect of blocking is assumed to be orthogonal to the effect of factors. It is important to form the blocks on the basis of a variable which is logically related to the outcome of the experiment, because the purpose of blocking is to control the variation of this variable. Effective blocking reduces the residual error, i.e. the denominator of the F test. This design is preferred over the completely randomized design if the within-block variability is smaller than the between-block variability. In this design the blocking is done in a single direction, which removes one source of variation. Other designs can handle blocking in several directions, e.g. the Latin square design has two-way blocking, and the Graeco-Latin square design has three-way blocking. A randomized block design in which each treatment is run in each block, i.e. the block size equals the number of treatments, is called a complete block design. In contrast, a randomized block design in which not all treatments are run in each block is called an incomplete block design. A special case is the balanced incomplete block design (BIBD), in which each block has the same number of runs, each treatment is replicated the same number of times, and each pair of treatments occurs together in a block the same number of times as any other pair of treatments.
randomized design : completely randomized design
reflected design : folded design
Roquemore design (: hybrid design)
Design for fitting a second-order model with K factors, generated from a central composite design with K - 1 factors. It has the same degree of orthogonality as the parent central composite design but is also near-saturated and near-rotatable.
rotatable design
Design in which the variance of the predicted response at any point of the design depends only on the distance of the point from the design center, and not on the direction. The variance contours are concentric hyperspheres.
saturated design
Fractional factorial design in which the number of runs is equal to the number of terms to be fitted. Examples are the Plackett-Burman design and the Box-Draper design.
second-order design
Design for fitting a model in which the response is a linear function of the factors, of the factor interactions and of the squared factors, i.e. the response surface is approximated by a second-order polynomial. For example, in a two-factor, second-order factorial design the model is:

y = b0 + b1 A + b2 B + b3 A² + b4 B² + b5 AB

The response surface is curved; it can have a single maximum or minimum with circular contour lines, a stationary ridge with nonequidistant parallel contour lines, a rising ridge with parabola contour lines, or a saddle point with double hyperbola contour lines.
sequential design
Design in which, based on the results of statistical analysis, the original first-order design is augmented to a second-order design suitable for fitting a full second-order model. For example, after discovering strong interaction effects in a 2³ factorial design, axial points and center points can be added to examine second-order effects using the newly generated composite design.
simplex centroid design
Mixture design with q components consisting of 2^q - 1 design points. Each point is a mixture of a subset of components in equal proportions. Such mixture points are located at the centroids of the lower-dimensional simplexes. For example, the seven design points of a three-component (q = 3) simplex centroid design are:

(1, 0, 0)  (0, 1, 0)  (0, 0, 1)  (1/2, 1/2, 0)  (1/2, 0, 1/2)  (0, 1/2, 1/2)  (1/3, 1/3, 1/3)

simplex lattice design
Mixture design with q components for fitting a polynomial equation describing the response surface over the simplex region. A lattice is a set of design points ordered and uniformly distributed on a simplex. A lattice design is characterized by the number of mixture components q and by the degree of the polynomial model m. The [q, m] simplex lattice design contains all possible combinations of the points

x_i = 0, 1/m, 2/m, ..., 1

For example, the six points of the q = 3, m = 2 simplex lattice design are:

(1, 0, 0)  (0, 1, 0)  (0, 0, 1)  (1/2, 1/2, 0)  (1/2, 0, 1/2)  (0, 1/2, 1/2)
split-plot design
Generalization of the randomized block design for the case when complete randomization within blocks is not possible. The simplest design consists of two factors and one blocking variable. Instead of assigning the treatments to experimental units in a completely randomized way within a block, only the levels of one factor are chosen randomly; the levels of the other factor are varied along the runs in a systematic fashion. The levels of this latter factor are called main treatments, and the collection of the experimental units running with the same level of this factor is called the whole plot. Each whole plot is then divided randomly into split-plots, i.e. the experimental units are assigned randomly to the levels of the other factor, called the subplot treatment. The terminology obviously originated from the first agricultural applications of this design.
symmetric design
Design in which all factors have equal numbers of levels.
two-level factorial design
Factorial design in which each factor has only two levels. This design is of special importance because with relatively few runs it can indicate major trends, and it can be a building block in developing more complex designs. The two-level complete factorial design is denoted 2^K; the two-level fractional factorial design is denoted 2^(K-J). In the design matrix the two levels are conventionally denoted as + and -. In geometric terms the runs of this design form a hypercube. For example, the runs of the 2³ factorial design can be represented geometrically as the vertices of a cube, where the center of the cube is at the point (0, 0, 0) and the coordinates of the vertices are either +1 or -1.
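A minimal Python sketch of enumerating the runs of a two-level design follows; the 2³ complete design and the generator D = ABC (taken from the complementary design example above) are used only for illustration.

from itertools import product

K = 3
full = list(product([-1, 1], repeat=K))          # the 2^K = 8 runs of the complete factorial design
half_fraction = [run + (run[0] * run[1] * run[2],) for run in full]   # 2^(4-1) fraction with D = ABC
for run in half_fraction:
    print(run)                                   # levels of A, B, C and the generated factor D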
unbalanced factorial design
Factorial design in which the number of runs per treatment is unequal. In contrast to a balanced design, the effects of factors and interactions are not orthogonal. Consequently, the analysis of an unbalanced design is more complicated than that of a balanced one.
uniform shell design : Doehlert design
Westlake design
Composite design, similar to a Hartley design, with a cube portion containing relatively few runs. The most common design sizes are (K = 5, N = 22), (K = 7, N = 40) and (K = 9, N = 62).
Youden square design
Incomplete Latin square design in which the number of levels of the two blocking variables is not equal.
► design generator [EXDE]
→ confounding
► design matrix [EXDE]
Matrix containing the combinations of factor levels (treatments) at which the experiments must be run. It is the predictor matrix in a regression context. The columns of the matrix correspond to factors, interactions and blocking variables. A matrix with orthogonal columns is usually preferred. The rows of the design matrix represent runs, usually in Yates order. A run, also called an experimental run or design point, is the set-up and operation of a system under a specific set of experimental conditions, i.e. with the factors adjusted to some specified set of levels. For example, in a chemical experiment, to do a run means to bring together in a reactor specific amounts of reactants, adjusting temperature and pressure to the desired levels and allowing the reaction to proceed for a particular time. In the regression context runs correspond to observations; in the multivariate context runs are points in the factor space. The smallest partition of the experimental material such that any two units may receive different treatments in the actual experiment, i.e. the unit on which a run is performed, is called an experimental unit or a plot. Experimental units can be material samples, animals, persons, etc.
► design point [EXDE]
→ design matrix
► design structure [EXDE]
→ design
► desirability function [QUAL]
→ multicriteria decision making
► determinant of a matrix [ALGE]
→ matrix operation
► deterministic model [MODEL]
→ model
► DFBETA [REGR]
→ influence analysis
► DFFIT [REGR]
→ influence analysis
► diagnostics [REGR]
: regression diagnostics
► diagonal element [ALGE]
→ matrix
► diagonal matrix [ALGE]
→ matrix
► diagonalization [ALGE]
→ matrix decomposition
► Dice coefficient [GEOM]
→ distance (◦ binary data)
► dichotomous classification [CLAS]
: binary classification
► dichotomous variable [PREP]
→ variable
► diffusion process [TIME]
→ stochastic process (◦ Markov process)
► digidot plot [GRAPH]
Complex graphical tool used in quality control. A stem-and-leaf diagram (the digits) is constructed simultaneously with a time series plot of the observations (the dots) as data are taken sequentially.
[Figure: digidot plot of the observations 17.1, 18.2, 16.3, 15.2, 15.2, 15.4, 16.4, 18.6, 22.3, 17.1, 15.2, 19.7, 19.5, 23.0.]
► digraph [MISC]
→ graph theory
► dimensionality [GEOM]
→ geometrical concept
► dimensionality reduction [MULT]
→ data reduction
► direct calibration [REGR]
→ calibration
► direct search optimization [OPTIM]
Optimization which, in calculating the set of parameters that minimizes a function, relies only on the values of the function calculated in the iterative process. This optimization does not explicitly evaluate any partial derivative of the function. Techniques belonging to this group of optimization are linear search optimization, simplex optimization and simulated annealing.
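As an illustration, the Nelder-Mead simplex method (one derivative-free direct search technique) can be called through scipy; the quadratic objective function and the starting point below are arbitrary choices for the example.

import numpy as np
from scipy.optimize import minimize

def f(p):                                    # function to minimize; no derivatives are used
    x, y = p
    return (x - 1.0) ** 2 + (y + 2.0) ** 2

result = minimize(f, x0=np.array([0.0, 0.0]), method='Nelder-Mead')
print(result.x)                              # approximately [1, -2]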
► directed graph [MISC]
→ graph theory
► discrete distribution [PROB]
→ random variable
► discrete variable [PREP]
→ variable
► discretization error [OPTIM]
→ numerical error
► discriminant analysis (DA) [CLAS]
Parametric classification method that derives its classification rule by assuming normal class density functions. Each class g is modeled by the class centroid c_g and the class covariance matrix S_g, both estimated from the training set. An object x is classified into the class g' with the largest posterior class probability P(g' | x), calculated according to Bayes' rule. Inserting the equation of the normal density function, the following expression has to be maximized:

P(g' | x) = max_g [ P_g |S_g|^(-1/2) exp( -0.5 (x - c_g)ᵀ S_g⁻¹ (x - c_g) ) ]

where P_g denotes the prior class probability. Equivalently, the following distance function has to be minimized:

d_g(x) = (x - c_g)ᵀ S_g⁻¹ (x - c_g) + ln |S_g| - 2 ln P_g

where the first term is the Mahalanobis distance between object x and the class centroid c_g. This rule, which defines quadratic class boundaries, is the basis of quadratic discriminant analysis (QDA). In linear discriminant analysis (LDA) it is assumed that the class covariance matrices are equal, i.e. S_g = S for all g. In that case the class boundaries are linear and linear discriminant functions can be calculated. Fisher's discriminant analysis is a binary classification subcase of LDA. The discriminant score, i.e. a one-dimensional projection, is obtained by multiplying the measurement vector x by the discriminant weights w:

s = wᵀ x

w is often called Anderson's classification function. The weights are obtained by maximizing the between- to within-class variance ratio (wᵀ B w)/(wᵀ W w), where B and W are the between-class and within-class covariance matrices. This single linear combination yields a new axis which best separates the two classes. The goodness of classification in both LDA and QDA depends (among other things) on the quality of the estimates of the class centroids and class covariance matrices. When the sample size is small compared to the dimensionality, LDA often gives better results, even in the case of somewhat different covariance matrices, due to its advantage of estimating fewer parameters. DASCO, SIMCA, RDA and UNEQ are extensions and modifications of the QDA rule, all of which try to improve class prediction using biased covariance estimates.
In stepwise linear discriminant analysis (SWLDA) the variables used in computing the linear discriminant functions are chosen in a stepwise manner based on F statistics. Both forward and backward selection are possible, similar to variable subset selection. At each step the variable that adds the most to the separation is entered, or the variable that contributes the least is removed. The optimal subset of variables is best estimated on the basis of class prediction.
► discriminant analysis with shrunken covariances (DASCO) [CLAS]
Parametric classification method, similar to quadratic discriminant analysis, that models each class by its centroid and covariance matrix and classifies the objects on the basis of their generalized Mahalanobis distance from the centroids. Instead of the usual unbiased estimates for the class covariance matrices, however, DASCO approximates them on the basis of a biased eigenvalue decomposition. The large eigenvalues, which can usually be reliably estimated, are unchanged, but the small eigenvalues, which carry most of the variance of the covariance matrix estimate, are replaced by their average. An important step in DASCO is to determine in each class the number of unaltered eigenvalues. This number is estimated on the basis of the cross-validated misclassification risk.
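A minimal Python sketch of the quadratic discriminant rule described in the discriminant analysis entry is given below; the two synthetic training classes, their means and scales, and the equal priors are assumptions of the example only.

import numpy as np

rng = np.random.default_rng(2)
X1 = rng.normal([0.0, 0.0], 0.5, (30, 2))    # training objects of class 1
X2 = rng.normal([2.0, 2.0], 0.5, (30, 2))    # training objects of class 2
priors = np.array([0.5, 0.5])                # assumed prior class probabilities

def qda_distance(x, Xg, prior):
    c = Xg.mean(axis=0)                      # class centroid
    S = np.cov(Xg, rowvar=False)             # class covariance matrix
    maha = (x - c) @ np.linalg.inv(S) @ (x - c)   # squared Mahalanobis distance to the centroid
    return maha + np.log(np.linalg.det(S)) - 2.0 * np.log(prior)

x_new = np.array([1.8, 1.9])
scores = [qda_distance(x_new, Xg, p) for Xg, p in zip((X1, X2), priors)]
print(int(np.argmin(scores)) + 1)            # class with the smallest distance function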
► discriminant function [CLAS]
: linear discriminant function
► discriminant plot [GRAPH]
→ scatter plot
► discriminant rule [CLAS]
: classification rule
► discriminant score [CLAS]
→ linear discriminant function
► discriminant weight [CLAS]
→ linear discriminant function
► discrimination [CLAS]
: classification
► discrimination power [CLAS]
→ classification rule
► dispersion [DESC] (: scale, scatter, variation)
A single value that describes the spread of a set of observations or the spread of a distribution around its location. The most commonly used dispersion measures are:
average absolute deviation : mean absolute deviation
coefficient of variation (: relative standard deviation)
Measure which is independent of the magnitude of the observations:

cv_j = s_j / x̄_j

where s_j is the standard deviation and x̄_j is the arithmetic mean of variable j.
H-spread
The difference between the two hinges, which are the two data points half-way between the two extremes and the median of the ranked data.
half-interquartile range : quartile deviation
interdecile range
Robust measure, similar to the interquartile range:

IDR = Q[0.9] - Q[0.1]

where Q[0.9] and Q[0.1] are the ninth and first deciles, respectively.
interquartile range
Robust measure:

IQR = Q3 - Q1 = Q[0.75] - Q[0.25]

where Q3 and Q1 are the upper and lower quartiles, respectively.
mean absolute deviation (MAD) (: mean deviation, average absolute deviation)
Robust measure:

MAD_j = Σ_i |x_ij - x̄_j| / n
mean deviation : mean absolute deviation
mean trigonometric deviation
Measure for cyclical data:

MTD_j = 1 - |A sin(x̄_j) + B cos(x̄_j)| / n

with

x̄_j = arctan(A / B)

where A and B are the sums of the sines and cosines of the observations x_ij and n is the number of observations.
median absolute deviation around the median (MADM)
Robust measure:

MADM_j = median_i [ |x_ij - M_j| ]

with

M_j = median_i [ x_ij ]

For a symmetric distribution MADM is asymptotically equivalent to the quartile deviation.
quartile deviation (: half-interquartile range, semi-interquartile range)
Robust measure:

QD = (Q3 - Q1) / 2

range

R_j = U_j - L_j

where U_j and L_j are the maximum (upper) and minimum (lower) values, respectively.
relative standard deviation : coefficient of variation
root mean square (RMS)

RMS_j = √[ Σ_i x_ij² / n ]

root mean square deviation (RMSD) : standard deviation
semi-interquartile range : quartile deviation
standard deviation (: root mean square deviation)
Square root of the variance:

s_j = √[ Σ_i (x_ij - x̄_j)² / (n - 1) ]

trimmed variance
Robust measure in which the sum of squared differences from the trimmed mean is calculated excluding a fraction of the objects. It involves ordering the squared differences from smallest to largest and excluding the largest ones. For example, the 10%-trimmed variance is calculated using n - 2m objects, where m = 0.1 n.
variance

s_j² = Σ_i (x_ij - x̄_j)² / (n - 1)

This is the unbiased estimate of the population variance. Sometimes the biased formula is used, in which the sum is divided by n instead of n - 1. For computational purposes the following formula is often used:

s_j² = Σ_i x_ij² / (n - 1) - (Σ_i x_ij)² / [ n (n - 1) ]

weighted variance

s_j² = Σ_i w_i (x_ij - x̄_j)² / Σ_i w_i

where w_i are observation weights and x̄_j is the weighted mean.
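Several of the dispersion measures listed above can be computed directly in Python; the data vector below is illustrative only.

import numpy as np

x = np.array([2.1, 2.5, 2.8, 3.0, 3.2, 3.9, 4.4, 9.7])    # illustrative observations

std = x.std(ddof=1)                                         # standard deviation (unbiased variance)
cv = std / x.mean()                                         # coefficient of variation
iqr = np.quantile(x, 0.75) - np.quantile(x, 0.25)           # interquartile range
mad = np.abs(x - x.mean()).mean()                           # mean absolute deviation
madm = np.median(np.abs(x - np.median(x)))                  # median absolute deviation around the median
print(std, cv, iqr, mad, madm)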
► dispersion matrix [DESC]
: scatter matrix
► display method [MULT]
→ data reduction
► dissimilarity [GEOM]
→ distance
► dissimilarity index [GEOM]
→ similarity index
► dissimilarity matrix [GEOM]
→ similarity index
► distance [GEOM]
Nonnegative number δ_st, associated with a pair of geometric objects s and t, reflecting their relative position. The pairwise distances can be arranged into a matrix, called the distance matrix, in which rows and columns correspond to objects. The distance δ may satisfy some or all of the following properties:

1.  δ_st ≥ 0
2.  δ_ss = 0
3.  δ_st = δ_ts
4.  δ_st = 0 iff s = t
5.  δ_st ≤ δ_sz + δ_zt
5a. δ_st = δ_sz + δ_zt
6.  δ_st + δ_uz ≤ max [ (δ_su + δ_tz), (δ_sz + δ_tu) ]
6a. δ_st + δ_uz = max [ (δ_su + δ_tz), (δ_sz + δ_tu) ]
7.  δ_st ≤ max [ δ_sz, δ_tz ]
8.  δ_st² = (x_s - x_t)ᵀ (x_s - x_t)
Property 5 is called the triangular inequality, property 6 is called the additive inequality or four-point condition, and property 7 is called the ultrametric inequality. If property 7 is satisfied, then properties 4, 5, 6 and 8 are also satisfied. Similarly, if property 6 is satisfied, so is property 5, and if property 8 is satisfied, so is property 4. According to which properties are satisfied, the functions are:

Name                    Properties
Proximity measure       1
Pseudo distance         1, 2, 3
Dissimilarity           1, 2, 3, 4
Metric distance         1, 2, 3, 4, 5
Additive distance       1, 2, 3, 4, 6
Ultrametric distance    1, 2, 3, 4, 7
Centroid distance       1, 2, 3, 4, 5a, 6a
Euclidean distance      1, 2, 3, 4, 8
Ultrametric distances are particularly important in partitions and hierarchies. For example, in agglomerative hierarchical clustering methods, if the pairwise distances calculated from the data matrix satisfy the ultrametric property, then complete, single and group average linkages result in the same hierarchy. Distance and related measures are widely used in multivariate data analysis, allowing the reduction of multivariate comparison between two objects to a single number. The choice of distance measure greatly influences the result of the analysis. A list of the most commonly used distances follows.
binary data
The distance between two binary variables, or between two objects described by a set of binary variables, is usually calculated from an association coefficient a_st by applying the transformation:

d_st = constant - a_st

Association measures are based on the following 2 x 2 table containing the numbers of occurrences (a, b, c, d) of the four possibilities:

                      object t = 1    object t = 0
object s = 1               a               b
object s = 0               c               d
There is an important difference between measures including the 0,0 match (d) and measures without it. Coefficients that take the d occurrences into account should be selected only if joint absence implies similarity.
o Dice coefficient (: Sorenson coefficient)

a_st = 2a / (2a + b + c)                                    range [0, 1]

o Edmonston coefficient

a_st = a / [ a + 2(b + c) ]                                 range [0, 1]

o phi coefficient

a_st = (ad - bc) / √[ (a + b)(a + c)(b + d)(c + d) ]        range [-1, 1]

o Hamann coefficient

a_st = [ (a + d) - (b + c) ] / (a + b + c + d) = 2(a + d) / (a + b + c + d) - 1        range [-1, 1]

o Hamming coefficient
Binary version of the Manhattan distance:

d_st = b + c                                                range [0, a + b + c + d]

Binary version of the Euclidean distance:

d_st = √(b + c)                                             range [0, √(a + b + c + d)]

o Jaccard coefficient

a_st = a / (a + b + c)                                      range [0, 1]

o Kulczinsky coefficient

a_st = a / (b + c)                                          range [0, ∞]

o Kulczinsky probabilistic coefficient

a_st = [ a / (a + b) + a / (a + c) ] / 2                    range [0, 1]
o lambda coefficient

a_st = [ √(ad) - √(bc) ] / [ √(ad) + √(bc) ]                range [-1, 1]

o normalized Hamming coefficient : Tanimoto coefficient
o Rogers-Tanimoto coefficient

a_st = (a + d) / [ a + d + 2(b + c) ]                       range [0, 1]

o Russell-Rao coefficient

a_st = a / (a + b + c + d)                                  range [0, 1]

o simple matching coefficient (: Sokal-Michener coefficient)

a_st = (a + d) / (a + b + c + d)                            range [0, 1]

o Sokal-Michener coefficient : simple matching coefficient
o Sokal-Sneath coefficient

a_st = 2(a + d) / [ 2(a + d) + b + c ]                      range [0, 1]

o Sorenson coefficient : Dice coefficient
o Tanimoto coefficient (: normalized Hamming coefficient)

d_st = (b + c) / (a + b + c + d)                            range [0, 1]

o Yates chi squared coefficient

a_st = (a + b + c + d) [ |ad - bc| - (a + b + c + d)/2 ]² / [ (a + b)(c + d)(a + c)(b + d) ]
       if |ad - bc| > (a + b + c + d)/2, else a_st = 0

o Yule coefficient

a_st = (ad - bc) / (ad + bc)                                range [-1, 1]
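A few of these binary association coefficients can be computed with the short Python sketch below; the two binary vectors are invented for illustration.

import numpy as np

s = np.array([1, 1, 0, 1, 0, 0, 1, 0])      # two objects described by binary variables
t = np.array([1, 0, 0, 1, 0, 1, 1, 0])

a = int(np.sum((s == 1) & (t == 1)))         # 1,1 matches
b = int(np.sum((s == 1) & (t == 0)))
c = int(np.sum((s == 0) & (t == 1)))
d = int(np.sum((s == 0) & (t == 0)))         # 0,0 matches

jaccard = a / (a + b + c)
dice = 2 * a / (2 * a + b + c)
simple_matching = (a + d) / (a + b + c + d)
tanimoto_distance = (b + c) / (a + b + c + d)   # normalized Hamming, a dissimilarity
print(jaccard, dice, simple_matching, tanimoto_distance)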
nominal data
When the objects contain nominal variables, the distance between them is calculated after transforming the nominal variables into binary variables. Each nominal variable that may assume k different values is substituted by k binary variables; all are set to zero except the one corresponding to the actual value of the nominal variable.
ranked data
The most common measure of distance between ranked variables is the rank correlation coefficient, e.g. Spearman's rho coefficient or Kendall's tau coefficient. To measure the distance between objects of ranked data the following quantities are the most popular, calculated from the ranks r_sj and r_tj of objects s and t on variable j:
o asymptotic Mahalanobis distance : Bhattacharyya distance
o Bhattacharyya distance (: asymptotic Mahalanobis distance)
o Calkoun distance
Distance based on ranking the n objects for each variable and counting the number of objects between objects s and t:

d_st = 6 n1 + 3 n2 + 2 n3

where n1 is the number of objects that fall between objects s and t on at least one variable; n2 is the number of objects that are not in n1 but have tied values on at least one variable with either object s or t; n3 is the number of objects that are neither in n1 nor in n2 but have tied values on at least one variable with both objects s and t. The normalized Calkoun distance is defined as:

d_st = (6 n1 + 3 n2 + 2 n3) / [ 6 (n - 2) ]

o Mahalanobis-like distance
o Rajski's distance
Measure of distance between variables; it can also be calculated for objects:

d_st = 1 - H(s; t) / H(s, t)

where H(s; t) is the mutual entropy and H(s, t) is the joint entropy of the objects (or the variables) s and t. If the entropy is calculated over the relative frequencies f of the objects in the data set, Rajski's distance can also be used for quantitative data. Rajski's coherence coefficient is a similarity index calculated from Rajski's distance as:

c_st = √(1 - d_st²)

o rank distance
Distance between objects calculated from their ranks, involving the variance s_j² of variable j. The variance of a ranked variable j with n observations and k = 1, ..., q groups of ties of size t_k is defined as:

s_j² = [ (n³ - n) - Σ_k (t_k³ - t_k) ] / (12 n)
quantitative data
o average Euclidean distance

d_st = √[ Σ_j (x_sj - x_tj)² / p ]

o average Manhattan distance : mean character difference
o Bray-Curtis coefficient : Lance-Williams distance
o Canberra distance

d_st = Σ_j |x_sj - x_tj| / (x_sj + x_tj)

o Chebyshev distance (: Lagrange distance)

d_st = max_j |x_sj - x_tj|

o chi squared distance
o city-block distance : Manhattan distance
o cosine coefficient
o divergence coefficient
o generalized distance

d_st = √[ (x_s - x_t)ᵀ M (x_s - x_t) ]

where M is a symmetric positive definite matrix, i.e. xᵀ M x ≥ 0 for all x. When M is the inverse of the covariance matrix this distance equals the Mahalanobis distance, and when M = I (the identity matrix) this distance equals the Euclidean distance.
o Lagrange distance : Chebyshev distance
o Lance-Williams distance (: Bray-Curtis coefficient)

d_st = Σ_j |x_sj - x_tj| / Σ_j (x_sj + x_tj)

o Mahalanobis distance

d_st = √[ (x_s - x_t)ᵀ S⁻¹ (x_s - x_t) ]

where S⁻¹ is the inverse of the covariance matrix.
o Manhattan distance (: city-block distance, taxi distance)

d_st = Σ_j |x_sj - x_tj|
o mean character difference (: average Manhattan distance)

d_st = Σ_j |x_sj - x_tj| / p

o Minkowski distance

d_st = [ Σ_j |x_sj - x_tj|^r ]^(1/r)        (r ≥ 1)

o Pearson distance

d_st = √[ Σ_j (x_sj - x_tj)² / s_j² ]

where s_j is the standard deviation of variable j.
o size difference coefficient
o shape difference coefficient
Coefficient derived from the Euclidean distance and the size difference coefficient.
o taxi distance : Manhattan distance
mixed type data
When the objects contain binary, nominal, ordinal and quantitative variables, the distance between them is calculated as:

d_st = w1 d¹_st + w2 d²_st + w3 d³_st + w4 d⁴_st

where d¹, d², d³ and d⁴ are distances calculated separately for the binary, nominal, ordinal and quantitative variables, respectively, and w1, w2, w3, w4 are weighting coefficients in the aggregate distance measure.
o Gower coefficient
A general similarity measure which includes several other ones as special cases and can be used for different types of variables. The Gower coefficient is defined as:
s_st = Σ_j δ_stj s_stj / Σ_j δ_stj

where s_stj is the similarity index between objects s and t calculated for the jth variable, and δ_stj is a comparison flag, equal to 1 when variable j can be compared for s and t, and 0 otherwise. When δ_stj is zero for all the variables, s_st is undefined; when the comparison is allowed on all variables, i.e. all δ_stj are 1, the denominator is equal to p. For the different kinds of variables, s_stj is defined as:
- for binary variables:

Values of the jth variable    object s:   1   1   0   0
                              object t:   1   0   1   0
similarity s_stj                          1   0   0   0
validity δ_stj                            1   1   1   0

- for nominal variables: s_stj = 1 if the two objects agree in the jth variable, 0 otherwise;
- for quantitative variables:

s_stj = 1 - |x_sj - x_tj| / (U_j - L_j)

where U_j and L_j are the upper and lower values of the jth variable in the population or in the sample.
o Parks distance
A mixed-type distance defined for binary, nominal, ordinal and quantitative variables, in which all variables are scaled to the range [0, 1] as follows: binary variables are unchanged; nominal variables are transformed into binary variables; an ordinal variable taking on the values 1, ..., k is scaled as x' = x/k; a quantitative variable with range U - L is scaled as x' = (x - L)/(U - L). The Parks distance is then defined on the scaled variables.
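For illustration, a few of the quantitative distance measures listed above can be computed with the Python sketch below; the small data matrix is invented for the example.

import numpy as np

X = np.array([[1.0, 2.0, 0.5],
              [1.5, 1.0, 0.9],
              [0.2, 2.5, 1.8],
              [2.0, 0.5, 0.2],
              [0.8, 1.7, 1.1]])                # illustrative data matrix (objects x variables)
xs, xt = X[0], X[1]

euclidean = np.sqrt(np.sum((xs - xt) ** 2))
manhattan = np.sum(np.abs(xs - xt))
S_inv = np.linalg.inv(np.cov(X, rowvar=False))             # inverse covariance matrix of the data
mahalanobis = np.sqrt((xs - xt) @ S_inv @ (xs - xt))
print(euclidean, manhattan, mahalanobis)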
► distance matrix [GEOM]
→ distance
► distance-based optimal design [EXDE]
→ design
► distribution [PROB] (: statistical distribution)
The most important characteristic of a random variable is its distribution function. The number of such functions is quite large; in practice, however, a relatively small number of them is used. The most important distribution is the normal distribution. Its crucial role is stated in the central limit theorem; many statistical methods are based on the normality assumption. I.i.d. and homoscedastic distributions are also of special interest. Below is a list of the most popular distribution functions in terms of their cumulative distribution function F(x), probability density function or probability function f(x), survival function S(x), hazard function h(x), range, mean and variance. Two functions are used in the formulas of some of the following distributions. The gamma function is:

Γ(z) = ∫₀^∞ t^(z-1) e^(-t) dt        z > 0

and the beta function is:

B(z, w) = ∫₀¹ t^(z-1) (1 - t)^(w-1) dt        z > 0,  w > 0

The two functions are related as:

B(z, w) = Γ(z) Γ(w) / Γ(z + w)

Bernoulli distribution
Discrete distribution with parameter p, the probability of success in one trial.
F(0) = 1 - p        F(1) = 1
f(0) = 1 - p        f(1) = p
Range: x ∈ {0, 1}        0 < p < 1
Mean: p
Variance: p(1 - p)

beta distribution
Continuous distribution with shape parameters a and b.
f(x) = x^(a-1) (1 - x)^(b-1) / B(a, b)
where B is the beta function.
Range: 0 ≤ x ≤ 1        a > 0        b > 0
Mean: a / (a + b)
Variance: a b / [ (a + b)² (a + b + 1) ]
binomial distribution
Discrete distribution with parameters n, the number of trials, and p, the probability of success.
F(x) = Σ_{i ≤ x} C(n, i) p^i (1 - p)^(n-i)
f(x) = C(n, x) p^x (1 - p)^(n-x)
Range: 0 ≤ x ≤ n        0 ≤ p ≤ 1
Mean: n p
Variance: n p (1 - p)

Cauchy distribution
Continuous distribution with location parameter a and scale parameter b.
F(x) = 0.5 + (1/π) tan⁻¹[ (x - a) / b ]
f(x) = { π b [ 1 + ((x - a)/b)² ] }⁻¹
Range: -∞ < x < ∞        b > 0
Mean = Median = Mode: a
Interquartile range: 2b

chi squared distribution
Continuous distribution with shape parameter n, the degrees of freedom.
f(x) = x^(n/2 - 1) e^(-x/2) / [ 2^(n/2) Γ(n/2) ]
where Γ is the gamma function.
Range: 0 ≤ x < ∞        0 < n < ∞
Mean: n
Variance: 2n

double exponential distribution : Laplace distribution
error distribution
Continuous distribution with location parameter a, scale parameter b and shape parameter c; its probability density function is expressed in terms of the gamma function Γ.
Range: -∞ < x < ∞        -∞ < a < ∞        b > 0        c > 0
Mean = Median = Mode: a

exponential distribution (: negative exponential distribution)
Continuous distribution with scale parameter a.
F(x) = 1 - exp[-x/a]
f(x) = (1/a) exp[-x/a]
S(x) = exp[-x/a]
h(x) = 1/a
Range: 0 ≤ x < ∞        a > 0
Mean: a
Variance: a²

F distribution (: variance ratio distribution, Fisher's distribution)
Continuous distribution with shape parameters n and m, the degrees of freedom.
f(x) = Γ[0.5(n + m)] (n/m)^(n/2) x^((n-2)/2) / { Γ(n/2) Γ(m/2) [ 1 + (n/m) x ]^((n+m)/2) }
where Γ is the gamma function.
Range: 0 ≤ x < ∞        n > 0        m > 0
Mean: m / (m - 2)   for m > 2
Variance: 2 m² (n + m - 2) / [ n (m - 2)² (m - 4) ]   for m > 4

Fisher's distribution : F distribution
gamma distribution
Continuous distribution with scale parameter a and shape parameter b.
f(x) = x^(b-1) exp[-x/a] / [ a^b Γ(b) ]
where Γ is the gamma function.
Range: 0 ≤ x < ∞        a > 0        b > 0
Mean: a b
Variance: a² b
Gaussian distribution : normal distribution
geometric distribution
Discrete distribution with parameters p, the probability of success, and n, the number of trials.
F(x) = 1 - (1 - p)^(n+1)
f(x) = p (1 - p)^n
Range: n ≥ 0        0 < p < 1
Mean: (1 - p) / p
Variance: (1 - p) / p²

hypergeometric distribution
Discrete distribution with parameters N, the population size, n, the sample size, and a, the number of good items in the population.
f(x) = C(a, x) C(N - a, n - x) / C(N, n)
Range: max[0, n - N + a] ≤ x ≤ min[a, n]        N > 0        n > 0        a > 0
Mean: a n / N
Variance: (a n / N)(1 - a/N)(N - n) / (N - 1)

Laplace distribution (: double exponential distribution)
Continuous distribution with location parameter a and scale parameter b.
F(x) = 1 - 0.5 exp[-(x - a)/b]   if x ≥ a
F(x) = 0.5 exp[(x - a)/b]   if x < a
f(x) = exp[-|x - a|/b] / (2b)
Range: -∞ < x < ∞        -∞ < a < ∞        b > 0
Mean: a
Variance: 2 b²

logistic distribution
Continuous distribution with location parameter a and scale parameter b.
F(x) = 1 - {1 + exp[(x - a)/b]}⁻¹ = {1 + exp[-(x - a)/b]}⁻¹
f(x) = exp[-(x - a)/b] b⁻¹ {1 + exp[-(x - a)/b]}⁻² = exp[(x - a)/b] b⁻¹ {1 + exp[(x - a)/b]}⁻²
S(x) = {1 + exp[(x - a)/b]}⁻¹
h(x) = { b (1 + exp[-(x - a)/b]) }⁻¹
Range: -∞ < x < ∞        -∞ < a < ∞        b > 0
Mean: a
Variance: (π b)² / 3

lognormal distribution
Continuous distribution with scale parameter μ and shape parameter σ.
f(x) = exp[ -(log x - μ)² / (2σ²) ] / ( x σ √(2π) )
Range: 0 ≤ x < ∞        -∞ < μ < ∞        σ > 0
Mean: exp(μ + σ²/2)
Variance: exp(2μ + σ²) [ exp(σ²) - 1 ]

multinomial distribution
Multivariate generalization of the discrete binomial distribution with parameters n, the number of trials, and p_j, the probability of success in variate j.

multivariate normal distribution
Multivariate generalization of the continuous normal distribution with location parameter vector μ and scale parameter matrix Σ.
f(x) = (2π)^(-p/2) |Σ|^(-1/2) exp[ -0.5 (x - μ)ᵀ Σ⁻¹ (x - μ) ]
Range: -∞ < x_j < ∞        j = 1, p
Mean: μ_j
Variance-covariance: Σ
negative binomial distribution (: Pascal distribution)
Discrete distribution with parameters y, an integer, and p, the probability of success.
f(x) = C(y + x - 1, x) p^y (1 - p)^x
Range: 0 ≤ x < ∞        0 < p < 1        y > 0
Mean: y (1 - p) / p
Variance: y (1 - p) / p²

negative exponential distribution : exponential distribution
normal distribution (: Gaussian distribution)
Continuous distribution with location parameter μ and scale parameter σ.
f(x) = exp[ -(x - μ)² / (2σ²) ] / ( σ √(2π) )
Range: -∞ < x < ∞        -∞ < μ < ∞        σ > 0
Mean: μ
Variance: σ²

Pareto distribution
Continuous distribution with location parameter a and shape parameter c.
F(x) = 1 - (a/x)^c
f(x) = c a^c / x^(c+1)
S(x) = (a/x)^c
h(x) = c / x
Range: a ≤ x < ∞        a > 0        c > 0
Mean: c a / (c - 1)   for c > 1
Variance: c a² / [ (c - 1)² (c - 2) ]   for c > 2

Pascal distribution : negative binomial distribution
Poisson distribution
Discrete distribution with location parameter λ.
f(x) = λ^x exp[-λ] / x!
Range: 0 ≤ x < ∞        λ > 0
Mean: λ
Variance: λ

power function distribution
Continuous distribution with scale parameter b and shape parameter c.
F(x) = (x/b)^c
f(x) = c x^(c-1) / b^c
h(x) = c x^(c-1) / (b^c - x^c)
Range: 0 ≤ x ≤ b        b > 0        c > 0
Mean: b c / (c + 1)
Variance: b² c / [ (c + 2)(c + 1)² ]

Rayleigh distribution
Continuous distribution with scale parameter b.
F(x) = 1 - exp[ -x² / (2b²) ]
f(x) = (x/b²) exp[ -x² / (2b²) ]
h(x) = x / b²
Range: 0 ≤ x < ∞        b > 0
Mean: b √(π/2)
Variance: (2 - π/2) b²

rectangular distribution : uniform distribution
Student's t distribution : t distribution
t distribution (: Student's t distribution)
Continuous distribution with shape parameter u, the degrees of freedom.
f(x) = Γ[(u + 1)/2] / [ √(uπ) Γ(u/2) ] (1 + x²/u)^(-(u+1)/2)
where Γ is the gamma function.
Range: -∞ < x < ∞        u > 0
Mean = Mode: 0   for u > 1
Variance: u / (u - 2)   for u > 2

triangular distribution
Continuous distribution with location parameters a, b and shape parameter c.
F(x) = (x - a)² / [ (b - a)(c - a) ]   for a ≤ x ≤ c
F(x) = 1 - (b - x)² / [ (b - a)(b - c) ]   for c ≤ x ≤ b
f(x) = 2(x - a) / [ (b - a)(c - a) ]   for a ≤ x ≤ c
f(x) = 2(b - x) / [ (b - a)(b - c) ]   for c ≤ x ≤ b
Range: a ≤ x ≤ b
Mean: (a + b + c) / 3
Variance: (a² + b² + c² - ab - ac - bc) / 18

uniform distribution (: rectangular distribution)
Continuous distribution with location parameters a and b.
F(x) = (x - a) / (b - a)
f(x) = 1 / (b - a)
h(x) = 1 / (b - x)
Range: a ≤ x ≤ b
Mean: (a + b) / 2
Variance: (b - a)² / 12
Discrete distribution with parameter n.
f(x) = 1 / (n + 1)
S(x) = (n - x) / (n + 1)
h(x) = 1 / (n - x)
Range: 0 ≤ x ≤ n
Mean: n / 2
Variance: n (n + 2) / 12
variance ratio distribution : F distribution
Weibull distribution
Continuous distribution with scale parameter b and shape parameter c.
F(x) = 1 - exp[ -(x/b)^c ]
f(x) = (c x^(c-1) / b^c) exp[ -(x/b)^c ]
Range: 0 ≤ x < ∞        b > 0        c > 0
Mean: b Γ[(c + 1)/c]
Variance: b² ( Γ[(c + 2)/c] - { Γ[(c + 1)/c] }² )
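Several of these distributions can be evaluated numerically and their stated means and variances checked with scipy; the parameter values below are arbitrary, and scipy's parameterizations (expon with scale a, chi2 with df, weibull_min with shape c and scale b) are assumed to match the ones listed above.

from scipy import stats

a = 2.0                                         # exponential scale parameter
exp_dist = stats.expon(scale=a)
print(exp_dist.mean(), exp_dist.var())          # a and a**2

n_df = 5.0                                      # chi squared degrees of freedom
chi2_dist = stats.chi2(df=n_df)
print(chi2_dist.mean(), chi2_dist.var())        # n and 2n

b, c = 1.5, 2.0                                 # Weibull scale and shape parameters
weibull_dist = stats.weibull_min(c=c, scale=b)
print(weibull_dist.mean(), weibull_dist.var())  # b*Gamma((c+1)/c) and b**2*(Gamma((c+2)/c) - Gamma((c+1)/c)**2)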
► distribution function (df) [PROB]
→ random variable
► distribution-free estimator [ESTIM]
→ estimator
► distribution-free test [TEST]
→ hypothesis testing
► divergence coefficient [GEOM]
→ distance (◦ quantitative data)
► divisive clustering [CLUS]
→ hierarchical clustering
► Dodge-Romig tables [QUAL]
Tables of standards for acceptance sampling plans for attributes. They are either designed for lot tolerance percent defective protection or provide a specified average outgoing quality limit. In both cases there are tables for single sampling and double sampling. These tables apply only in the case of rectifying inspection, i.e. if the rejected lots are submitted to 100% inspection.
► Doehlert design [EXDE]
→ design
► dot plot [GRAPH]
Graphical display of the frequency distribution of a continuous variable on a one-dimensional scatter plot. The data points are plotted along a straight line parallel to and above the horizontal axis. Stacking or jitter can be applied to better display overlapping points.
[Figure: two dot plots of the same data, one drawn with jitter and one with stacking.]
Stacking means that data points of equal value are plotted above each other on a vertical line orthogonal to the main line of the plot. Jitter is a technique used to separate overlapping points on a scatter plot. Instead of positioning the overlapping points exactly at the same location, their coordinate values are slightly perturbed so that each point appears separately on the plot. On a one-dimensional scatter plot, rather than plotting the points on a horizontal line, they are displayed on a strip above the axis. The width of the strip is kept small compared to the range of the horizontal axis. The vertical position of a point within a strip is random. Although a scatter plot with jitter loses accuracy, its information content is enhanced.
► double cross-validation (dcv) [FACT]
→ rank analysis
► double exponential distribution [PROB]
→ distribution
► double sampling [QUAL]
→ acceptance sampling
► double tail test [TEST]
→ hypothesis testing
► draftsman's plot [GRAPH] (: scatterplot matrix, matrix plot)
Graphical representation of multivariate data: several two-dimensional scatter plots arranged in such a way that adjacent plots share a common axis. A p-dimensional data set is represented by p(p - 1)/2 scatter plots.
[Figure: draftsman's plot (scatterplot matrix) of four variables x1, x2, x3 and x4.]
In this array of pairwise scatter plots the plots of the same row have a common vertical axis, while the plots of the same column have a common horizontal axis. Corresponding data points can be connected by highlighting or coloring.
► dummy variable [PREP]
→ variable
► Duncan's test [TEST]
→ hypothesis test
► Dunnett's test [TEST]
→ hypothesis test
► Dunn's partition coefficient [CLUS]
→ fuzzy clustering
► Dunn's test [TEST]
→ hypothesis test
► Durbin-Watson's test [TEST]
→ hypothesis test
► Dwass-Steel's test [TEST]
→ hypothesis test
► dynamic linear model [TIME]
: state-space model
► E-M algorithm [ESTIM]
Iterative procedure, first introduced in the context of dealing with missing data, but also widely applicable in other maximum likelihood estimation problems. In missing data estimation it is based on the realization that if the missing data had been observed, simple sufficient statistics for the parameters would be used for straightforward maximum likelihood estimation; on the other hand, if the parameters of the model were known, then the missing data could be estimated given the observed data. The algorithm alternates between two procedures. In the E step (which stands for Expectation) the current parameter values are used to estimate the missing values. In the M step (which stands for Maximization) new maximum likelihood parameter estimates are obtained from the observed data and the current estimate of the missing data. This sequence of alternating steps converges to a local maximum of the likelihood function.
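A small Python sketch of the alternating E and M steps follows; it illustrates the algorithm on a two-component normal mixture, where the unobserved component labels play the role of the missing data. The synthetic sample, the starting values and the fixed number of iterations are assumptions of the example.

import numpy as np

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(4.0, 1.0, 200)])  # synthetic sample

w, mu1, mu2, s1, s2 = 0.5, -1.0, 5.0, 1.0, 1.0       # initial parameter guesses

def norm_pdf(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

for _ in range(50):
    # E step: expected membership of each observation in component 1, given current parameters
    p1 = w * norm_pdf(x, mu1, s1)
    p2 = (1 - w) * norm_pdf(x, mu2, s2)
    r = p1 / (p1 + p2)
    # M step: maximum likelihood estimates given the current memberships
    w = r.mean()
    mu1 = np.sum(r * x) / np.sum(r)
    mu2 = np.sum((1 - r) * x) / np.sum(1 - r)
    s1 = np.sqrt(np.sum(r * (x - mu1) ** 2) / np.sum(r))
    s2 = np.sqrt(np.sum((1 - r) * (x - mu2) ** 2) / np.sum(1 - r))

print(w, mu1, mu2, s1, s2)   # converges towards 0.5, 0, 4, 1, 1 for these data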
► E-optimal design [EXDE]
→ design (◦ optimal design)
► edge [MISC]
→ graph theory
► Edmonston coefficient [GEOM]
→ distance (◦ binary data)
► Edwards and Cavalli-Sforza clustering [CLUS]
→ hierarchical clustering (◦ divisive clustering)
► effect [ANOVA]
→ term in ANOVA
► effect variable [PREP]
→ variable
► efficiency [ESTIM]
Measure of the relative goodness of an estimator, defined via its variance. This measure provides a basis for comparing several potential estimators of the same parameter. The ratio of the variances of two estimators of the same quantity, with the smaller variance in the numerator, is called the relative efficiency of the estimator with the larger variance. It also gives the ratio of the sample sizes required for the two statistics to do the same job. For example, in a normal sample the sample median has variance of approximately (π/2)σ²/n and the sample mean has variance σ²/n; therefore the median has relative efficiency 2/π ≈ 0.64. In other words, the median estimates the location of a sample of 100 objects about as well as the mean estimates it for a sample of 64. The efficiency of an estimator as the sample size tends to infinity is called asymptotic efficiency.
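The relative efficiency 2/π of the median can be checked by a quick Monte Carlo sketch in Python; the sample size and the number of trials are arbitrary choices for the illustration.

import numpy as np

rng = np.random.default_rng(4)
n, trials = 100, 20000
samples = rng.normal(0.0, 1.0, (trials, n))     # repeated normal samples of size n

var_mean = samples.mean(axis=1).var()           # variance of the sample mean over the trials
var_median = np.median(samples, axis=1).var()   # variance of the sample median over the trials
print(var_mean / var_median)                    # close to 2/pi = 0.64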
► eigenanalysis [ALGE] Analysis of a square matrix X(p, p) in terms of its eigenvalues and eigenvectors. The result is the solution of the system of equations:

X v_j = λ_j v_j    j = 1, p

The vector v is called the eigenvector (or characteristic vector, or latent vector); the scalar λ is called the eigenvalue (or characteristic root, or latent root). Each eigenvector defines a one-dimensional subspace that is invariant to premultiplication by X. The eigenvalues are the roots of the characteristic polynomial defined as:

P(λ) = |X - λ I|

The p roots λ_j are also called the spectrum of the matrix X and are often arranged in a diagonal matrix Λ. The decomposition

X = V Λ V⁻¹

where V is the matrix whose columns are the eigenvectors, is called the eigendecomposition or spectral decomposition. Usually, one faces a symmetric eigenproblem, i.e. finding the eigenvalues and eigenvectors of a symmetric square matrix; most of the time the matrix is also positive semi-definite. This case has substantial computational advantages. The asymmetric eigenproblem is more general and more involved computationally. The following algorithms are the most common for solving an eigenproblem.
Jacobi method
An old, less commonly used method for diagonalizing a symmetric matrix X. Let S(X) denote the sum of squares of the off-diagonal elements of X, so that X is diagonal if S(X) = 0. For any orthogonal matrix Q, Qᵀ X Q has the same eigenvalues as X. In each step the Jacobi method finds an orthogonal matrix Q for which

S(Qᵀ X Q) < S(X)

power method
Iterative method that computes the eigenvectors one at a time as the limit of the sequence

v_j^i = X v_j^{i-1} / ||X v_j^{i-1}||    i ≥ 1    j = 1, p

where i is the index of iteration, X(p, p) is a symmetric real matrix, and v starts as an arbitrary vector (nonorthogonal to the eigenvector being computed) and converges to the jth eigenvector. Nonlinear iterative partial least squares (NIPALS) is the most popular power method. The algorithm starts with the data matrix X(n, p) and calculates the eigenvector v_1 and the principal component scores t_1 corresponding to the largest eigenvalue in a criss-cross manner. The ith iteration step determines v_1 and t_1 as:

v_1^i = Xᵀ t_1^{i-1} / ||Xᵀ t_1^{i-1}||
t_1^i = X v_1^i    with the constraint ||v_1^i|| = 1

Once the iteration converges, i.e. ||t_1^i - t_1^{i-1}|| < ε, the first component is subtracted as

X = X - t_1 v_1ᵀ

and the calculation continues with the second component.
singular value decomposition (SVD)
The numerically most stable matrix decomposition method. X is first transformed into an upper bidiagonal matrix, then the superdiagonal elements are systematically reduced by an implicit form of the Q-R decomposition.
diagonalization of a tridiagonal matrix
Calculates the eigenvectors via Q-R decomposition of a tridiagonal matrix. The symmetric matrix X is first reduced to a tridiagonal form Uᵀ X U, where U is an orthogonal Householder matrix, then diagonalized by the Q-R algorithm.
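A minimal Python sketch of the NIPALS-type power iteration described above, extracting the first two principal components of a small random data matrix; the function name, tolerances and data are illustrative, not part of the handbook:

import numpy as np

def nipals(X, n_components=2, tol=1e-10, max_iter=500):
    X = X - X.mean(axis=0)                 # column-centred data matrix (n, p)
    scores, loadings = [], []
    for _ in range(n_components):
        t = X[:, 0].copy()                 # starting score vector
        for _ in range(max_iter):
            v = X.T @ t
            v /= np.linalg.norm(v)         # constraint ||v|| = 1
            t_new = X @ v
            if np.linalg.norm(t_new - t) < tol:
                t = t_new
                break
            t = t_new
        scores.append(t)
        loadings.append(v)
        X = X - np.outer(t, v)             # deflate: subtract the extracted component
    return np.array(scores).T, np.array(loadings).T

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))
T, V = nipals(X)
print(np.round(V.T @ V, 6))                # the loading vectors are orthonormal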
► eigendecomposition [ALGE] → matrix decomposition

► eigenvalue [ALGE] → eigenanalysis

► eigenvalue-one criterion [FACT] → rank analysis

► eigenvalue threshold criterion [FACT] → rank analysis

► eigenvector [ALGE] → eigenanalysis

► elementary operation [ALGE] (: Gauss transformation) The following three operations on a square matrix:
- permutation of rows/columns;
- addition of any multiple of a row/column to another row/column;
- multiplication of the elements in a row/column by a nonzero constant.
The first two operations do not alter the determinant of the matrix (except possibly its sign), while the third operation multiplies the determinant by the same constant applied to the row/column. None of these operations causes the determinant of any submatrix to change from nonzero to zero, consequently they do not change the rank of the matrix. Elementary operations can be performed by pre-multiplication with a suitable matrix obtained by performing the required operation on the identity matrix. These matrix operations are the building blocks of Gaussian elimination.
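The statement that an elementary operation can be performed by pre-multiplication with a modified identity matrix is easy to verify numerically. A small Python illustration with an invented matrix:

import numpy as np

X = np.array([[2., 1., 0.],
              [4., 3., 1.],
              [1., 5., 2.]])

# Elementary matrix: add -2 times row 0 to row 1 (performed on the identity).
E = np.eye(3)
E[1, 0] = -2.0

print(E @ X)                      # same result as doing the row operation on X directly
print(np.linalg.det(X), np.linalg.det(E @ X))   # the determinant is unchanged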
► elimination [ALGE] : Gaussian elimination

► empirical distribution [PROB] → random variable

► empirical influence curve [ESTIM] → influence curve

► endogenous variable [PREP] → variable

► entropy [MISC] → information theory

► Epanechnikov kernel [ESTIM] → kernel

► equimax rotation [FACT] → factor rotation

► equimin rotation [FACT] → factor rotation

► equiradial design [EXDE] → design

► ergodic process [TIME] → stochastic process

► error [ESTIM] A term used in several ways in statistics, e.g. model error, numerical error, error of the first kind, etc. In a strict sense error is the difference between a measured value and a true or expected value. The absolute error is the absolute deviation of an observation from the true value.

► error degrees of freedom [MODEL] → model degrees of freedom

► error distribution [PROB] → distribution

► error mean square (EMS) [MODEL] → goodness of fit

► error of the first kind [TEST] → hypothesis testing

► error of the second kind [TEST] → hypothesis testing

► error propagation [OPTIM] → numerical error

► error rate (ER) [CLAS] → classification

► error standard deviation [MODEL] → goodness of fit
► error sum of squares [MODEL] → goodness of fit

► error sum of squares linkage [CLUS] → hierarchical clustering (: agglomerative clustering)

► error term [MODEL] : model error
► error terms in factor analysis [FACT] The data matrix X, factored in principal component analysis, is composed of a pure data matrix (without experimental error) X* and an error matrix E:

X = X* + E

This real error (RE) has two components: the extracted error (XE), also called residual error, that can be removed by factor analysis, and the imbedded error (IE), that cannot be eliminated:

RE = XE + IE

Although the dimensionalities of X, X* and E are the same, i.e. (n, p), their ranks are different. X* has rank M (the number of factors), while E, and consequently X, have rank min(n, p). This rank inflation of X causes difficulty in determining the number of factors M. The principal component factor model can be written as:

X = Σ_{m=1,M} t_m v_mᵀ + Σ_{m=M+1,p} t_m v_mᵀ

The second term contains pure error that can be extracted from the model (hence the name extracted error), while the first term, the reproduced data matrix, also includes error which remains inseparably imbedded in the factor model (hence the name imbedded error). The reason why the real error cannot be totally eliminated from the data matrix X is that the pure data matrix X* is not orthogonal to E. However, the reproduced data matrix X̂ containing the imbedded error is orthogonal to the extracted error XE. There exists a Pythagorean relationship between RE, XE and IE as follows:

RE² = XE² + IE²
RE = X - X*
XE = X - X̂
IE = X̂ - X*

These errors can be expressed as a function of the residual standard deviation (RSD) as:

RE = RSD
IE = RSD √(M/p)
XE = RSD √((p - M)/p)

with

RSD = √( Σ_{m=M+1,p} λ_m / (n(p - M)) )

Another related quantity is the root mean square error (RMSE):

RMSE = √( Σ_{m=M+1,p} λ_m / (np) )

RMSE is equal to the extracted error, RMSE = XE, and is related to RSD through the expression for XE given above.
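A small numerical sketch (not from the handbook) of the error quantities defined above, computed from the eigenvalues of a simulated noisy data matrix; the matrix, the noise level and the number of factors M are invented:

import numpy as np

rng = np.random.default_rng(4)
n, p, M = 50, 8, 3

# Pure data matrix of rank M plus experimental error.
X_star = rng.normal(size=(n, M)) @ rng.normal(size=(M, p))
X = X_star + rng.normal(scale=0.1, size=(n, p))

eigenvalues = np.linalg.eigvalsh(X.T @ X)[::-1]      # descending order
residual = eigenvalues[M:].sum()

RSD = np.sqrt(residual / (n * (p - M)))
RE = RSD
IE = RSD * np.sqrt(M / p)
XE = RSD * np.sqrt((p - M) / p)
RMSE = np.sqrt(residual / (n * p))

print(RE, IE, XE, RMSE)          # note RE**2 == XE**2 + IE**2 and RMSE == XE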
► error variance [MODEL] → goodness of fit

► estimate [ESTIM] → estimator

► estimation [ESTIM] One of the most important problems in statistics. Parameter estimation calculates the value of one or more parameters as a function of the data; for example, estimation of mean, variance, regression coefficients, etc. The goal of function estimation is to approximate an underlying function from the data. Smoothers, splines, and kernels are important tools in function estimation. Examples are: potential function classifier, K-NN, ACE, MARS, nonlinear PLS, etc. If the data set is considered as a sample from a population, then the calculated parameter value or function is taken as an estimate of the population parameter or function. An estimator is characterized by its bias, breakdown point, confidence interval, degrees of freedom, efficiency, influence curve, mean square error, and scale invariance.

► estimator [ESTIM] Specific rule or method by which a parameter is calculated. It is a random variable defined as a function of the sample values. A particular value of this function is called an estimate or statistic. Its reliability depends on the probability distribution of the corresponding estimator. Below is a list of various estimators:
admissible estimator
Estimator for which, within the same class of estimators, no other estimator exists that performs uniformly at least as well as the estimator in question and performs better in at least one case.
asymptotically efficient estimator
Estimator that becomes efficient as the sample size tends to infinity.
asymptotically unbiased estimator
Estimator that becomes unbiased as the sample size tends to infinity.
Bayesian estimator
Estimator of the posterior probability that incorporates the prior probability via Bayes' theorem.
best linear unbiased estimator (BLUE)
The minimum variance estimator among all linear and unbiased estimators. For example, according to the Gauss-Markov theorem, the least squares estimator is BLUE.
biased estimator
Estimator in which the expected estimated value is different from the true value of the parameter:

B = E[θ̂] - θ ≠ 0

Examples are: RDA, ridge regression, PCR, PLS, and the estimator of the population variance as

s² = (1/n) Σ_{i=1,n} (x_i - x̄)²
consistent estimator
Estimator that converges in probability to the true value of the parameter being estimated as the sample size increases.
density estimator
Nonparametric estimator for calculating an estimate of a density function, both univariate and multivariate. Such an estimator, for example, underlies several classification and clustering methods. The most popular ones are:
o adaptive kernel density estimator
Kernel density estimator in which the bandwidth of the individual kernels may vary from point to point:

f̂(x) = (1/n) Σ_{i=1,n} 1/(h λ_i) K((x - x_i)/(h λ_i))

where K is a kernel function, h is the bandwidth, and λ_i is a local bandwidth factor. The smoothness of the estimate is controlled by the varying bandwidth.
o histogram density estimator
The most widely used density estimator, based on counts per bin:

f̂(x) = (1/(nh)) [number of x_i in the same bin as x]

where h is the binwidth, which controls the smoothness of the estimate.
o kernel density estimator
Density estimator that calculates the overall density as the sum of individual density functions, called kernels, placed at each point:

f̂(x) = (1/(nh)) Σ_{i=1,n} K((x - x_i)/h)

where h is the bandwidth (or smoothing parameter or window width), and K is a kernel function. The smoothness of the estimate is controlled by the bandwidth parameter. The multivariate kernel density estimator for p-dimensional x_i is

f̂(x) = (1/(n h^p)) Σ_{i=1,n} K((x - x_i)/h)

(A small numerical sketch is given after this entry.)
o nearest neighbor density estimator
Density estimator that is inversely proportional to the size of the neighborhood needed to contain k points:

f̂(x) = k / (2 n d_k(x))

where k is the number of nearest neighbors considered and d_k(x) is the distance to the kth nearest neighbor. The smoothness of the estimator is controlled by the parameter k.
distribution-free estimator : nonparametric estimator
efficient estimator
Estimator that, compared to other similar estimators, has small variance. The efficiency of an estimator is measured by its variance. The most efficient estimator, with an efficiency of 100%, possesses the smallest variance among all the estimators. A good or efficient estimator is one which is near to the optimum value. For example, the arithmetic mean is a more efficient estimator of the location of a sample than the median.
interval estimator
Estimator that, in contrast to the point estimator, gives an estimate of a parameter in terms of a range of values defined by the upper and lower limits of the range.
L-estimator
Robust estimator based on a linear combination of ordered statistics. It involves ordering the observations from smallest to largest and substituting the extreme values by values closer to the center of the scale. This means that extreme observations receive zero weight, while others might receive weights higher than one. The group of the L-estimators, for example, contains trimmed estimators and Winsorized estimators, the median as location estimator and the quartile deviation as scale estimator.
L1 estimator : least absolute value estimator
L2 estimator : least squares estimator
least absolute value estimator (: L1 estimator)
Estimator that minimizes the absolute value of a function of the parameters and observations. For example, the least absolute residuals regression estimates the regression coefficients by minimizing the sum of absolute residuals.
least squares estimator (: L2 estimator)
Estimator that minimizes a quadratic function of the parameters and observations. For example, the ordinary least squares regression estimates the regression coefficients by minimizing the sum of squared residuals.
linear estimator
Estimator that is a linear function of the observations. For example, in regression a linear estimator calculates the estimated response as a linear combination of the
observed responses with coefficients which are independent of the response variable and include only the predictor variables:

ŷ = W y

where W is the hat matrix. For linear regression estimators the cross-validated residuals can be calculated by the simple formula

e_{cv,i} = e_i / (1 - w_ii)

instead of the costly leave-one-out technique. Examples are ordinary least squares regression, ridge regression, and principal components regression.
M-estimator
Robust estimator that is a generalization of the maximum likelihood estimator. For example, the biweight is an M-estimator of location and the median absolute deviation from the median is an M-estimator of scale. In regression this estimator minimizes a function of the residuals that is different from the sum of squares:

min Σ_{i=1,n} ρ(e_i)

where ρ is a symmetric function with a unique minimum at 0. Differentiating this expression with respect to the regression coefficients yields:

Σ_{i=1,n} ψ(e_i) x_i = 0

where ψ is the derivative of ρ and is an influence curve. As the solution to this equation is not scale invariant, the residuals must be standardized by some estimate of the error standard deviation.
maximum likelihood estimator
Estimator that maximizes the likelihood function of the sample, e.g. maximum likelihood factor analysis, maximum likelihood cluster analysis.
nonlinear estimator
Estimator that, in contrast to the linear estimator, is a nonlinear function of the observations. Cross-validation of nonlinear estimators cannot be calculated by simple formulas. For example, PLS is a nonlinear estimator.
nonparametric estimator (: distribution-free estimator)
Estimator that is based on a function of the sample observations whose corresponding random variable has a distribution that does not depend on complete specification of the underlying population distribution. These estimators do not rely
for their validity or their utility on any assumptions about the form of the underlying distribution that produced the sample. In contrast to the parametric estimator, which is based on a knowledge of the distribution parameters, the nonparametric estimator is valid under very general assumptions about the underlying distribution. Most of them are based on simple ranking and randomization strategies, often also showing good efficiency and robustness. They are especially appropriate for small samples and for data sets with many missing values. For example, the sample mean and sample median are nonparametric estimators of the population mean and median, respectively. However, the mean illustrates that a nonparametric estimator is not necessarily a robust estimator.
parametric estimator
Estimator based on known distribution parameters, i.e. it assumes an underlying distribution. It is valid only if the underlying distribution satisfies the assumption. Robust estimators relax such a strict requirement.
point estimator
Estimator that, in contrast to the interval estimator, gives a single value as the estimate of a parameter.
quadratic estimator
Estimator that is a quadratic function of the observations. An example is the variance.
R-estimator
Robust estimator, derived from rank tests, that weights the objects or residuals by their rank and by a monotone score function.
resistant estimator
Estimator that is insensitive to small changes in the sample (small change in all, or large changes in a few of the values). The underlying distribution does not enter at all. This estimator is particularly appropriate for exploratory data analysis. This estimator is distinct from the robust estimator; however, the underlying concept is connected to ruggedness.
robust estimator
An extension of classical parametric estimators which takes into account that parametric models are only approximations to reality. It is not only valid under strict parametric models but also in a neighborhood of such parametric models. It has optimal or nearly optimal efficiency at the assumed model, is resistant to small deviations from the model assumptions, and does not suffer a breakdown in case of large deviations. Robustness means insensitivity to small deviations from the distributional assumptions, i.e. tolerance to outliers.
Robust estimators should not be confused with nonparametric estimators, although a few nonparametric procedures happen to be very robust. A robust estimator allows approximate fulfillment of strict assumptions, while a nonparametric estimator makes weak assumptions. The goals of robust estimators are: to describe the structure best fitting the bulk of the data; to identify and mitigate outliers and leverage points; to reveal any deviation from the assumed correlation structure. The philosophy of the robust estimator is the opposite of that of outlier detection. In the former the outliers are revealed by the residuals after fitting the model, whereas in the latter the outliers are diagnosed before the model is fitted. There are three basic types of robust estimators: L-estimator, M-estimator, and R-estimator.
sufficient estimator
Estimator that contains all the information in the sample relevant to the parameter. In other words, given the estimator the sample distribution does not depend on the parameter.
trimmed estimator
L-estimator that omits extreme values at both ends of the ordered sample. It involves ordering the observations from smallest to largest and excluding, say, m of them at both ends. This way the first and the last m observations receive zero weights. For example, the m-trimmed mean of an ordered sample is:

x̄ = Σ_{i=m+1,n-m} x_i / (n - 2m)
unbiased estimator
Estimator in which the expected estimated value is equal to the true value of the parameter:

E[θ̂] = θ    and    B = 0

Examples are: ordinary least squares regression, and the estimator of the population variance as

s² = (1/(n - 1)) Σ_{i=1,n} (x_i - x̄)²

Winsorized estimator
L-estimator that replaces rather than omits extreme values at both ends of the ordered sample. Winsorized estimators are obtained by first ordering the observations, then replacing the first, say, m observations with the (m + 1)th one, and the last m observations with the (n - m)th one. This way the first and the last m observations receive zero weights, while the weights of the (m + 1)th and (n - m)th observations become m + 1 times larger. For example, the m-Winsorized mean of an ordered sample is:

x̄ = ( (m + 1)(x_{m+1} + x_{n-m}) + Σ_{i=m+2,n-m-1} x_i ) / n
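A minimal Python sketch of the kernel density estimator defined in the estimator entry above, using a Gaussian kernel; the data, bandwidths and helper names are invented for illustration:

import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x_grid, data, h):
    """Univariate kernel density estimate f(x) = (1/nh) sum K((x - x_i)/h)."""
    n = data.size
    u = (x_grid[:, None] - data[None, :]) / h
    return gaussian_kernel(u).sum(axis=1) / (n * h)

rng = np.random.default_rng(5)
data = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(1, 1.0, 200)])
grid = np.linspace(-4, 4, 200)

for h in (0.1, 0.3, 1.0):          # the bandwidth controls the smoothness
    f_hat = kde(grid, data, h)
    print(h, round(np.trapz(f_hat, grid), 3))   # each estimate integrates to about 1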
► Euclidean distance [GEOM] → distance (: quantitative data)

► Euclidean norm [ALGE] → norm (: vector norm)

► evaluation set [PREP] → data set

► evolutionary operation (EVOP) [EXDE] Methodology for improving the running of production processes by small scale, on-line experimentation. In other words, it is an ongoing mode of using an operating full-scale process so that information on how to improve the process is generated from a simple experimental design while production is under way. To avoid appreciable changes in the characteristics of the product only small changes are made in the levels of the factors.

► evolutionary process [TIME] → stochastic process (: stationary process)
► evolving factor analysis (EFA) [FACT] Factor analysis, or more specifically principal component analysis, applied to submatrices to obtain a model-free resolution of mixtures. The n rows of the data matrix, for example, can be spectra ordered in time. Submatrices are factored first with rows i = 1, L, where L increases in time, and then with rows i = n, L, where L decreases in time. The rank analysis of each submatrix is performed graphically by plotting the eigenvalues of the submatrices in the forward and backward directions. The forward plot indicates the time when a component appears, while the backward plot shows when a component disappears in time. As each nonzero eigenvalue represents a pure component in the mixture, the appearance and disappearance of these eigenvalues define regions, called concentration windows, in the mixture matrix corresponding to the components. This time-dependent evolution of the total rank of the data matrix of mixtures that yields concentration windows for the components is the essence of evolving factor analysis.

[Figure: forward and backward eigenvalue plots against time, showing when each component appears and disappears.]

The concentration profiles and the absorption spectra of the components
can be calculated from the concentration windows on the basis of the mathematical and the chemical decomposition of the data matrix.
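A minimal Python sketch (not from the handbook) of the forward and backward rank analysis on a simulated two-component mixture; the concentration profiles, spectra and noise level are invented:

import numpy as np

rng = np.random.default_rng(6)
time = np.arange(40)

# Two-component mixture: concentration profiles (rows ordered in time) times spectra.
c1 = np.exp(-0.5 * ((time - 12) / 4.0) ** 2)
c2 = np.exp(-0.5 * ((time - 25) / 4.0) ** 2)
C = np.column_stack([c1, c2])                    # (n, 2) concentration profiles
S = rng.uniform(size=(2, 30))                    # (2, p) pure spectra
X = C @ S + rng.normal(scale=1e-3, size=(40, 30))

def leading_eigenvalues(M, k=3):
    return np.linalg.svd(M, compute_uv=False)[:k] ** 2

# Forward analysis: submatrices with rows 1..L; backward: rows L..n.
forward = np.array([leading_eigenvalues(X[:L]) for L in range(3, 41)])
backward = np.array([leading_eigenvalues(X[L:]) for L in range(0, 38)])

# The forward plot shows when a component appears (an eigenvalue rises above the
# noise level); the backward plot shows when a component disappears.
print(np.round(np.log10(forward[::7]), 1))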
► exact chi squared test [TEST] : Fisher-Yates test

► exceedance test [TEST] → hypothesis test

► exogenous variable [PREP] → variable
► expected mean squares in ANOVA (EMS) [ANOVA] Column in the analysis of variance table containing the expected value of the mean square of a term. If a term is a random effect term, the denominator of the F-ratio used in hypothesis testing is not necessarily the error mean square, but the mean square of another term or a linear combination of terms. The expected mean squares help to determine the appropriate denominator. For example, the expected mean squares of the terms in a balanced two-way ANOVA model with fixed effect terms are:

EMS_A = σ² + J K Σ_i A_i² / (I - 1)
EMS_B = σ² + I K Σ_j B_j² / (J - 1)
EMS_AB = σ² + K Σ_i Σ_j AB_ij² / ((I - 1)(J - 1))

in a two-stage nested ANOVA model with fixed effect terms they are:

EMS_A = σ² + J K Σ_i A_i² / (I - 1)
EMS_B(A) = σ² + K Σ_i Σ_j B_j(i)² / (I (J - 1))

while the expected mean squares of a balanced two-way ANOVA model with both terms being random effect terms are:

EMS_A = σ² + J K σ_A² + K σ_AB²
EMS_B = σ² + I K σ_B² + K σ_AB²
EMS_AB = σ² + K σ_AB²

and in a two-stage nested ANOVA model with random effect terms the expected mean squares are:

EMS_A = σ² + J K σ_A² + K σ_B(A)²
EMS_B(A) = σ² + K σ_B(A)²
► expected squared error (ESE) [MODEL] → goodness of fit

► expected value [ESTIM] Mean of a probability distribution. Mean value of a random variable obtained by repeated sampling; denoted by E[x]. For observed data the expectation operation is replaced by summation.

► experiment [EXDE] → experimental design
► experimental design [EXDE] Statistical procedure of planning an experiment, i.e. collecting appropriate data which, after analysis by statistical methods, result in valid conclusions. It includes selection of experimental units, specification of experimental conditions, i.e. specification of factors the effect of which will be studied on the outcome of the experiment, and specification of levels (values) of these factors and their combinations under which the experiment will be run, selection of response to be measured, and choice of statistical model to fit the data.
An experiment consists of recording the values of a set of variables from a measurement process under a given set of experimental conditions. Experiments carried out simultaneously to compare the effects of factors in two or more experiments via hypothesis testing are called comparative experiments. The analysis of experiment is the part of experimental design where effects (parameters of the model) are estimated, the response surface is optimized, hypotheses are tested about the significance of the effects, confidence intervals are constructed, and inferences, conclusions and decisions are made. The design of the experiment and the statistical analysis of the data are closely related since the latter depends on the former. Three basic principles of experimental design are replication, randomization and blocking. As statistical experimental design was developed for and first applied to agricultural problems, much of the terminology is derived from this agricultural background, e.g. treatment, plot, block, etc.
► experimental run [EXDE] → design matrix

► experimental unit [EXDE] → design matrix

► experimental variable [PREP] → variable

► expert system [MISC] Computer program that relies on a body of knowledge to perform a difficult task usually undertaken only by human experts, i.e. to emulate the problem-solving process of human experts. Many expert systems are defined in the framework of artificial intelligence techniques and operate by using rule-based deduction within a particular specified domain. An expert system usually consists of three modules: a knowledge base, a control structure, and a user-oriented interface. The knowledge base is composed of examples, i.e. classes of objects and relationships between them, and rules, i.e. propositions from which to infer new classes and relationships. The control structure, also called the inference engine, is the part of the program devoted to searching for a solution to a problem. Finally, the user-oriented interface allows the user to know how and why a particular solution is proposed. Examples of expert systems used with chemical data are: ASSISTANT and EX-TRAN.

► explained variance [MODEL] → goodness of fit

► explanatory variable [PREP] → variable
► exploratory data analysis (EDA) [DESC] → data analysis

► exploratory factor analysis [FACT] → factor analysis

► exponential distribution [PROB] → distribution

► exponential regression model [REGR] → regression model

► external failure cost [QUAL] → quality cost

► externally Studentized residual [REGR] → residual

► extracted error (XE) [FACT] → error terms in factor analysis
► F distribution [PROB] → distribution

► F-ratio in ANOVA [ANOVA] Column in the analysis of variance table containing a ratio of mean squares. The F-ratio corresponding to a term in ANOVA is used to test the hypothesis that the effect of the term is zero. In the case of a fixed effect term the null hypothesis is

H0: A_i = 0 for all i

while in the case of a random effect term the null hypothesis is

H0: σ_A² = 0

The numerator of the F-ratio is always the mean square of the term being tested; the denominator is often the mean square of another term. In fixed effect models the denominator is always the error term. In random effect models or in mixed effect models one must find the appropriate term or linear combination of terms for the denominator based on the expected mean squares. The term for the denominator of the F-ratio is chosen such that its expected mean square differs from the expected mean square in the numerator only by the variance component or the fixed effect tested. For example, in a fixed two-way ANOVA model the F-ratios corresponding to the two main effect terms and the interaction term are:

F_A = MS_A / MS_E    F_B = MS_B / MS_E    F_AB = MS_AB / MS_E

while in a random two-way ANOVA model the corresponding F-ratios are:

F_A = MS_A / MS_AB    F_B = MS_B / MS_AB    F_AB = MS_AB / MS_E

The F-ratios for the terms of a two-stage nested ANOVA model are:

F_A = MS_A / MS_B(A)    F_B = MS_B(A) / MS_E

► F test [TEST] → hypothesis test
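A small Python sketch computing the mean squares and F-ratios of a balanced two-way fixed effect ANOVA directly from the sums of squares; the data and the layout (I, J, K) are simulated for illustration and the calculation follows the fixed effect case above:

import numpy as np

rng = np.random.default_rng(7)
I, J, K = 3, 4, 5                               # levels of A, levels of B, replicates
A_eff = np.array([-1.0, 0.0, 1.0])
B_eff = np.array([-0.5, 0.0, 0.2, 0.3])
y = (5.0 + A_eff[:, None, None] + B_eff[None, :, None]
     + rng.normal(scale=1.0, size=(I, J, K)))

grand = y.mean()
a_means = y.mean(axis=(1, 2))
b_means = y.mean(axis=(0, 2))
cell_means = y.mean(axis=2)

SSA = J * K * ((a_means - grand) ** 2).sum()
SSB = I * K * ((b_means - grand) ** 2).sum()
SSAB = K * ((cell_means - a_means[:, None] - b_means[None, :] + grand) ** 2).sum()
SSE = ((y - cell_means[:, :, None]) ** 2).sum()

MSA, MSB = SSA / (I - 1), SSB / (J - 1)
MSAB = SSAB / ((I - 1) * (J - 1))
MSE = SSE / (I * J * (K - 1))

# Fixed effect model: every term is tested against the error mean square.
print("F_A =", MSA / MSE, " F_B =", MSB / MSE, " F_AB =", MSAB / MSE)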
► factor [EXDE] Independent variable in experimental design corresponding to an experimental condition that has an effect on the outcome of the experiment. It constitutes a column in the design matrix. In a regression context factors correspond to predictors. A value that a factor can take on is called a level. A combination of factor levels is called a treatment or cell. The phenomenon where the effect of a factor on the response depends on the level of other factors is called interaction. In case of interaction the factors are not additive, i.e. the estimated effects do not add up to the total variance, and interaction terms must be inserted in the model fitted to the design. The interaction terms, or simply interactions, similar to factors, are columns in the design matrix. They are generated by multiplying the columns corresponding to the factors in the interaction. A factor is called a fixed factor if it takes on only a fixed number of levels. An interaction that contains only fixed factors is also fixed. The effect of such a factor or interaction is called a fixed effect; the corresponding variable in the design matrix is not a random variable. No inference can be drawn in the case of fixed factors, i.e. the conclusions of the statistical analysis cannot be extended beyond the
factor levels under study. An experiment in which all factors are fixed is called a controlled experiment. A factor is called a random factor if it has a large number of possible levels and only a sample of levels is selected randomly from the population. An interaction that contains one or more random factors is also considered to be random. The effect of such a factor or interaction is considered a random effect; the corresponding variable in the design matrix is a random variable. In this case, when the factor levels under study are chosen randomly, inferences can be drawn about the entire factor level population.

► factor [FACT] : common factor
► factor analysis [FACT] (: abstract factor analysis) Multivariate statistical method originally developed for the explanation of psychological theories. The goal of factor analysis is to express a set of variables linearly in terms of (usually) a small number of common factors, i.e. obtaining a parsimonious description of the observed or measured data. These common factors are assumed to represent underlying phenomena that are not directly measurable. The two types of factor analysis are exploratory factor analysis and confirmatory factor analysis. In the former case one has no a priori knowledge of the number or the composition of the common factors. In the latter case the goal is to confirm the existence of suspected underlying factors. The factor model is:

x_j = Σ_{m=1,M} l_jm f_m + u_j    j = 1, p

where each of the p observed variables x_j is described as a linear combination of M common factors f_m with coefficients l_jm and a unique factor u_j. The coefficients l_jm are called factor loadings. The common factors account for the correlation among the variables, while the unique factor covers the remaining variance, called the specific variance (or unique variance). The unique factor is also called residual error. The unique factors are uncorrelated among themselves and with the common factors. The number of factors M determines the complexity of the factor model. There are two steps in factor analysis: factor extraction and factor rotation. In the first step the underlying, non-observable, non-measurable latent variables are identified, while in the second step the extracted factors are rotated to obtain more meaningful, interpretable factors. The factor extraction involves estimating the number of common factors by rank analysis and calculating their coefficients, the factor loadings. The most popular factor extraction method is principal component analysis, where the unique factors are all assumed to be zero and the analysis is based on the maximum variance criterion. Other factor extraction methods include principal factor analysis, maximum likelihood factor analysis, and minres. Evolving factor analysis, correspondence factor analysis and key set factor analysis are special methods, which are often applied to chemical data.

► factor extraction [FACT] → factor analysis
► factor loading [FACT] (: loading) Coefficients of the common factors in the factor analysis model. The factor loading l_jm is the correlation between the jth variable and the mth common factor. In principal component analysis the factor loadings are the elements of the eigenvectors of the covariance or correlation matrix multiplied by the square root of the corresponding eigenvalue. The contribution of the mth common factor to the total variance of the variables is

Σ_{j=1,p} l_jm²

This quantity is the eigenvalue in principal component analysis. The variance of a variable described by the common factors is the squared communality:

h_j² = Σ_{m=1,M} l_jm²

The matrix L(p, M) of the factor loadings, in which each row corresponds to a variable and each column to a common factor, is called the factor pattern. The reproduced correlation matrix can be calculated from the factor loadings as

R̂ = L Lᵀ

The difference between the observed correlation matrix R and the reproduced correlation matrix R̂ is called the residual correlation matrix.

► factor model [FACT] → factor analysis
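A minimal Python sketch of the factor loading quantities defined above, using principal component loadings obtained from a correlation matrix together with the communalities and the reproduced and residual correlation matrices; the data are simulated and the code is illustrative only:

import numpy as np

rng = np.random.default_rng(8)
n, p, M = 200, 6, 2

# Simulated data with two underlying common factors.
F = rng.normal(size=(n, M))
X = F @ rng.normal(size=(M, p)) + 0.5 * rng.normal(size=(n, p))

R = np.corrcoef(X, rowvar=False)                 # observed correlation matrix
eigval, eigvec = np.linalg.eigh(R)
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

# Loadings: eigenvectors scaled by the square roots of their eigenvalues.
L = eigvec[:, :M] * np.sqrt(eigval[:M])

communalities = (L ** 2).sum(axis=1)             # variance explained per variable
R_hat = L @ L.T                                  # reproduced correlation matrix
residual = R - R_hat                             # residual correlation matrix

print(np.round(communalities, 2))
print(np.round(np.abs(residual).max(), 3))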
► factor pattern [FACT] → factor loading

► factor rotation [FACT] (: factor transformation, rotation) The second step in factor analysis, which rotates the M common factors (or components) calculated in the factor extraction step into more interpretable factors. The various factor rotations are guided by one of the following objectives:
- Simple structure: the number of variables correlated with (loaded on) a factor is small, each variable is a combination of only a few (preferably one) factors, and each factor accounts for about the same amount of variance.
- Factorial invariance: a factor is identified by a marker variable, such that, together with the marker variable, the same cluster of variables loads into the factor. This facilitates the comparison of factors extracted from somewhat different sets of variables.
- Hypothetical structure: testing the existence of hypothetical factors by matching them with the extracted factors.
- Partialling: dividing the total variance into partial variances due to unique factors. The influence of a variable can be separated by totally correlating it with (rotating it into) a factor.
- Causal analysis: trying various linear combinations of rotated factors to best predict a response (e.g. PLS).
The various factor rotation techniques can be divided into three major groups: graphical rotation (or subjective rotation), analytical rotation (or objective rotation), and target rotation. With the availability of powerful computers, the first group is seldom used any longer. The most common rotation techniques belong to the second group, which can be further divided into orthogonal rotations and oblique rotations (nonorthogonal rotations).

FACTOR ROTATION
- graphical rotation
- analytical rotation
  - orthogonal rotation: biquartimax, equimax, orthomax, quartimax, varimax
  - oblique rotation: binormamin, biquartimin, covarimin, equimin, maxplane, oblimax, oblimin, promax, quartimin
- target rotation
The differences between orthogonal and oblique rotations are:
- Orthogonal rotation results in uncorrelated and linearly independent factor scores, oblique rotation does not.
- Orthogonal rotation gives orthogonal (but not necessarily uncorrelated) factor loadings, oblique rotation does not.
- Orthogonal rotation moves the whole factor structure in a rigid frame around the origin, oblique techniques rotate each factor separately.
- Communalities are unchanged in orthogonal rotation, not in oblique rotation.
Below is a list of the most popular analytical rotations. The loadings are normalized with the corresponding communality:
s_jm = l_jm / h_j    j = 1, p    m = 1, M
binormamin rotation
Oblique rotation, based on a modification of the biquartimin criterion. It minimizes the ratio of covariances of squared loadings to their inner products.
biquartimax rotation
Orthogonal rotation, a subcase of orthomax rotation with γ = 0.5.
biquartimin rotation
Oblique rotation, a subcase of oblimin rotation with γ = 0.5.
covarimin rotation
Oblique rotation corresponding to varimax rotation. It minimizes the covariances of the squared factor loadings scaled by the communalities.
equimax rotation
Orthogonal rotation, a subcase of orthomax rotation with γ = M/2.
equimin rotation
Oblique rotation, a subcase of oblimin rotation with γ = M/2.
maxplane rotation
Oblique rotation that increases the high and near-zero loadings in a factor based on maximizing its hyperplane count. The hyperplane of a factor is the plane viewed edgewise, i.e. the plane formed by the factor with M - 1 other factors. The count is the number of objects lying on this plane. Maxplane considers one pair of factors at a time and rotates them until the optimum position is found in terms of maximum counts. This technique is optimal for large samples and yields a solution closest to the graphical rotation.
oblimax rotation
Oblique rotation that maximizes a criterion similar to the one in quartimax rotation, called the kurtosis function:

max K = Σ_j Σ_m s_jm⁴ / (Σ_j Σ_m s_jm²)²
oblimin rotation
Oblique rotation, similar to orthomax rotation, that comprises several techniques as subcases. It combines the criteria of quartimin and covarimin rotations:

min Σ_{m<k} [ Σ_j s_jm² s_jk² - (γ/p) (Σ_j s_jm²)(Σ_j s_jk²) ]

If γ = 1 it is the covarimin rotation; if γ = 0 it is the quartimin rotation; if γ = 0.5 it is the biquartimin rotation, and if γ = M/2 it is the equimin rotation.
orthoblique rotation
Oblique rotation performed through a sequence of orthogonal rotations. This composite rotation is made up of orthogonal transformation matrices and diagonal matrices.
orthomax rotation
Orthogonal rotation that comprises several popular rotation methods as subcases. It combines the criteria of varimax and quartimax rotations:

max Σ_m [ Σ_j s_jm⁴ - (γ/p) (Σ_j s_jm²)² ]

If γ = 0 it is the quartimax rotation; if γ = 1 it is the varimax rotation; if γ = 0.5 it is the biquartimax rotation, and if γ = M/2 it is the equimax rotation.
promax rotation
Oblique rotation that starts with a varimax rotation and then relaxes the orthogonality requirement to achieve a simpler structure. The high loadings of varimax are made even higher and the low ones even lower by normalizing the orthogonal loading matrix by rows and columns and taking the kth power of each loading (k = 4 is recommended). The final step is to find the least squares fit to the ideal matrix using the Procrustes technique.
quartimax rotation
Orthogonal rotation that maximizes the variance of squared factor loadings of a variable, i.e. it tries to simplify the rows (corresponding to the variables) of the factor pattern:

max Σ_j Σ_m s_jm⁴

The goal is to express each variable as a linear combination of not all, but of only a (different) subset of the common factors. This rotation preferentially increases the large factor loadings and decreases the small ones for each variable. A shortcoming of this rotation is that it tends to produce one large general factor.
quartimin rotation
Oblique rotation corresponding to quartimax rotation. It minimizes the cross-products of the squared factor loadings:

min Σ_{m<k} Σ_j s_jm² s_jk²

varimax rotation
Orthogonal rotation that maximizes the variance of squared factor loadings in a common factor, i.e. it tries to simplify the columns (corresponding to the factors) of the factor pattern:

max Σ_m [ Σ_j s_jm⁴ - (1/p) (Σ_j s_jm²)² ]

The goal is to obtain common factors that are composed of only a few variables. This rotation further increases the large factor loadings and large eigenvalues and further decreases the small ones in each factor. Equal weighting of the variables is achieved by scaling the loadings with the communalities.
The goal of target rotation (also called target transformation factor analysis (TTFA)) is to match two factor models; either two calculated solutions or a calculated solution with a hypothetical one. The following types of matching are of practical interest:
- Relating two different factor solutions calculated from the same correlation matrix, i.e. finding a transformation matrix M that brings the factor loading matrix L into the loading matrix L*:

L M = L*
if M = p then M = L⁻¹ L*
if M ≠ p then M = (Lᵀ L)⁻¹ Lᵀ L*

- Relating two different factor solutions calculated from different correlation matrices. When the variables are fixed but the objects differ in the two data sets, the coefficient of congruence measuring the degree of factorial similarity between factor m from the first set and factor k from the second set is calculated as a correlation of factor loadings:

c_mk = Σ_j l_jm l_jk / √( (Σ_j l_jm²)(Σ_j l_jk²) )
Similarly, when the objects are fixed but the variables are different, the coefficient of congruence is calculated from the factor scores as:

c_mk = Σ_i f_im f_ik / √( (Σ_i f_im²)(Σ_i f_ik²) )

- Fitting a calculated factor solution L to a hypothetical one L*, called Procrustes transformation. The orthogonal Procrustes transformation gives a least squares fit between L M and L* by carrying out a series of planar rotations, arranged in a systematic order so that each factor axis is rotated with every other axis only once in a cycle; these cycles are repeated until convergence is reached. The least squares criterion is a function of the rotation angle θ and is summed over the variables j that have specified weights in factor m in the target matrix L*. The oblique Procrustes transformation allows for correlations among rotated factors.
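A compact Python sketch of an analytical orthogonal rotation. It uses the common SVD-based varimax/orthomax algorithm, which is one standard way of implementing the criterion above but is not necessarily the procedure the handbook has in mind; the example loading matrix is invented:

import numpy as np

def varimax(L, gamma=1.0, max_iter=100, tol=1e-6):
    """Rotate a loading matrix L (p, M) by the orthomax criterion (gamma = 1: varimax)."""
    p, M = L.shape
    R = np.eye(M)
    d = 0.0
    for _ in range(max_iter):
        LR = L @ R
        grad = LR ** 3 - (gamma / p) * LR @ np.diag((LR ** 2).sum(axis=0))
        u, s, vt = np.linalg.svd(L.T @ grad)
        R = u @ vt
        d_new = s.sum()
        if d_new < d * (1.0 + tol):
            break
        d = d_new
    return L @ R

L = np.array([[0.7, 0.3], [0.6, 0.4], [0.8, 0.2],
              [0.3, 0.7], [0.2, 0.8], [0.4, 0.6]])
print(np.round(varimax(L), 2))       # loadings move toward a simpler structure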
► factor score [FACT] → common factor

► factor score coefficient [FACT] → common factor

► factor transformation [FACT] : factor rotation

► factorial design (FD) [EXDE] → design

► failure function [PROB] → random variable

► feature [PREP] : variable

► feature reduction [MULT] → data reduction

► feedforward network [MISC] → neural network

► fi coefficient [GEOM] → distance (: binary data)

► filtering [TIME] → time series

► first kind model [ANOVA] → term in ANOVA

► first-order design [EXDE] → design

► first-order regression model [REGR] → regression model

► fishbone diagram [GRAPH] : cause-effect diagram

► Fisher's discriminant analysis [CLAS] → discriminant analysis

► Fisher's distribution [PROB] → distribution

► Fisher-Yates's test [TEST] → hypothesis test
► five number summary [DESC] Five statistics used to summarize a set of observations. They are: the two extremes (minimum and maximum), the median and the two hinges. A hinge is a data point half way between an extreme and the median on ranked data. These five statistics are often displayed in a box plot.
► fixed effect [ANOVA] → term in ANOVA

► fixed effect model [ANOVA] → term in ANOVA

► fixed factor [EXDE] → factor

► fixed percentage of explained variance [FACT] → rank analysis
► Fletcher-Reeves formula [OPTIM] → conjugate gradient optimization

► folded design [EXDE] → design

► fold-over design [EXDE] → design

► forest [MISC] → graph theory

► Forgy clustering [CLUS] → nonhierarchical clustering (: optimization clustering)

► forward elimination [ALGE] → Gaussian elimination

► forward selection [REGR] → variable subset selection

► four-point condition [GEOM] → distance

► fractile [DESC] : quantile

► fraction [EXDE] A subset of all possible treatments in a complete factorial design, i.e. the treatments in a fractional factorial design. The size of a fraction is calculated as 1/2^J, where J denotes the number of factors in the design generated by confounding. For example, the 2^(4-1) design (J = 1) contains a half fraction, and the 2^(5-2) design (J = 2) contains a quarter fraction. The fraction in which all design generators have positive signs is called the principal fraction.

► fractional factorial design (FFD) [EXDE] → design
► frequency [DESC] The number of observations that take on the same distinct value or fall into the same interval. If the frequency is expressed as the percentage of all observations, it is called relative frequency. The sum of frequencies up to a certain distinct value or interval, i.e. the number of observations taking on values less than or equal to a
certain value, is called the cumulative frequency. The function of frequencies vs. the ordered distinct values or intervals is called the frequency distribution. It can be displayed numerically in a frequency table, or graphically in a bar plot, frequency polygon or stem-and-leaf diagram. Frequency distributions can be univariate or multivariate. The most frequently analyzed frequency tables are bivariate tables, called contingency tables, where the rows correspond to the distinct values of one qualitative variable and the columns to those of the other variable. The entry in the ijth cell is denoted f_ij:
[Table: example two-way contingency table with cell frequencies f_ij, row marginal frequencies f1. = 27, f2. = 25, f3. = 23, and the corresponding column marginals.]
The row-wise and column-wise sums of the frequency entries are called the marginal frequencies, denoted f_i. and f_.j. The ratio of a frequency and the corresponding marginal frequency is called the conditional frequency. The most important question in the analysis of a contingency table is whether the qualitative variables are independent or not. It can be answered by calculating the estimated frequencies, which in cell ij is

f̂_ij = f_i. f_.j / f

where f is the total number of observations. In a cell of a contingency table the difference between the actual and estimated frequencies is called the contingency and is calculated as:

c_ij = f_ij - f̂_ij

The squared contingency follows a chi squared distribution and is used to test the independence of the two variables. Contingency is the basis of correspondence factor analysis. Although two-dimensional contingency tables are the more common ones, multidimensional tables arising from three or more qualitative variables can also be analyzed.
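A small Python sketch of the estimated frequencies, contingencies and the chi squared test of independence for an invented two-way frequency table (the call to scipy is an assumption about the available environment):

import numpy as np
from scipy.stats import chi2

f = np.array([[12,  5,  9],
              [ 8, 15,  7],
              [ 6,  4, 14]], dtype=float)        # observed frequencies f_ij

row = f.sum(axis=1, keepdims=True)               # marginal frequencies f_i.
col = f.sum(axis=0, keepdims=True)               # marginal frequencies f_.j
total = f.sum()

f_hat = row @ col / total                        # estimated frequencies under independence
contingency = f - f_hat
chi2_stat = (contingency ** 2 / f_hat).sum()
dof = (f.shape[0] - 1) * (f.shape[1] - 1)

print(chi2_stat, 1.0 - chi2.cdf(chi2_stat, dof)) # statistic and p-value of the test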
► frequency count scale [PREP] → scale (: ratio scale)

► frequency distribution [DESC] → frequency

► frequency function [PROB] → random variable
► frequency polygon [GRAPH] Graphical display of the frequency distribution of a categorical variable; similar to the bar plot. The frequencies of values are plotted against the values of the categorical variable. The points are connected by constructing a polygon.
► frequency table [DESC] → frequency

► Friedman-Rubin clustering [CLUS] → nonhierarchical clustering (: optimization clustering)

► Friedman's test [TEST] → hypothesis test

► Frobenius norm [ALGE] → norm (: matrix norm)

► full factorial design [EXDE] → design

► function estimation [ESTIM] → estimation

► furthest neighbor linkage [CLUS] → hierarchical clustering (: agglomerative clustering)
► fuzzy clustering [CLUS] Clustering, based on fuzzy set theory, resulting in a partition in which the assignment of objects to clusters is not exclusive. A membership function, ranging from 0 to 1, is associated with each object. It indicates how strongly an object belongs to the various clusters. Fuzzy clustering techniques search for the minimum of objective functions built from membership-weighted pairwise distances, for example sums over clusters g of terms of the form

m_sg^k m_tg^k d_st

where d_st is the distance between objects s and t, m_sg and m_tg are the unknown membership functions of objects s and t in cluster g, and k is an empirically determined constant. Dunn's partition coefficient measures the degree of fuzziness in a partition as:

F = Σ_{i=1,n} Σ_{g=1,G} m_ig² / n

In case of a completely fuzzy partition (all m_ig = 1/G) Dunn's partition coefficient takes on its minimal value 1/G, whereas a perfect partition (all m_ig = 0 or 1) results in its maximal value of one. The normalized version of Dunn's partition coefficient is

F' = (G F - 1) / (G - 1)
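A minimal fuzzy c-means sketch in Python. Fuzzy c-means is one common fuzzy clustering algorithm and is used here only to illustrate membership functions and Dunn's partition coefficient; it is not necessarily the relational formulation described above, and the fuzziness exponent k, data and seeds are invented:

import numpy as np

def fuzzy_c_means(X, G=2, k=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    m = rng.dirichlet(np.ones(G), size=len(X))          # memberships, rows sum to 1
    for _ in range(n_iter):
        w = m ** k
        centers = (w.T @ X) / w.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        m = 1.0 / (d ** (2.0 / (k - 1.0)))
        m /= m.sum(axis=1, keepdims=True)
    return m, centers

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(0, 0.5, size=(30, 2)), rng.normal(3, 0.5, size=(30, 2))])
m, centers = fuzzy_c_means(X)

n, G = m.shape
F = (m ** 2).sum() / n                                   # Dunn's partition coefficient
print(round(F, 3), round((G * F - 1) / (G - 1), 3))      # raw and normalized values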
► fuzzy set theory [MISC] Extension of conventional set theory. In both cases the association of elements with a set is described by the membership function. While in conventional set theory the membership function takes on only two values (1 and 0, indicating an element belonging or not belonging to a subset), in fuzzy set theory the membership function can take on any value between 0 and 1. Commonly used membership functions have exponential, quadratic or linear forms:

m(x) = exp[ -(x - a)² / b² ]
m(x) = [ 1 - (x - a)² / b² ]+
m(x) = [ 1 - |x - a| / b ]+

where a and b are constants and the + sign indicates truncation of negative values to zero. Membership functions can also be defined for more than one variable. The elementary operations on fuzzy sets are:
- intersection A ∩ B: m_A∩B(x) = min[ m_A(x), m_B(x) ]
- union A ∪ B: m_A∪B(x) = max[ m_A(x), m_B(x) ]
- complement Ā: m_Ā(x) = 1 - m_A(x)
The cardinality of a set in common, finite sets is the number of elements in the set. In fuzzy sets the cardinality of a set is defined as:

card[A] = Σ_x m_A(x)    or    card[A] = ∫ m_A(x) dx

Fuzzy arithmetic is an extension of conventional arithmetic. For example, a simple addition A + B = C becomes:

m_C(z) = sup_{z = x + y} ( min[ m_A(x), m_B(y) ] )
where mc, mA and mB are membership functions of numbers C, A, B and sup stands for supremum. Fuzzy set theory is the basis of fuzzy data analysis. In statistical data analysis the outcome of an observation is vague prior to, but is determined after a measurement takes place. In fuzzy data analysis a measurement outcome is always vague. Fuzzy set theory has found a role in both classification and cluster analysis. In fuzzy clustering or fuzzy classification the assignment of objects to clusters or classes is not exclusive. Instead of a single cluster or class id, each object is characterized by a membership function that takes values between 0 and 1 over the clusters or classes. Fuzzy modeling and fuzzy logic are further applications of fuzzy set theory.
G

► G-fold cross-validation [MODEL] → model validation (: cross-validation)

► GM estimator [REGR] → robust regression

► Gabriel's test [TEST] → hypothesis test

► game theory [MISC] Branch of mathematics dealing with the theory of contest among several players. The events of the game are probabilistic. A strategy is a course of action which depends on the state of the game, including the previous action of an opponent. One of the most well known strategies is the minimax strategy, i.e. trying to minimize the maximum risk.

► gamma distribution [PROB] → distribution

► Gart's test [TEST] → hypothesis test

► Gauss transformation [ALGE] : elementary operation

► Gaussian distribution [PROB] → distribution
► Gaussian elimination [ALGE] (: elimination) Method that uses elementary operations to eliminate (zero) selected elements in a matrix. For example, it is the method of choice for solving a system of linear equations X b = y when X is square, dense and unstructured. Through a series of elementary operations X is transformed into a triangular matrix, for which the solution can be obtained with ease either by forward elimination (when X is lower triangular) or by back-substitution (when X is upper triangular). The Gaussian elimination mimics the process of eliminating unknowns from the system; in each step k some elements of X are set to zero. These steps can also be described in terms of matrix operations as

X_{k+1} = M_k X_k

where M_k is a matrix performing an elementary operation on X_k. The element under which all elements are eliminated in the kth column is called the pivot. If the pivot is not zero, then there is an elementary lower triangular matrix M_k that annihilates the last n - k elements of the kth column of X_k. Pivoting on the diagonal element is called sweeping. The procedure is unstable when a pivot element is very small. This problem can be alleviated by two strategies: complete pivoting or partial pivoting. The former method permutes both rows and columns of X_k; the kth step is:

X_{k+1} = M_k P_k X_k Q_k

where P_k is the row permutation matrix and Q_k is the column permutation matrix. The cheaper partial pivoting method allows only row permutations. The Gauss-Jordan method, a natural extension of the Gaussian elimination, annihilates all the off-diagonal elements in a column at each step, yielding a diagonal matrix.

► Gaussian kernel [ESTIM] → kernel
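A minimal Python sketch of Gaussian elimination with partial pivoting and back-substitution for a dense square system; it is illustrative only, not a production solver, and the test system is invented:

import numpy as np

def gaussian_elimination(X, y):
    """Solve X b = y by elimination with partial pivoting and back-substitution."""
    A = np.hstack([X.astype(float), y.reshape(-1, 1).astype(float)])
    n = len(y)
    for k in range(n - 1):
        # Partial pivoting: permute rows so the pivot has the largest magnitude.
        pivot = k + np.argmax(np.abs(A[k:, k]))
        A[[k, pivot]] = A[[pivot, k]]
        for i in range(k + 1, n):
            A[i, k:] -= (A[i, k] / A[k, k]) * A[k, k:]   # elementary row operation
    # Back-substitution on the resulting upper triangular system.
    b = np.zeros(n)
    for i in range(n - 1, -1, -1):
        b[i] = (A[i, -1] - A[i, i + 1:n] @ b[i + 1:]) / A[i, i]
    return b

X = np.array([[2.0, 1.0, -1.0], [-3.0, -1.0, 2.0], [-2.0, 1.0, 2.0]])
y = np.array([8.0, -11.0, -3.0])
print(gaussian_elimination(X, y), np.linalg.solve(X, y))   # the two solutions agree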
► Gauss-Jordan method [ALGE] → Gaussian elimination

► Gauss-Markov theorem [ESTIM] An important theorem stating that the least squares estimator is the best, i.e. has the minimum variance, among all linear unbiased estimators of a parameter. Such an estimator is called the best linear unbiased estimator (BLUE). For example, the least squares estimator is the BLUE for the coefficients of a linear regression, if the model is of full rank and the error is independent, identically distributed, and with zero mean. Notice that the error is not required to be normally distributed.
► Gauss-Newton optimization [OPTIM] Gradient optimization specifically used for minimizing a quadratic nonlinear function, e.g. the least squares estimator of the parameters in a nonlinear regression model. The special form of the objective function

F(p) = 0.5 fᵀ(p) f(p)

offers computational simplicity as compared to a general optimization method. The gradient vector of the quadratic function F(p) has the form

g(p) = Jᵀ(p) f(p)

where J is the Jacobian matrix. The second derivative matrix of F(p) has the form:

G(p) = Jᵀ(p) J(p) + fᵀ(p) H(p)

where H(p) is the Hessian matrix of f(p). Using these results the basic Newton step in the ith iteration has the form:

p_{i+1} = p_i - G_i⁻¹ J_iᵀ f_i

The Gauss-Newton optimization approximates the function f_k(p) by a Taylor expansion using only the first two terms. This results in the approximation of the second derivative matrix by its first term:

G(p) = Jᵀ(p) J(p)

A step in the ith iteration is calculated as p_{i+1} = p_i - d_i. Each iteration i requires the solution of the linear equation

J_i d_i = f_i

When J(p) is rank deficient it is difficult to solve the above equation. The Levenberg-Marquardt optimization method transforms J(p) to a better conditioned, full-rank matrix.

► gene [MISC] → genetic algorithm
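A minimal Gauss-Newton sketch in Python for fitting a single-exponential model y = a·exp(b·x) by least squares; the model, data and starting values are invented, and the step is obtained by solving J d = -f in the least squares sense:

import numpy as np

rng = np.random.default_rng(10)
x = np.linspace(0, 2, 30)
y = 2.0 * np.exp(1.3 * x) + rng.normal(scale=0.05, size=x.size)

p = np.array([1.0, 1.0])                       # starting values for (a, b)
for _ in range(20):
    a, b = p
    f = a * np.exp(b * x) - y                  # residual vector f(p)
    J = np.column_stack([np.exp(b * x),        # Jacobian of the residuals
                         a * x * np.exp(b * x)])
    # Gauss-Newton step: solve J d = -f in the least squares sense.
    d, *_ = np.linalg.lstsq(J, -f, rcond=None)
    p = p + d
    if np.linalg.norm(d) < 1e-10:
        break

print(p)                                       # close to the true values (2.0, 1.3)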
► generalized cross-validation (gcv) [MODEL] → goodness of prediction

► generalized distance [GEOM] → distance (: quantitative data)

► generalized inverse matrix [ALGE] → matrix operation (: inverse of a matrix)
► generalized least squares regression (GLS) [REGR] Modification of the ordinary least squares regression to deal with heteroscedasticity and correlated errors. It is used when the off-diagonal terms of the error covariance matrix Σ_ε are nonzero and the diagonal elements vary from observation to observation. The regression coefficients are estimated as:

b = (Xᵀ Σ_ε⁻¹ X)⁻¹ Xᵀ Σ_ε⁻¹ y

This is an unbiased estimator and has minimum variance among all unbiased estimators assuming normal errors. Weighted least squares regression (WLS) is a special case of the generalized least squares regression when the error covariance matrix is diagonal.
► generalized linear model (GLM) [REGR] → regression model

► generalized simulated annealing (GSA) [OPTIM] → simulated annealing

► generalized standard addition method (GSAM) [REGR] → standard addition method

► generalized variance [DESC] → multivariate dispersion

► generating optimal linear PLS estimation (GOLPE) [REGR] → partial least squares regression
► genetic algorithm (GA) [MISC] Search procedure for numerical optimization, sequencing and subset selection on the basis of mimicking the evolution process. This is a new and rapidly developing field that has great potential in data analysis applications. It starts with a population of artificial creatures, i.e. solutions (e.g. parameter values, selected variables) and, through genetic operations (crossover and mutation), the evolution results in an optimum population according to some objective function. The size of the initial population is specified (often between 50 and 500), its member solutions are selected randomly. In the original proposal each member solution in a population is coded in a bit string, called a chromosome. One bit in the chromosome is called a gene. There have been recent developments in real number genes. While genes are natural representations of binary variables, discrete and continuous variables must be coded. For example, if the problem is to select
optimal variables from among 100 potential variables, a solution chromosome has 100 bits, where those corresponding to the selected variables are set to 1 and the others to 0. In the problem of optimizing three continuous parameters where the first one can be coded into six, the second one into nine and the third one into five bits, the solution chromosomes contain 20 genes. The fitness of the members of a population is evaluated by calculating an objective function. Such a function, for example, could be the R2 value of a regression model with the selected variables, or the value of the response surface at the selected parameter combination. A termination criterion is selected a priori, to decide when a population is good enough to stop the evolution. In order to obtain the next generation (population), members from the present population are selected for mating according to their fitness. A mating list of the same size as the present population is composed. In selecting the next entry on the mating list the probability of a member being selected is proportional to its fitness and independent of whether it is already on the list or not. Members with high fitness are likely to appear more than once on the mating list, while members with low fitness will probably be omitted from the mating list altogether. Mating, i.e. recombining members on the mating list, is a random process. Given the list of mates, pairs are chosen randomly and, in another random step, it is decided whether they mate or not. Two parent members produce two offspring. The most popular mating strategies are: single crossover, two-point crossover, uniform crossover, circular crossover. To better explore the solution space, for example to avoid local optima, mutation also takes place in generating the new population. This is a low probability event that causes a small number of genes to flip to the opposite value. In contrast to traditional methods, GA works well on large size problems, in the case of a noncontinuous response surface, and in the presence of local optima. GA is capable of rapidly approaching the vicinity of the global optimum, but its final convergence is slow. It is often best to use GA to obtain an initial guess for a traditional optimization method. GA produces a whole population of solutions, each characterized by their fitness, thereby offering substitute solutions when the optimum one is for some reason unacceptable.
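A compact Python sketch of a genetic algorithm for variable subset selection with bit-string chromosomes, fitness-proportional mating, single-point crossover and low-probability mutation; the fitness function (R2 of an ordinary least squares fit on the selected variables), the population size and all other settings are invented for illustration:

import numpy as np

rng = np.random.default_rng(11)
n, p = 100, 20
X = rng.normal(size=(n, p))
y = X[:, [2, 5, 11]] @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=n)

def fitness(chrom):
    """R2 of an ordinary least squares fit using only the selected variables."""
    if not chrom.any():
        return 1e-6
    Xs = X[:, chrom.astype(bool)]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ beta
    tss = (y - y.mean()) @ (y - y.mean())
    return max(1e-6, 1.0 - (resid @ resid) / tss)

pop = rng.integers(0, 2, size=(50, p))                    # initial population of chromosomes
for generation in range(40):
    scores = np.array([fitness(c) for c in pop])
    probs = scores / scores.sum()
    parents = pop[rng.choice(len(pop), size=len(pop), p=probs)]   # mating list
    children = parents.copy()
    for i in range(0, len(children) - 1, 2):              # single-point crossover
        cut = rng.integers(1, p)
        children[i, cut:], children[i + 1, cut:] = (parents[i + 1, cut:].copy(),
                                                    parents[i, cut:].copy())
    mutate = rng.random(children.shape) < 0.01            # low-probability mutation
    children[mutate] = 1 - children[mutate]
    pop = children

best = pop[np.argmax([fitness(c) for c in pop])]
print(np.nonzero(best)[0])                                # should recover variables 2, 5, 11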
b geometric distribution [PROB] → distribution
b geometric mean [DESC] → location
b geometrical concepts [GEOM] Multivariate objects described by p variables can be represented as points in the p-dimensional hyperspace, where each coordinate corresponds to a variable. Such
variables can be the original measurements, or linear (e.g. PC) or nonlinear (e.g. NLM, MDS, PP) combinations of the original measurements. The number of coordinates, i.e. the number of variables, is called the dimensionality. The relative spatial position of the objects is called the configuration. Dimensionality reduction techniques try to display the high-dimensional configuration in two dimensions with as little distortion as possible. Distance measures and similarity indices provide pairwise quantification of the objects' relative positions.
b Gini index [MISC] → information theory
b Givens transformation [ALGE] → orthogonal matrix transformation
b glyph [GRAPH] → graphical symbol
b Gompertz growth model [REGR] → regression model
b goodness of fit (GOF) [MODEL] Statistic which measures how well a model fits the data in the training set, e.g. how well a regression model accounts for the variance of the response variable. A list of such statistics follows.
adjusted R² R² adjusted for degrees of freedom, so that it can be used for comparing models with different degrees of freedom:
R²adj = 1 − (RSS / dfE) / (TSS / dfT)
where dfE and dfT are the error and total degrees of freedom, respectively; RSS is the residual sum of squares and TSS is the total sum of squares.
alienation coefficient Square root of the coefficient of nondetermination, used mainly in psychology:
ac = √(1 − R²)
coefficient of determination : R²
coefficient of nondetermination Complementary quantity of the coefficient of determination:
cnd = 1 − R²
error mean square : residual mean square
error standard deviation : residual standard deviation
error sum of squares : residual sum of squares
error variance : residual mean square
expected squared error (ESE) : residual mean square
explained variance : model sum of squares
Jp statistic Function of the residual mean square s²:
Jp = (n + p) s² / n
where n is the total number of observations and p is the number of parameters in the model.
mean square error (MSE) : residual mean square
model sum of squares (MSS) (: explained variance) The sum of squared differences between the estimated responses and the average response:
MSS = Σi (ŷi − ȳ)²
This is the part of the variance explained by the regression model as opposed to the residual sum of squares.
multiple correlation coefficient A measure of linear association between the observed response variable and the predicted response variable, or equivalently between the observed response variable and the linear combination of the predictor variables in a linear regression model. This quantity squared, R2, is the most widely used goodness of fit criterion.
R² (: coefficient of determination) Squared multiple correlation coefficient, i.e. the proportion of the variance of the response explained by a model. It can be calculated from the model sum of squares MSS or from the residual sum of squares RSS:
R² = MSS / TSS = 1 − RSS / TSS
TSS is the total sum of squares around the mean. A value of 1 indicates perfect fit, a model with zero error term.
real error (RE) : residual standard deviation
residual mean square (RMS) (: expected squared error, mean square error, error mean square) Estimate of the error variance σ²:
s² = RSS / dfE
where dfE is the error degrees of freedom.
residual standard deviation (RSD) (: error standard deviation, real error, residual standard error, root mean square error) Estimate of the model error σ:
s = √(RSS / dfE)
where dfE is the error degrees of freedom.
residual standard error (RSE) : residual standard deviation
residual sum of squares (RSS) (: error sum of squares) Sum of squared differences between the observed and estimated response:
RSS = Σi (yi − ŷi)²
The least squares estimator minimizes this quantity.
residual variance : mean square error
root mean square error (RMSE) : residual standard deviation
Sp statistic Function of the residual mean square:
where dfE is the error degrees of freedom.
standard deviation of error of calculation (SDEC) Function of the residual sum of squares:
SDEC = √(RSS / n)
standard error : residual standard deviation
b goodness of prediction (GOP) [MODEL] Statistic to measure how well a model can be used to estimate future (test) data, e.g. how well a regression model estimates the response variable given a set of values for the predictor variables. These quantities are also used as model selection criteria. A list of such statistics follows.
Akaike's information criterion (AIC) Model selection criterion for choosing between models with different numbers of parameters. It is defined as
AICp = −2 Lp + 2p
where p is the number of parameters and Lp is the maximized log-likelihood. In a regression context the optimal complexity of the model is chosen by minimizing
AICp = s²p (n + p + 1) / (n − p − 1)
cross-validated R² Statistic similar to R², in which the residual sum of squares RSS is replaced by the predictive residual sum of squares PRESS:
R²cv = 1 − PRESS / TSS
generalized cross-validation (gcv) The ratio of the residual sum of squares to the squared residual degrees of freedom:
gcv = RSS / dfE²
Mallows' Cp statistic Model selection criterion defined as
Cp = RSSp / s² − (n − 2p)
where RSSp is the residual sum of squares of the biased regression model with p parameters, and s² is the residual mean square of the full least squares model. In OLS Cp = p; in biased regression Cp < p. If the model with p parameters is adequate, then E[RSSp] = (n − p) σ². Assuming that E(s²) = σ² is approximately true, then E(Cp) = p. Consequently a plot of Cp versus p will show the best models as points fairly close to the Cp = p line.
predictive residual sum of squares (PRESS) The sum of the squared cross-validated residuals, i.e. the sum of squared differences between the observed response yi and the estimated response ŷi\i obtained from a regression model calculated without the ith observation:
PRESS = Σi (yi − ŷi\i)²
This quantity is calculated in cross-validation. If the estimator is linear, the cross-validated residual can be calculated from the ordinary residual, using the diagonal of the hat matrix hii, as:
ei\i = ei / (1 − hii)
predictive squared error (PSE) The average of PRESS:
PSE = PRESS / n
standard deviation of error of prediction (SDEP) A function of the predictive residual sum of squares:
SDEP = √(PRESS / n)
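The sketch below, in Python/NumPy (an assumed choice), computes several of the fit and prediction statistics listed in the two entries above for an ordinary least squares model; the function name and the use of the hat-matrix shortcut for the leave-one-out residuals are illustrative.

import numpy as np

def fit_statistics(X, y):
    # R2, adjusted R2, RSD, PRESS, cross-validated R2 and SDEP for an OLS model;
    # X is the predictor matrix without the intercept column
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])          # add intercept
    H = A @ np.linalg.inv(A.T @ A) @ A.T          # hat (projection) matrix
    y_hat = H @ y
    e = y - y_hat                                 # ordinary residuals
    rss = e @ e
    tss = (y - y.mean()) @ (y - y.mean())
    df_err, df_tot = n - p - 1, n - 1
    r2 = 1 - rss / tss
    r2_adj = 1 - (rss / df_err) / (tss / df_tot)
    rsd = np.sqrt(rss / df_err)                   # residual standard deviation
    e_cv = e / (1 - np.diag(H))                   # cross-validated residuals (linear estimator shortcut)
    press = e_cv @ e_cv
    r2_cv = 1 - press / tss
    sdep = np.sqrt(press / n)
    return dict(R2=r2, R2_adj=r2_adj, RSD=rsd, PRESS=press, R2_cv=r2_cv, SDEP=sdep)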
b Gower coefficient [GEOM] → distance (o mixed type data)
b Gower linkage [CLUS] → hierarchical clustering (o agglomerative clustering)
b gradient optimization [OPTIM] Optimization which, in calculating the set of parameters that minimizes a function, requires the evaluation of the derivatives of the function as well as the function values themselves. The basic iteration in gradient optimization is:
pi+1 = pi − si Ai gi
where si is the step size and Ai gi defines the step direction, with gi being the gradient vector (first derivative) of f(pi), Ai being a positive definite matrix and i the iteration index. Gradient optimization methods differ from each other in the way Ai and si are chosen. Most of these methods are applicable only to quadratic functions. Methods belonging to this group of optimization techniques are: conjugate gradient optimization, Gauss-Newton optimization, Newton-Raphson optimization, steepest descent optimization, and variable metric optimization.
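A minimal sketch of the basic iteration above, in Python/NumPy (an assumed choice), taking Ai as the identity matrix (steepest descent) and a fixed step size; the quadratic test function is illustrative.

import numpy as np

def gradient_optimization(grad, p0, step=0.1, n_iter=200):
    # basic gradient iteration p_{i+1} = p_i - s_i * A_i * g_i with A_i = I (steepest descent)
    p = np.asarray(p0, dtype=float)
    for _ in range(n_iter):
        p = p - step * grad(p)
    return p

# illustrative quadratic function f(p) = (p1 - 1)^2 + 2*(p2 + 3)^2
grad = lambda p: np.array([2 * (p[0] - 1), 4 * (p[1] + 3)])
print(gradient_optimization(grad, [0.0, 0.0]))   # approaches the minimum at (1, -3)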
b gradient vector [OPTIM] Vector of first derivatives of a scalar valued function f(p) with respect to its parameters p = (p1, ..., pp):
g(p) = [∂f / ∂pj]        j = 1, p
At a minimum, maximum or saddle point of a function the gradient vector is zero. Moving along the gradient direction assures that the function value increases at the fastest rate. Similarly, moving along the negative gradient direction causes the fastest decrease in the function value. This property is used in gradient optimization.
b Graeco-Latin square design [EXDE] → design
b Gram-Schmidt transformation [ALGE] → orthogonal matrix transformation
b graph [MISC] → graph theory
b graph theory [MISC] Branch of mathematics dealing with graphs. A graph is a mathematical object defined as G = (V, E), where V is a set of nodes (vertices) and E is a set of edges (lines), representing the binary relationship between pairs of vertices. The degree of a vertex v is the number of edges connected to it. A walk in G is a sequence of vertices w = (v1, ..., vk) such that (vj, vj+1) ∈ E for j = 1, k − 1. A path is a walk without any repeated vertices. The length of a path (v1, ..., vk) is k − 1. A cycle is a path with no repeated vertices other than its first and last ones (v1 = vk).
[Figure: examples of graph types — graph, multigraph, digraph, tree, forest, network]
Several multivariate methods are based on graph theory to solve problems such as finding the shortest paths between objects, partitioning into isomorphic subgraphs, hierarchical clustering methods, classification tree methods, compression of bit images, and definition of topological indices (topological variables) in chemical problems dealing with molecules or molecular fragments. A list of graphs of practical interest follows.
o complete graph Graph in which all pairs of vertices are adjacent.
o connected graph Graph in which each pair of vertices is connected by a path.
o digraph (: directed graph) Graph G = (V, A) with directions assigned to its edges, where A is the set of ordered pairs of vertices called arcs, i.e. A ⊆ E. The degree of a vertex has two components: indegree, i.e. the number of arcs directed toward the vertex, and outdegree, i.e. the number of arcs departing from the vertex. A vertex with indegree equal to zero and outdegree equal to or more than one is called a root (source); a vertex with indegree equal to or more than one and outdegree equal to zero is called a leaf or terminal node. A graph containing edges without direction is called an undirected graph.
directed graph : digraph
o forest A set of disjoint trees F = {(V1, E1), ..., (Vk, Ek)}.
o minimal spanning tree (MST) Spanning tree in which the sum of the edge weights is minimal.
o multigraph Graph with repeated edges, i.e. with multiple links between at least one pair of vertices; for example, a representation of two-dimensional molecular structures.
o network Weighted digraph N = (V, A, s, t, w) including a source s ∈ V, a terminal t ∈ V, and weights w assigned to the edges.
o spanning tree Tree that is a subgraph of a connected graph and contains all of its vertices.
o tree Connected graph without cycles.
o weighted graph Graph G = (V, E) in which a weight wij ≥ 0 is assigned to each edge (vi, vj) ∈ E.
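A brief sketch, in Python (an assumed choice), of two of the concepts above: an undirected weighted graph stored as an adjacency dictionary and a minimal spanning tree grown by Prim's algorithm, one standard way of obtaining an MST; the example weights are illustrative.

import heapq

# undirected weighted graph G = (V, E) as an adjacency dictionary (illustrative weights)
graph = {
    "a": {"b": 2.0, "c": 5.0},
    "b": {"a": 2.0, "c": 1.0, "d": 4.0},
    "c": {"a": 5.0, "b": 1.0, "d": 3.0},
    "d": {"b": 4.0, "c": 3.0},
}

def minimal_spanning_tree(graph, start):
    # Prim's algorithm: repeatedly add the cheapest edge reaching a new vertex
    visited = {start}
    edges = [(w, start, v) for v, w in graph[start].items()]
    heapq.heapify(edges)
    mst = []
    while edges and len(visited) < len(graph):
        w, u, v = heapq.heappop(edges)
        if v in visited:
            continue
        visited.add(v)
        mst.append((u, v, w))
        for nxt, w2 in graph[v].items():
            if nxt not in visited:
                heapq.heappush(edges, (w2, v, nxt))
    return mst

print(minimal_spanning_tree(graph, "a"))   # [('a', 'b', 2.0), ('b', 'c', 1.0), ('c', 'd', 3.0)]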
b graph theoretical clustering [CLUS] → non-hierarchical clustering
graphical analysis [GRAPH]
Data analysis that, in contrast to numerical analysis, explores and summarizes data using graphics. There is an important distinction between analysis resulting in presentation graphics and exploratory graphical analysis. Exploration requires active human interaction; it is a dynamic process that is usually performed as a first, unprejudiced data analysis step before modeling. Besides the traditional two-dimensional graphics, interactive computer graphics offers powerful tools for exploratory analysis.
b graphical rotation [FACT] → factor rotation
b graphical symbol [GRAPH]
Graphical representation of a multidimensional object. Each object is displayed as a distinctive shape and their similarities can be explored by visual pattern recognition. A list of the most well known symbols follows.
Chernoff face The variables are coded into facial features. For example, the values of a variable associated with the mouth are represented by the curvature of the smile.
glyph
A circle of fixed radius with rays, corresponding to the variables, emanating from it. The position of the ray labels the variable, the length is proportional to the value of the corresponding variable.
profile symbol Each variable is associated with a position along a profile, and its value is represented by the height at that position.
star symbol A variation of the glyph. Each variable is represented as a ray emanating from the center with equal angles between adjacent rays. A polygon is formed by connecting the tick marks on the rays that indicate the value of the corresponding variables.
tree symbol Each variable is assigned to a branch of a stylized tree; its values are indicated by the length of the corresponding branch. The order of the branches is established on the basis of hierarchical clustering of the variables.
b graphics [GRAPH] Visual display of quantitative information for graphical analysis. It displays either the data themselves or some derived quantity. There are several types of graphics:
graphics for univariate data: bar plot, box plot, dot plot, frequency polygon, histogram, pictogram, pie chart, quantile plot, stem-and-leaf diagram;
graphics for bi- or trivariate data: quantile-quantile plot, response curve, scatter plot, triangular plot;
graphics for multivariate data: Andrews' Fourier type plot, biplot, dendrogram, draftsman's plot, graphical symbol, principal component plot;
graphics for checking assumptions and model adequacy: Coomans' plot, residual plot, normal residual plot;
graphics for time series: high-low graphics, periodogram, time series plot, z-chart;
graphics for decision making: cause-effect diagram, control charts, decision tree, digidot plot, half-normal plot, ridge trace, scree plot.
b gross-error sensitivity [ESTIM] → influence curve
b group average linkage [CLUS] → hierarchical clustering (o agglomerative clustering)
b group covariance matrix [DESC] → covariance matrix
b grouped data [PREP] → data
b growth model [REGR] → regression model
b Gupta's test [TEST] → hypothesis test

H

b H spread [DESC] → dispersion
b half-interquartile range [DESC] → dispersion
b half-normal plot [GRAPH] Graphical display of a factorial design in terms of ordered absolute values of the contrasts on normal probability paper.
b Hamann coefficient [GEOM] → distance (o binary data)
b Hamming coefficient [GEOM] → distance (o binary data)
b Hansen-Delattre clustering [CLUS] → non-hierarchical clustering (o optimization clustering)
b hard model [MODEL] → model (o soft model)
b harmonic curves plot [GRAPH] : Andrews' Fourier-type plot
b harmonic mean [DESC] → location
b Hartley design [EXDE] → design
b Hartley's test [TEST] → hypothesis test
b hat matrix [REGR] Symmetric, idempotent matrix H(n, n) of rank p (in the absence of predictor degeneracies) that projects the observed responses y into the predicted responses ŷ (also called projection matrix):
ŷ = H y          e = (I − H) y
If H depends only on X but not on y, ŷ is a linear estimator. The trace of the hat matrix gives the degrees of freedom of a regression model. The diagonal element hii can be interpreted as the Mahalanobis distance of the ith observation from the barycenter using (XᵀX)⁻¹, i.e. the inverse of the scatter matrix, as metric. The hat matrix is a very important quantity in influence analysis; it indicates the amount of leverage (or influence) exerted on ŷ by y. The range of its diagonal elements is 0 ≤ hii ≤ 1, and the average value of hii is p/n. If hii = 1 then ŷi = yi, ei = 0 and X\i, the predictor matrix without the ith observation, is singular. If an observation i is far from the data center, the corresponding hii is close to 1 and V(ei) is close to zero, i.e. this observation has a better fit than another one close to the data center. For example, in the case of simple linear least squares regression with intercept:
hst = 1/n + (xs − x̄)(xt − x̄) / Σi (xi − x̄)²        i, s, t = 1, n
In OLS:
H = X (XᵀX)⁻¹ Xᵀ          hst = xs (XᵀX)⁻¹ xtᵀ
and in ridge regression:
H = X (XᵀX + γI)⁻¹ Xᵀ          hst = xs (XᵀX + γI)⁻¹ xtᵀ
The hat matrix can be used to calculate cross-validated residuals from ordinary residuals in linear estimators:
ei\i = ei / (1 − hii)
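A short sketch computing the hat matrix, its diagonal (the leverages) and the cross-validated residuals for OLS, following the formulas above; Python/NumPy and the function name are assumptions.

import numpy as np

def hat_matrix_diagnostics(X, y):
    # OLS hat matrix H = X (X'X)^-1 X', leverages h_ii and cross-validated residuals e_i/(1 - h_ii);
    # X should already include the intercept column if one is wanted
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)                         # leverages, average value p/n
    e = y - H @ y                          # ordinary residuals
    e_cv = e / (1 - h)                     # cross-validated residuals
    return H, h, e_cv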
b hazard function [PROB] → random variable
b Hessenberg matrix [ALGE] → matrix
b Hessian matrix [OPTIM] Matrix H(p, p) of the second derivatives of a scalar valued function f(p):
H(p) = [∂²f / ∂pj ∂pk]        j, k = 1, p
where f is a real function of p parameters, with continuous second partial derivatives with respect to all of the parameters, and p is the p-dimensional parameter vector. This matrix is used, for example, in Newton-Raphson optimization.
b heteroscedastic [PROB] → homoscedastic
b Heywood case [FACT] → communality
b hidden layer [MISC] → neural network
b hierarchical model [MODEL] → model
b hierarchical clustering [CLUS]
Clustering method that produces a hierarchy of partitions of objects such that any cluster of a partition is fully included in one of the clusters of the consecutive partition. Such partitions are best represented by a dendrogram. This strategy is different from non-hierarchical clustering, which results in one single partition. There are two main hierarchical strategies: agglomerative clustering and divisive clustering.
agglomerative clustering Hierarchical clustering that starts with n objects (variables) in n separate clusters (leaves of the dendrogram) and after n - 1 fusion steps ends with all n objects (variables) in one single cluster (root of the dendrogram). In each step the number
of clusters is decreased by one, fusing the two closest clusters. The procedure starts with an n × n distance matrix which is reduced and updated in each fusion step. The various methods differ in the way in which the distances between the clusters are calculated. At each step two rows (and two columns) s and t corresponding to the two closest clusters, containing ns and nt objects (variables), respectively, are fused and are replaced by a new row (and column) l, corresponding to the resulting new cluster, containing nl objects (variables). An updating distance formula defines the elements of this new row (column) dli from the elements of the two old rows (columns) dsi and dti.
o average linkage (: group average linkage, unweighted average linkage)
Distance between two clusters is calculated as the average of the distances between all pairs of objects in opposite clusters. This method tends to produce small clusters of outliers, but does not deform the cluster space. The distance formula is:
dli = (ns dsi + nt dti) / (ns + nt)
o centroid linkage Each cluster is represented by its centroid; distance between two clusters is calculated as the distance between their centroids. This method does not distort the cluster space. The distance formula is:
dli = (ns dsi + nt dti) / (ns + nt) − ns nt dst / (ns + nt)²
o complete linkage (: furthest neighbor linkage, maximum linkage)
Distance between two clusters is calculated as the largest distance between two objects in opposite clusters. This method tends to produce well separated, small, compact spherical clusters. The cluster space is dilated. The distance formula is:
dli = 0.5 (dsi + dti) + 0.5 |dsi − dti| = max (dsi, dti)
o error sum of squares linkage : Ward linkage
o furthest neighbor linkage : complete linkage
o Gower linkage : median linkage
o group average linkage : average linkage
o Lance-Williams' flexible strategy
Most of the agglomerative methods can be unified by a single distance updating formula (a small implementation sketch is given at the end of this entry):
dli = αs dsi + αt dti + β dst + γ |dsi − dti|
where the parameters αs, αt, β and γ are the following:
Method        αs                  αt                  β                        γ
average       ns/(ns + nt)        nt/(ns + nt)        0.0                      0.0
centroid      ns/(ns + nt)        nt/(ns + nt)        −ns nt/(ns + nt)²        0.0
complete      0.5                 0.5                 0.0                      0.5
median        0.5                 0.5                 −0.25                    0.0
single        0.5                 0.5                 0.0                      −0.5
Ward          (ns + ni)/nsti      (nt + ni)/nsti      −ni/nsti                 0.0
w. average    0.5                 0.5                 0.0                      0.0
where nsti = ns + nt + ni and ni is the number of objects in cluster i.
o maximum linkage : complete linkage
o McQuitty's similarity analysis : weighted average linkage
o median linkage (: weighted centroid linkage, Gower linkage) Centroid linkage where the sizes of the clusters are assumed to be equal and the position of the new centroid is always between the two old centroids. This method preserves the importance of a small cluster when it is merged with a large cluster. The distance formula is:
dli = 0.5 dsi + 0.5 dti − 0.25 dst
o minimum linkage : single linkage
o nearest neighbor linkage : single linkage
o single linkage (: nearest neighbor linkage, minimum linkage) Distance between two clusters is calculated as the smallest distance between two objects in opposite clusters. This method tends to produce loosely bound large clusters with little internal cohesion. Linear, elongated clusters are formed as opposed to the more usual spherical clusters. This phenomenon is called chaining. The distance formula is:
dli = 0.5 (dsi + dti) − 0.5 |dsi − dti| = min (dsi, dti)
o sum of squares linkage : Ward linkage
o unweighted average linkage : average linkage
o Ward linkage (: error sum of squares linkage, sum of squares linkage) This method fuses the two clusters that result in the smallest increase in the total within-group error sum of squares. This quantity is defined as the sum of squared deviations of each object from the centroid of its own cluster. In contrast to the other methods, which use prior criteria, this method is based on a posterior fusion criterion. The distance formula is:
dli = [(ns + ni) dsi + (nt + ni) dti − ni dst] / (ns + nt + ni)
o weighted average linkage (: McQuitty's similarity analysis)
Average linkage where the sizes of the clusters are assumed to be equal. This method, similar to median linkage, weights small and large clusters equally. The distance formula is:
dli = 0.5 dsi + 0.5 dti
o weighted centroid linkage : median linkage
divisive clustering Hierarchical clustering that, as opposed to agglomerative clustering, starts with all n objects (variables) in one single cluster and ends with n clusters, each containing only one object (variable). There are two groups of divisive techniques: in monothetic divisive clustering a cluster division is based on a single variable, while in polythetic divisive clustering all variables participate in each division. o association analysis ( : Williams-Lambert clustering)
Monothetic divisive clustering method used especially with binary variables. In each step a variable is selected for each cluster and each cluster is divided into two new clusters according to the values of its selected variable. The goal is to find the variable in each cluster that least resembles the other variables based on the chi squared coefficient criterion. After each division step the selected variables are eliminated and new variables are selected in each cluster for further divisions. o Edwards and Cavalli-Sforza clustering
Divisive version of Ward clustering. The division of clusters is based on minimizing the total within-group sum of squares:
min Σg Eg = min Σg [ (1/ng) Σs,t d²st ]        g = 1, G;  s, t = 1, ng
A similar division criterion is to minimize the trace of the mean covariance matrix.
o linear discriminant hierarchical clustering (LDHC)
Divisive clustering where each binary split is performed by an iterative relocation procedure based on a linear discriminant classifier. The initial splitting is obtained either randomly or on the basis of the first principal component. In each iteration objects are relocated if they are misclassified according to the linear discriminant classifier. In the final dendrogram a discriminant function is associated with each node, which allows validation of the tree obtained. The significance of the clusters is
assessed on the basis of cross-validation. The stability function measures the ability of each cluster to retain its own objects, while the influence function indicates how each cluster can attract objects from other clusters. o MacNaughton-Smith clustering Divisive clustering where in each step each old cluster is divided into two new clusters according to the following rules. First, in each old cluster the object that is the furthest from all other objects in the cluster, i.e. has maximum sum of elements in the corresponding row of the distance matrix, is selected to form a new cluster. Next, each object in the old cluster is assigned to one of the two new clusters on the basis of its distance from the cluster centroids. These two steps are repeated until a complete division is obtained, i.e. each cluster contains one single object.
o Williams-Lambert clustering : association analysis
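The sketch referred to under Lance-Williams' flexible strategy above: a minimal agglomerative clustering loop driven by the updating formula, with only the single, complete and average linkage parameters included for brevity. Python/NumPy and all names are illustrative choices, not part of this entry.

import numpy as np

# Lance-Williams parameters (alpha_s, alpha_t, beta, gamma) for a few linkages
LW = {
    "single":   lambda ns, nt, ni: (0.5, 0.5, 0.0, -0.5),
    "complete": lambda ns, nt, ni: (0.5, 0.5, 0.0,  0.5),
    "average":  lambda ns, nt, ni: (ns / (ns + nt), nt / (ns + nt), 0.0, 0.0),
}

def agglomerative(D, method="average"):
    # fuse the two closest clusters until one remains; D is an n x n distance matrix;
    # returns the fusion history as (members of s, members of t, fusion distance)
    D = D.astype(float).copy()
    np.fill_diagonal(D, np.inf)
    clusters = {i: [i] for i in range(len(D))}
    history = []
    while len(clusters) > 1:
        keys = list(clusters)
        sub = D[np.ix_(keys, keys)]
        a, b = np.unravel_index(np.argmin(sub), sub.shape)
        s, t = keys[a], keys[b]
        history.append((clusters[s], clusters[t], D[s, t]))
        ns, nt = len(clusters[s]), len(clusters[t])
        for i in clusters:
            if i in (s, t):
                continue
            ni = len(clusters[i])
            a_s, a_t, beta, gamma = LW[method](ns, nt, ni)
            D[s, i] = D[i, s] = (a_s * D[s, i] + a_t * D[t, i]
                                 + beta * D[s, t] + gamma * abs(D[s, i] - D[t, i]))
        clusters[s] = clusters[s] + clusters[t]   # the new cluster l replaces row/column s
        del clusters[t]
        D[t, :] = D[:, t] = np.inf                # deactivate row/column t
    return history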
b hierarchical design [EXDE] → design
b hierarchical model [ANOVA] → analysis of variance
b high-low graphics [GRAPH] Graphics for displaying the highest and lowest values of a time series. The successive time intervals are represented on the horizontal axis, while on the vertical axis the highest and lowest values in the corresponding time interval are indicated by two points connected with a bar. Often the high values are joined by a line, and similarly, so are the low values.
[Figure: high-low graphics — high and low values plotted over successive time intervals I-VIII]
b hill climbing clustering [CLUS] → non-hierarchical clustering (o optimization clustering)
b hinge [DESC] → five number summary
b histogram [GRAPH]
Graphical summary of a univariate frequency distribution in the form of a discretized empirical density function. The range of the continuous variable is partitioned into several intervals (usually of equal length) and the counts (frequencies) of the observations in each interval are plotted as bar lengths. The counts may be expressed as absolute or relative frequencies. The relative height of the bars represents the relative density of observations in the interval. Histograms give visual information about asymmetry, kurtosis and outliers.
[Figure: histogram — frequency of the observations plotted against the intervals]
When the histogram is plotted on a square root vertical scale (which is an approximate variance stabilizing transformation) it is called a rootogram. A circular histogram, also called a polar-wedge diagram, is a similar graphical representation for angular variables. In contrast to the ordinary histogram, the range intervals of the variable are indicated by intervals on the circumference of a circle instead of bars erected on a horizontal line. b
b histogram classification [CLAS]
Nonparametric independence classification method that assumes independent predictor variables and estimates the class density functions by univariate histograms of the predictors constructed in each class. The class density function of class g at object x is:
fg(x) = Πj (ngj / ng)        j = 1, p
where ngj is the number of objects in class g that are in the same bin of the histogram as x (i.e. have a similar value in variable j), and ng is the total number of objects in class g. This procedure suffers from information loss resulting from the reduction of continuous variables into categorical ones and from ignoring the correlation among the predictors.
b histogram density estimator [ESTIM] → estimator (o density estimator)
b Hodges' test [TEST] → hypothesis test
b Hoke design [EXDE] → design
b Hollander's test [TEST] → hypothesis test
b homoscedastic [PROB] Property referring to distributions with equal variance. It is often used, for example, in a regression context where the error distribution is assumed to be homoscedastic, i.e. distributed with equal variance at each observation. Distributions with unequal variance are called heteroscedastic.
b Hopkins' statistic [CLUS] Measure of the clustering tendency of objects (variables) in the measurement space, defined as:
H = Σi ri / (Σi ri + Σi di)
where di is the distance from a randomly selected object (variable) i to the nearest object (variable), while ri is the distance from a randomly selected point (not necessarily an object location) in the measurement space to the nearest object (variable). Only 5-10% of the objects are randomly selected to calculate the above statistic. If the objects are uniformly distributed, i.e. there is no clustering tendency, the distances ri and di are of the same magnitude and H is around 0.5. In contrast, if there is strong clustering among the objects (variables), then di << ri and H becomes close to one.
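A minimal sketch of the statistic in Python/NumPy (an assumed choice), using Euclidean distances, a uniform sampling window spanned by the data ranges and a 5% sampling fraction as illustrative choices.

import numpy as np

def hopkins_statistic(X, fraction=0.05, rng=np.random.default_rng(0)):
    # Hopkins statistic H = sum(r) / (sum(r) + sum(d)) for a data matrix X
    n, p = X.shape
    m = max(1, int(fraction * n))
    idx = rng.choice(n, size=m, replace=False)
    # d_i: distance from a randomly selected object to its nearest other object
    d = []
    for i in idx:
        dist = np.sqrt(((X - X[i]) ** 2).sum(axis=1))
        dist[i] = np.inf
        d.append(dist.min())
    # r_i: distance from a random point in the measurement space to the nearest object
    lo, hi = X.min(axis=0), X.max(axis=0)
    points = rng.uniform(lo, hi, size=(m, p))
    r = [np.sqrt(((X - q) ** 2).sum(axis=1)).min() for q in points]
    return sum(r) / (sum(r) + sum(d))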
b Horn's method [FACT] → rank analysis
b Hotelling-Lawley trace test [FACT] → rank analysis
b Hotelling's test [TEST] → hypothesis test
b Householder transformation [ALGE] → orthogonal matrix transformation
b Huber's psi function [REGR] → indicator function
b hybrid design [EXDE] → design
b hypergeometric distribution [PROB] → distribution
b hyper-Graeco-Latin square design [EXDE] → design
b hyperplane [GEOM] → hyperspace
b hyperspace [GEOM] (: pattern space) Multidimensional space (higher than three dimensions) where multivariate objects (patterns) described by p variables (p > 3) are represented as points. A hyperplane in a p-dimensional hyperspace is the analog of a line in 2-dimensional space or a plane in 3-dimensional space. b hypothesis test [TEST] (: test) Statistical test based on a test statistic to verify a statistical hypothesis about a parameter, a distribution or a goodness of fit. Hypothesis testing is one of the most important fields in statistics. The list of the most common tests follows.
Abelson-Tukey's test Distribution-free test in analysis of variance based on a rank order statistic.
Ajne's test Nonparametric test for the uniformity of a circular distribution.
Ansari-Bradley’s test Distribution-free rank test for the equality of the scale parameters of two distributions that have the same shape but might differ in location and scale. Armitage’s test Chi squared test for the trend in a two-row contingency table. Barnard’s test Test to determine whether a data set can be considered as a random sample from a certain distribution or as a result of a certain stochastic process. The test is calculated from Monte Carlo simulations. Bartlett’s interaction test Test for a significant second order interaction in a 2 x 2 x 2 contingency table. Bartlett’s sphericity test Test to determine the number of significant principal components, i.e. the level of collinearity among variables in a data set. Bartlett’s test The most common test for the equality of variance of several samples drawn from a normal distribution. Behrens-Fisher’s test Test for the difference between the means of samples from normal distributions with unequal variances. Beran’s test Test for uniformity of a circular distribution. Special cases include Watson’s test, Ajne’s test and Rayleigh’s test. Box’s test Approximate test for the equality of variance of several populations.
Brunk’s test Nonparametric test, based on order statistics, to determine whether a random sample is drawn from a population of a certain distribution function. Capon’s test Distribution-free test for the equality of the scale parameters of two otherwise identical populations, based on rank order statistics.
chi squared test Common significance test, based on the chi squared statistic, used to test for: - the difference between observed and hypothetical variance in a normal sample; - the goodness of fit between observed and hypothetical frequencies. Cochran's Q test Test for comparing percentage results in matched samples. Cochran's test Test for equality of independent estimates of variance. Cox-Stuart's test Sign test to determine whether the distribution (location and scale) of a continuous variable changes with time. Cramer-von Mises' test Test for the difference between an observed and a hypothetical distribution function. D'Agostino's test Test of normality based on the ratio of two estimates of the standard deviation: an estimate using the order statistic and the mean square estimate. Daniel's test Distribution-free test for trend in a time series based on Spearman's ρ coefficient. David's test Distribution-free test to determine whether a continuous population has a specified probability density function. The test is based on dividing the range of the distribution into cells having equal areas above them and, after drawing a random sample, counting the number of empty cells. David-Barton's test Distribution-free test, similar to the Siegel-Tukey test, for equality of the scale parameters of two otherwise identical populations. Duncan's test Modification of the Newman-Keuls test. Dunnett's test Multiple comparison test to compare several treatments with one single control. Dunn's test Multiple comparison test based on the Bonferroni inequality.
Durbin-Watson's test Test for independent errors in the OLS regression model based on the first serial correlation of errors. Dwass-Steel's test Distribution-free multiple comparison test. exact chi squared test : Fisher-Yates test exceedance test Distribution-free test for equality of two populations based on the number of observations from one population exceeding an order statistic calculated from the other population. F test (: variance ratio test) Test based on the ratio of two independent statistics, usually quadratic estimates of the variance. It is commonly used in ANOVA to test for equality of means. Fisher-Yates's test (: exact chi squared test, Fisher's exact test) Test for independence of cell frequencies in a 2 x 2 contingency table. Friedman's test Nonparametric test in two-way ANOVA for row differences. Gabriel's test Extension of Scheffe's test for homogeneity of subsets of mean values in ANOVA. Gart's test Exact test for comparing proportions in matched samples. Gupta's test Distribution-free test for the symmetry of a continuous distribution around the median. Hartley's test Test for equality of variances of several normal populations based on the ratio of the largest and smallest sample variances.
Hodges’ test Bivariate analogue of the sign test. Hollander’s test Distribution-free test for the parallelism of two regression lines with equal number of observations. It is a special case of the signed rank test.
Hotelling's test Test for dependent correlation coefficients calculated from a single sample with three variables from a multivariate normal distribution. includance test Distribution-free test, similar to the exceedance test, for equality of two populations based on the number of observations from one population included between two order statistics calculated from the other population. K test Distribution-free test for a trend in a series. Kolmogorov-Smirnov's test Significance test for goodness of fit between a hypothetical and an observed distribution function and for equality of two observed distribution functions. Kruskal-Wallis' test Distribution-free test for equality of populations. It is a generalization of the Wilcoxon-Mann-Whitney test. Kuiper's test Test for goodness of fit, similar to the Kolmogorov-Smirnov test. L test Test for homogeneity of a set of sample variances based on likelihood ratios.
least significant difference test Multiple comparison test for comparing mean values in ANOM. Lehman's test Two sample, nonparametric test for equality of variances. M test : Mood's test Mann-Kendall's test Distribution-free test for trend in time series based on Kendall's τ.
Mann-Whitney's test : Wilcoxon-Mann-Whitney test McNemar's test Test for equality of binary responses in paired comparisons. median test Test based on a rank-order statistic for equality of two populations.
Mood-Brown's test Distribution-free test for the difference between populations based on the overall median calculated from the corresponding samples. Mood's test (: M test) Distribution-free test for equality of dispersion of two populations based on the ranks of the combined sample. Moses' test Distribution-free rank test for the equality of the scale parameters for two populations of identical shape. Newman-Keuls' test Multiple range comparison test in ANOVA. Pitman's test Distribution-free randomization test for equality of means of two or several samples. precedence test Distribution-free test, similar to the exceedance test, for equality of two populations. Quenouille's test Test for goodness of fit of an autoregressive model to a time series or to two time series. Rayleigh's test Test for uniformity of a circular distribution. Rosenbaum's test Nonparametric test for the equality of scale parameters of two populations with equal medians. Scheffe's test Multiple comparison test for equality of means in ANOVA. Shapiro-Wilk's test Test for normality based on the ratio of two variance estimates. Siegel-Tukey's test Distribution-free test for equality of the scale parameters of two otherwise identical populations. t test Common test for the difference between a sample mean and a normal population mean or between two sample means from normal distributions with equal variances.
Tukey's quick test Distribution-free test of equality of means in two samples based on the overlap of the sample values. Tukey's test Multiple comparison test of mean values from ANOVA based on the Studentized range. variance ratio test : F test Wald-Wolfowitz's test Distribution-free test for randomness in large samples based on serial covariance. Walsh's test Distribution-free test for symmetry of two populations based on the ranked differences of the two samples. Watson's test Goodness of fit test, similar to the Cramer-von Mises test. Westenberg's test Distribution-free test, similar to the Fisher-Yates test, for equality of two populations based on the rank order statistic. Wilcoxon's test 1. Distribution-free test for equality of location of two otherwise identical populations based on rank sums. 2. Distribution-free test for the difference between two treatments on matched samples based on ranking the absolute differences, and calculating the sum of the ranks associated with positive differences. Wilcoxon-Mann-Whitney's test (: Mann-Whitney test) Distribution-free test for equality of location parameters of two populations based on ranking the combined sample. Wilk's test Distribution-free test, similar to David's test, for equality of two continuously distributed populations based on the number of empty cells. b hypothesis testing [TEST] Together with parameter estimation, hypothesis testing is the most important field in statistics. It is performed to test whether a statistical hypothesis should be accepted or not. The most commonly tested hypotheses are about parameters (location,
scale) or shape of probability distributions, or about the goodness of fit between two populations (theoretical or empirical). The testing is based on a test statistic, that is a function of the sample observations. A test based on a rank order test statistic is called a rank test. A test based only on the sign of the test statistic, not on its magnitude, is called a sign test. A test that does not assume a certain distribution of the tested population is called a distribution-free test. The hypothesis tested is called the null hypothesis, denoted H0; the complement hypothesis, which is accepted in case the null hypothesis is rejected, is called the alternative hypothesis, denoted H1. The error made by rejecting a true null hypothesis is called α-error, or error of the first kind, or type I error, or rejection error. The error of accepting a false null hypothesis is called β-error, or error of the second kind, or type II error, or acceptance error.
              H0 accepted                                        H0 rejected
H0 true       no error                                           α error (error of the first kind, type I error)
H0 false      β error (error of the second kind, type II error)  no error
The probability α that a true null hypothesis will be rejected is called the size of a test; the probability 1 − β that a false null hypothesis will be rejected is called the power of a test. The most powerful test has the highest probability of rejecting a false null hypothesis. The rejection region, or critical region, is a region in the test statistic's distribution that calls for rejecting the null hypothesis if the value of the test statistic falls into that region. The value of a test statistic that separates the rejection region from the rest of the values is called the critical value. A hypothesis test in which the rejection region is located only on one end of the distribution of the test statistic is called a one-sided test, or a single tail test, or an asymmetric test. In contrast, when the rejection region contains areas at both ends of the distribution of the test statistic, it is called a two-sided test, or double tail test, or symmetric test. The probability of making a type I error (i.e. rejecting a true null hypothesis) is called the significance level of a test. The function of β vs. the parameter being tested is called the operating characteristic of a test. The function of 1 − β vs. the parameter being tested is called the power curve of a test. A test is called a biased test if it assigns a lower probability of rejecting the null hypothesis when the alternative hypothesis is true than when the null hypothesis is true. Such a test has a minimum value of the power curve at a parameter value other than that of the null hypothesis.
Performing a hypothesis test consists of five steps:
1. define the null hypothesis H0 and the alternative hypothesis H1;
2. choose a test statistic;
3. specify the significance level;
4. set up a decision rule based on the significance level and the distribution of the test statistic;
5. calculate the test statistic from the sample and make the decision on the basis of the decision rule.
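As a sketch of the five steps, the following compares two sample means with the two-sample t test from scipy (assumed to be available); the data and the 5% significance level are illustrative choices.

from scipy import stats

# 1. H0: the two population means are equal; H1: they differ (two-sided test)
sample_a = [4.1, 3.8, 4.4, 4.0, 4.3, 3.9]
sample_b = [4.6, 4.8, 4.5, 4.9, 4.4, 4.7]

# 2.-3. test statistic: two-sample t statistic; significance level alpha = 0.05
alpha = 0.05
t_statistic, p_value = stats.ttest_ind(sample_a, sample_b)   # assumes equal variances

# 4.-5. decision rule: reject H0 if the p-value falls below alpha
if p_value < alpha:
    print(f"t = {t_statistic:.2f}, p = {p_value:.4f}: reject H0")
else:
    print(f"t = {t_statistic:.2f}, p = {p_value:.4f}: do not reject H0")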
b idempotent matrix [ALGE] → matrix
b identity matrix [ALGE] → matrix
b ill-conditioned matrix [ALGE] → matrix condition
b image [MULT] → image analysis
b image analysis [MULT]
Analysis of images as opposed to objects, in which most results are images themselves and can be examined visually. An image is a set of objects arranged in space. Such objects are called pixels; their relative spatial position is indicated by at least two indices from which inter-object distances can be calculated. The number of pixels is a characteristic of an image. Each pixel ij is characterized by a univariate or multivariate measurement xij. An example of a univariate pixel is the grey level in a black and white image, while multivariate pixels are produced by NMR, SIMS, satellites, etc. Multivariate image analysis (MIA) is the application of multivariate data analysis techniques to extract structure, to classify, and to correlate images of multivariate pixels. PCA, PCR and PLS have been successfully used in multivariate image analysis.
b imbedded error (IE) [FACT] → error terms in factor analysis
b imbedded error function [FACT] → rank analysis
b inadmissible variable [PREP] → variable
b includance test [TEST] → hypothesis test
b incomplete block [EXDE] → blocking
b incomplete block design [EXDE] → design (o randomized block design)
b independence classification [CLAS] → classification
b independent, identically distributed (i.i.d.) [PROB] Property referring to uncorrelated distributions with equal variance. It is an often used requirement, for example, for error distributions in regression. It is a more relaxed requirement than the normality assumption.
b independent increment process [TIME] → stochastic process
b independent variable [PREP] → variable
b indicator function [REGR] Function ψ of the residuals ei and the residual standard deviation s designed such that it has zero correlation with the predictor variables:
ψ(ei / s)        i = 1, n
In OLS, for example, ψ(ei / s) = ei / s.
This function is used in robust regression to downweight large residuals. The most well known is Huber’s psi function.
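A short sketch of Huber's psi function mentioned above, in Python/NumPy (an assumed choice); the tuning constant c = 1.345 is a common choice, not one prescribed by this entry.

import numpy as np

def huber_psi(u, c=1.345):
    # Huber's psi: identity for small scaled residuals u = e_i / s,
    # constant (downweighted) beyond the tuning constant c
    return np.clip(u, -c, c)

scaled_residuals = np.array([-3.0, -0.5, 0.2, 1.0, 4.0])
print(huber_psi(scaled_residuals))   # [-1.345 -0.5    0.2    1.     1.345]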
b indicator function [FACT] → rank analysis
b indicator variable [PREP] → variable
b indirect calibration [REGR] → calibration
b inference [ESTIM] Translation of the estimated parameters from sample to population.
b inference [PROB] → population
b inference statistics [DESC] → statistics
b influence analysis [REGR] Collection of statistics to assess the influence (also called leverage) of observations on the regression model. An observation is called an influential observation if, compared to other observations, it has a relatively large impact on the estimated quantities such as the response, regression coefficients, standard errors, etc. The diagonals of the hat matrix hii and the Studentized residuals ti are simple statistics for assessing influence. Observation i is called a leverage point if its hat diagonal hii is relatively large.
One way to assess the influence of observation i is to examine the difference between an estimate calculated with and without observation i (the latter is indicated by \i). This row-wise deletion can be extended to deletion of a group of observations to examine the joint influence of several observations.
Andrews-Pregibon statistic (: AP statistic) Statistic to measure the joint influence of several observations:
AP = |Z\mᵀ Z\m| / |Zᵀ Z|
where \m indicates that the quantity was calculated after deleting a group of observations of size m (when m = 1 the \m notation is equal to \i) and Z is the predictor matrix X augmented with the response column y on the right.
AP statistic : Andrews-Pregibon statistic
Cook's influence statistic Statistic to measure the influence of observation i on the estimated regression coefficient vector b̂:
Di = (b̂ − b̂\i)ᵀ (XᵀX) (b̂ − b̂\i) / (p s²) = (ri² / p) · hii / (1 − hii)
where p is the number of predictors, hii is the hat diagonal, ri is the standardized residual and s is the residual standard deviation. This is an F-like statistic with p and n − p degrees of freedom, that gives the distance between the regression coefficient vectors calculated with and without the ith observation.
COVRATIO Statistic to measure the influence of observation i on the variance of the estimated regression coefficients:
COVRATIOi = 1 / { (1 − hii) [ (n − p − 1)/(n − p) + ti²/(n − p) ]ᵖ }
This statistic is the ratio of the generalized variance of b̂ calculated with and without the ith observation. If COVRATIOi = 1, observation i has no influence. Observation i significantly reduces the variance of b̂, i.e. i has high leverage, when
|COVRATIOi − 1| > 3p / n
COVRATIOi is large when hii is large and small when ti is large. These two effects may offset each other, therefore it is important to examine not only COVRATIOi but also hii and ti.
DFBETA Statistic to measure the influence of observation i on the estimated regression coefficients b̂:
DFBETAi = b̂ − b̂\i
The above measure, scaled and calculated for the jth estimated regression coefficient b̂j, is:
DFBETASij = (b̂j − b̂j\i) / (s\i √cjj)
where cjj is a diagonal element of (XᵀX)⁻¹. This statistic, which has a t-like distribution, is the difference between the jth regression coefficient calculated with and without the ith observation, scaled by the standard error of b̂j. Observation i has high leverage when
DFBETASij > 2 / √n
DFFIT Statistic for measuring the influence of observation i on the ith fitted value:
DFFITi = ŷi − ŷi\i
The above measure scaled is:
DFFITSi = (ŷi − ŷi\i) / (s\i √hii)
This statistic, which has a t-like distribution, is the difference between the predicted ith response calculated with and without the ith observation, scaled by the standard error of ŷi. Observation i has high leverage when
DFFITSi > 2 √(p / n)
This quantity is a function of ti, and is enlarged or diminished according to the magnitude of hii.
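A compact sketch computing hat diagonals, Cook's distance and DFFITS for an OLS fit along the lines of this entry; Python/NumPy is an assumed choice, and the deleted scale s\i is obtained from the standard deletion identity for linear least squares, which is an assumption here rather than a formula from this entry.

import numpy as np

def influence_measures(X, y):
    # hat diagonals, Cook's distance and DFFITS for OLS;
    # X already contains the intercept column if one is wanted
    n, q = X.shape                                    # q = number of fitted parameters
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y
    s2 = e @ e / (n - q)                              # residual mean square
    r = e / np.sqrt(s2 * (1 - h))                     # standardized residuals
    cooks_d = r ** 2 / q * h / (1 - h)
    # deleted scale s_{\i}: standard deletion identity for linear least squares
    s2_del = ((n - q) * s2 - e ** 2 / (1 - h)) / (n - q - 1)
    t = e / np.sqrt(s2_del * (1 - h))                 # Studentized (jackknifed) residuals
    dffits = t * np.sqrt(h / (1 - h))
    return h, cooks_d, dffits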
b influence curve (IC) [ESTIM] (: influence function) Function that describes the effect of an observation x on an estimator T given a distribution function F, i.e. it characterizes the stability of the estimator, denoted as IC(T, F, x). It is the first derivative of the estimator T calculated at point x with an underlying distribution function F. It formalizes the bias caused by the observation x taking values from minus to plus infinity. In the case of a sample of size n − 1 the empirical influence curve is a plot of Tn(x1, ..., xn−1, x) as a function of x. A translated and rescaled version of the empirical influence curve, where the ordinate is calculated as
SC(x) = n [Tn(x1, ..., xn−1, x) − Tn−1(x1, ..., xn−1)]
is called the sensitivity curve (SC). A measure of the worst influence that an outlier x can have on the value of an estimator T at a distribution function F, i.e. the upper bound of the bias of the estimator, is called the gross-error sensitivity. It is given as the maximum absolute value of the influence curve:
GES = max over x of |IC(T, F, x)|
A robust estimator has finite gross-error sensitivity, i.e. its influence curve is bounded. b
influence function [ESTIM]
: influence curve
b influential observation [REGR] → influence analysis
b information content [MISC] → information theory
b information matrix [DESC] → scatter matrix
b information theory [MISC] Theory, closely related to probability theory, for quantifying the information content of events, statements, observations, measurements. Obtaining information means reducing uncertainty. The information content of an event x is a continuous function of its probability P(x):
I(x) = −log2 P(x)
I(x) is measured in bits and I(x) ≥ 0. The information content is additive: in the case of independent events x and y, the joint information is
I(x, y) = I(x) + I(y)
The mutual information of two events x and y is defined as:
I(x : y) = log2 [P(x, y) / (P(x) P(y))]
If the two events are independent then I(x : y) = I(y : x) = 0, otherwise
I(x : y) = I(x) + I(y) − I(x, y)
The average information content of a set of events X, called the Shannon entropy or simply entropy, is defined as:
H(X) = − Σx P(x) log2 P(x) = Σx P(x) I(x)
In the case of a continuous event space the density function f(x) replaces P(x):
H(X) = − ∫ f(x) log2 f(x) dx
Given two event spaces X and Y, the joint entropy is:
H(X, Y) = − Σs P(x, y) log2 P(x, y)
where s ranges over the joint X × Y space. The two conditional entropies are:
H(X | Y) = − Σx Σy P(x, y) log2 P(x | y)
H(Y | X) = − Σx Σy P(x, y) log2 P(y | x)
The average mutual information H(X : Y) indicates the degree of relationship between the two sets. If the two sets are independent, then
H(X, Y) = H(X) + H(Y)    and    H(X : Y) = H(Y : X) = 0
If there is a well defined functional relationship between the two sets, then
H(X : Y) = H(Y : X) = H(X) = H(Y)
The following relationships hold between the above quantities:
H(X | Y) = H(X, Y) − H(Y)
H(Y | X) = H(X, Y) − H(X)
H(X : Y) = H(X) − H(X | Y) = H(Y) − H(Y | X) = H(X) + H(Y) − H(X, Y)
The entropy is used as an index of diversity, called the Shannon-Wiener index. A similar quantity is the Gini index of diversity, defined as:
G(X) = Σ P(xi) P(xj)        for all i ≠ j
H and G can be normalized by log2(n) and (1 − 1/n), respectively, to obtain quantities ranging between 0 and 1.
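A compact sketch, in Python/NumPy (an assumed choice), of the entropy, average mutual information and Gini diversity quantities defined above; the joint probability table is illustrative.

import numpy as np

def entropy(p):
    # Shannon entropy H(X) = -sum p(x) log2 p(x), in bits
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def mutual_information(joint):
    # average mutual information H(X:Y) = H(X) + H(Y) - H(X,Y) from a joint probability table P(x, y)
    joint = np.asarray(joint, dtype=float)
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint.ravel())

def gini_index(p):
    # Gini diversity index G(X) = sum_{i != j} p(x_i) p(x_j) = 1 - sum p(x)^2
    p = np.asarray(p, dtype=float)
    return 1.0 - (p ** 2).sum()

# illustrative joint distribution of two binary events
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
print(entropy(joint.sum(axis=1)), mutual_information(joint), gini_index(joint.sum(axis=1)))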
b input layer [MISC] → neural network
b interaction [EXDE] → factor
b interaction term [ANOVA] → term in ANOVA
b interactive computer graphics [GRAPH] Collection of techniques, based on high performance graphical terminals or workstations, for exploring the structure of multivariate data sets. Techniques used are:
animation Motion display of three-dimensional data. One of the variables is selected and assigned to the animation axis, perceived as being perpendicular to the screen. An imaginary threshold plane parallel to the screen is moved along the animation axis. All the points in front of the threshold plane, i.e. with less than the threshold value in the animation variable, are visible; the rest are masked by the plane. This single plane technique is called masking.
If there are two parallel planes moving along the animation axis and only the points between the two planes are visible, the animation is called slicing. A cinematic effect can be achieved by simultaneous and continuous movement of the two planes.
connecting plots Data of higher dimensions can be displayed by connecting several two- or three-dimensional plots having no common axes. An object is displayed as connected points on several plots. The connection between the same data points is provided by highlighting, coloring or brushing. Brushing is performed dynamically by selecting an area on the plot with an enlarged cursor (called a brush) and moving the cursor around. The cursor motion on one plot causes the corresponding area on the connected plots to be highlighted. three-dimensional motion graphics The three-dimensional effect is created by repeated small rotations applied to the data (with three variables), while the rotated data points are projected on the two-dimensional screen. If the computation and display is fast enough, it gives the illusion of continuous motion. zooming Enlarging a selected part of a plot in order to examine a point cloud exhibiting some interesting structure that is hard to see at the original scale.
b intercept [REGR] → regression coefficient
b interdecile range [DESC] → dispersion
b inter-group covariance matrix [DESC] → covariance matrix
b intermediate least squares regression (ILS) [REGR] → partial least squares regression
b internal failure cost [QUAL] → quality cost
b internally Studentized residual [REGR] → residual
b interquartile range [DESC] → dispersion
b interval estimator [ESTIM] → estimator
b interval scale [PREP] → scale
b intra-group covariance matrix [DESC] → covariance matrix
b intrinsically linear regression model [REGR] → regression model
b intrinsically nonlinear regression model [REGR] → regression model
b inverse calibration [REGR] → calibration
b inverse distribution function [PROB] → random variable
b inverse of a matrix [ALGE] → matrix operation
b inverse survival function [PROB] → random variable
b Ishikawa diagram [GRAPH] : cause-effect diagram
b Ishikawa's seven tools [QUAL] The following simple statistical techniques for finding all useful information in the pursuit of quality: check sheet, Pareto analysis, cause-effect diagram, histogram, graph, stratification, scatter plot.
b ISODATA [CLUS] → non-hierarchical clustering (o optimization clustering)
b item [PREP] : object
b iterative key set factor analysis (IKSFA) [FACT] → key set factor analysis
b iterative principal factor analysis [FACT] → principal factor analysis
b iteratively reweighted least squares regression (IRWLS) [REGR] → robust regression
J

b J-shaped distribution [PROB] → random variable
b Jp statistic [MODEL] → goodness of fit
b Jaccard coefficient [GEOM] → distance (o binary data)
b jackknife [MODEL] → model validation
b jackknifed residual [REGR] → residual
b Jacobi method [ALGE] → eigenanalysis
b Jacobian matrix [OPTIM] Matrix J(n, p) of the first derivatives of a vector valued function f(p):
J(p) = [∂fi / ∂pj]        i = 1, n;  j = 1, p
where f is a vector function of p parameters and p is the p-dimensional parameter vector. This matrix is used, for example, in Gauss-Newton optimization.
b Jancey clustering [CLUS] → non-hierarchical clustering (o optimization clustering)
b Jarvis-Patrick clustering [CLUS] → non-hierarchical clustering (o density clustering)
b jitter [GRAPH] → dot plot
b joint distribution [PROB] → random variable
b joint probability [PROB] → probability

K

b K-centers clustering [CLUS] → non-hierarchical clustering (o optimization clustering)
b K-means clustering [CLUS] → non-hierarchical clustering (o optimization clustering)
b K-medians clustering [CLUS] → non-hierarchical clustering (o optimization clustering)
b k nearest neighbors method (KNN) [CLAS] Nonparametric method widely used in both classification and cluster analysis. It searches for the K nearest neighbors of each object in the data set and estimates a property (e.g. class membership) of the object from the property of its neighbors. K is a prespecified parameter, usually ranging between 1 and 10, that must be optimized. The nearness of objects is evaluated by some preselected distance measure.
[Figure: 5-NN example — an unknown object (?) is classified as class X, the majority class among its five nearest neighbors]
KNN estimates the class boundaries and class density functions directly from the training set without calculating parametric classification models. Its classification rule assigns each object to the class corresponding to the majority of its K nearest neighbors. KNN makes no assumptions either about the distribution and the shape of the classes or about the class boundaries. This method is particularly appropriate when the classes are not linearly separable, have nonspherical shapes and when the pattern space is densely sampled (high observations/variables ratio). The KNN rule has great simplicity, wide applicability and usually good performance. The KNN error rate asymptotically approaches the Bayes' optimal error rate. KNN is not only a classification technique; it is also the basis of several clustering methods, e.g. Jarvis-Patrick and mode analysis. It is also used for filling in missing values and in library search.
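A minimal sketch of the KNN classification rule with Euclidean distance as the preselected measure; Python/NumPy and the tiny training set are illustrative assumptions.

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=5):
    # assign object x to the class held by the majority of its k nearest neighbors
    dist = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(dist)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_classify(X_train, y_train, np.array([0.95, 1.05]), k=3))   # -> "B"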
b K test [TEST] → hypothesis test
b K-Weber clustering [CLUS] → non-hierarchical clustering (o optimization clustering)
b Kalman filter [TIME] : state-space model
► Karhunen-Loeve expansion [FACT] : principal component analysis
► Karhunen-Loeve projection [FACT] → principal component analysis
► Kendall's τ coefficient [DESC] → correlation
► kernel [ESTIM] Function used in kernel density estimators that satisfies:

∫ K(x) dx = 1

where the integral is taken over the whole range of x, for univariate or multivariate x. Usually, the kernel is a symmetric, nonnegative density function. The most popular kernels are:

biweight kernel
K(x) = 15/16 (1 − x²)²    for |x| < 1
K(x) = 0                  for |x| ≥ 1

Epanechnikov kernel
K(x) = 0.75 (1 − 0.2 x²) / √5    for |x| < √5
K(x) = 0                          for |x| ≥ √5

Gaussian kernel (: normal kernel)
For univariate x:
K(x) = (2π)^(−1/2) exp(−x² / 2)
For multivariate x:
K(x) = (2π)^(−p/2) exp(−xᵀx / 2)

normal kernel : Gaussian kernel

rectangular kernel
K(x) = 0.5    for |x| < 1
K(x) = 0      for |x| ≥ 1

triangular kernel
K(x) = 1 − |x|    for |x| < 1
K(x) = 0          for |x| ≥ 1
► kernel classifier [CLAS] : potential function classifier
► kernel density estimator [ESTIM] → estimator (o density estimator)
► kernel smoother [REGR] → smoother
► key set factor analysis (KSFA) [FACT] Factor analysis method for finding the set of typical rows of a data matrix, called the key set, that are most orthogonal to each other, so they most completely describe the original variable space. The search for the best key set is based on a selection procedure that avoids an exhaustive search and an optimization criterion that describes the goodness of a selected key set. Various subsets of rows are collected in submatrices and their determinants are calculated. The absolute value of the determinant reflects the degree of orthogonality among a given set of rows. This criterion is maximized to obtain the best subset of rows. Iterative key set factor analysis (IKSFA) is a refinement of KSFA. The initial key set is found by an ordinary KSFA. To improve the key set, each row is replaced in an iterative manner and the replacement is retained if it improves the orthogonality of the set.
► knot location [REGR] → spline
► Knut-Vik square design [EXDE] → design
► Kolmogorov-Smirnov's test [TEST] → hypothesis test
► Kronecker δ [ALGE] Bivalued step function:

δij = 1    if i = j
δij = 0    if i ≠ j
► Kruskal-Wallis's test [TEST] → hypothesis test
► Kuiper's test [TEST] → hypothesis test
► Kulczinsky coefficient [GEOM] → distance (o binary data)
► Kulczinsky probabilistic coefficient [GEOM] → distance (o binary data)
► kurtosis [DESC] Measure of peakedness, indicating how concentrated the observations are about the mean, whether the distribution is peaked or flattened. The kurtosis of a variable j is defined as the ratio of the fourth central moment to the square of the second central moment:

kj = Σi (xij − x̄j)⁴ / [ sj⁴ (n − 1) ]

Typical k values are k = 3 for a normal distribution (mesokurtic), k = 1.8 for a continuous uniform distribution, k > 3 for a peaked, leptokurtic curve, k < 3 for a flat, platykurtic curve.
In order to center the k values about zero for a normal distribution, the kurtosis index is often defined as k' = k − 3. In methods with the assumption of normality, the kurtosis of the variables should be checked. In a p-dimensional space, the multivariate measure of kurtosis kp is defined from the fourth powers of the Mahalanobis distances di of the objects from the barycenter c = (x̄1, ..., x̄p), using the covariance matrix S as a metric:

di² = (xi − c)ᵀ S⁻¹ (xi − c)

For p = 1, kp = k.
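A minimal sketch of the univariate kurtosis and the centred index k' = k − 3 as defined in this entry, using the (n − 1) scaling shown above; the simulated data are illustrative.

import numpy as np

def kurtosis(x):
    # ratio of the fourth central moment to the squared variance, (n - 1) scaling
    n = len(x)
    xc = x - x.mean()
    s2 = xc @ xc / (n - 1)                      # second central moment (variance)
    return (xc ** 4).sum() / (s2 ** 2 * (n - 1))

x = np.random.default_rng(0).normal(size=10_000)
k = kurtosis(x)
print(k, k - 3)    # close to 3 and 0 for a normal sample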
L

► L cluster [CLUS] Cluster g is called an L cluster if for each of its objects i ∈ g any object t inside the cluster is closer than any other object s outside the cluster:

max[ dit | t ∈ g ] < min[ dis | s ∉ g ]

Cluster g is called an L* cluster if the largest intra-cluster distance is smaller than the smallest distance between any of its objects and an object belonging to another cluster:

max[ dit | i, t ∈ g ] < min[ dis | i ∈ g, s ∉ g ]
► L estimator [ESTIM] → estimator
► L test [TEST] → hypothesis test
► L1 estimator [ESTIM] → estimator
► L1 regression [REGR] → robust regression
► L2 estimator [ESTIM] → estimator
► L-R decomposition [ALGE] → matrix decomposition
► L-U decomposition [ALGE] → matrix decomposition
► Lagrange distance [GEOM] → distance (o quantitative data)
► lambda coefficient [GEOM] → distance (o binary data)
► Lance-Williams distance [GEOM] → distance (o quantitative data)
► Lance-Williams' flexible strategy [CLUS] → hierarchical clustering (o agglomerative clustering)
► Laplace distribution [PROB] → distribution
► latent class model [MULT] Latent variable model in which both manifest and latent variables are measured on a nominal scale. The assumption is that the association among categorical manifest variables is due to the mixing of heterogeneous groups, described by the categorical latent variable, and within a category the manifest variables are independent. The objective is to characterize the latent variable that explains the observed association. It is achieved by estimating the latent class parameters: the relative frequency distribution of the latent variable, i.e. the class size, and the relative frequency distribution of the manifest variables in each category, i.e. the conditional latent class probability. The parameter estimation is performed by maximum likelihood latent structure analysis. This method is closely related to maximum likelihood factor analysis, except that in the latter all the variables are assumed to be continuous.
► latent root [ALGE] → eigenanalysis
► latent root regression (LRR) [REGR] Modification of principal components regression on the basis of the eigenanalysis of R*, the augmented correlation matrix. R* is calculated from an augmented predictor matrix in which the first column is the autoscaled response and the following columns are the autoscaled predictors. The first element of each eigenvector measures the predictability of the response by that eigenvector. Linear combinations of the predictors are calculated with the eigenvectors, and the ones with the largest eigenvalues and with the largest first element in the eigenvector are selected for the regression model.
► latent variable [PREP] → variable
► latent variable model [MULT] Causal model in which the dependent variables are the manifest (measurable or observable) variables, and the independent variables are the latent (non-measurable, non-observable) variables. It is assumed that the manifest variables x have a joint probability distribution conditional on the latent variables y, denoted as Φ(x | y), and given the values of the latent variables the manifest variables are independent of one another:

Φ(x | y) = Φ1(x1 | y) Φ2(x2 | y) ... Φp(xp | y)

The observed interdependence among the manifest variables is due to their common dependence on the latent variables. Latent variable models are often described by path diagrams. Latent class model, PLS and LISREL are examples of latent variable models.
► latent variable plot [GRAPH] → scatter plot
► latent vector [ALGE] → eigenanalysis
► Latin square design [EXDE] → design
► lattice [EXDE] → design (o simplex lattice design)
► leader clustering [CLUS] → non-hierarchical clustering (o optimization clustering)
► leaf [MISC] → graph theory (o digraph)
► learning rate [MISC] → neural network
► learning set [PREP] → data set
► least absolute residual regression [REGR] → robust regression
► least absolute value estimator [ESTIM] → estimator
► least median squares regression (LMS) [REGR] → robust regression
► least significant difference test [TEST] → hypothesis test
► least squares estimator [ESTIM] → estimator
► least squares regression (LS) [REGR] : ordinary least squares regression
► least trimmed squares regression (LTS) [REGR] → robust regression
► leave-one-out cross-validation (LOO) [MODEL] → model validation (o cross-validation)
► Lehman's test [TEST] → hypothesis test
► leptokurtic [DESC] → kurtosis
► level [EXDE] → factor
► Levenberg-Marquardt optimization [OPTIM] → Gauss-Newton optimization
► leverage [REGR] → influence analysis
► leverage point [REGR] → influence analysis
► likelihood [PROB] The probability of obtaining a given data point expressed as a function of a parameter θ. The probability distribution function of continuous variates x1, ..., xp, dependent on parameters θ1, ..., θk, expressed as f(x1, ..., xp; θ1, ..., θk) and considered as a function of the θ parameters for fixed x, is called the likelihood function and is denoted as L(θ). The maximum value of the likelihood function is called maximum likelihood (ML). The maximum likelihood estimator calculates θ that maximizes the likelihood function. The logarithm of the likelihood function, called the log likelihood function ln[L(θ)], is often maximized. The likelihood ratio is used to test the null hypothesis that a certain population belongs to a subspace ω of the whole parameter space Ω. It is defined as:
λ = L(ω) / L(Ω)

where L(ω) denotes the maximum value of the likelihood function calculated only on the parameter subspace ω and L(Ω) denotes the maximum value of the likelihood function calculated on the whole parameter space Ω. The null hypothesis is accepted if λ is close to 1.
► likelihood function [PROB] → likelihood
► likelihood ratio [PROB] → likelihood
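As a generic illustration of maximizing a log likelihood (not an example from the handbook), the sketch below fits the mean and standard deviation of a normal sample numerically and compares them with the closed-form maximum likelihood estimates; scipy.optimize.minimize is assumed to be available.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=200)

def neg_log_likelihood(theta):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)          # keep sigma positive
    return 0.5 * np.sum(((x - mu) / sigma) ** 2) + len(x) * np.log(sigma)

res = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)               # numerical ML estimates
print(x.mean(), x.std())               # closed-form ML estimates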
► likelihood ratio criterion [FACT] → rank analysis
► limiting quality level [QUAL] → producer's risk
► linear algebra [ALGE] (: algebra, matrix algebra) A branch of mathematics dealing with two basic problems: linear equations and eigenanalysis. The building blocks of linear algebra are scalar, vector, matrix, and tensor. The most important techniques are: Gaussian elimination, matrix decomposition, matrix transformation, matrix operations, orthogonal matrix transformation, and orthogonal vector transformation. The software packages EISPACK and LINPACK contain excellent routines for solving these problems.
► linear dendrogram [GRAPH] → dendrogram
► linear discriminant analysis (LDA) [CLAS] → discriminant analysis
► linear discriminant classification tree (LDCT) [CLAS] Classification tree method that, similar to CART, constructs a binary decision tree as classification rule. The LDCT rule is obtained by calculating a linear discriminant function at each node to separate two classes. When there are more than two classes, the user must create the two classes by grouping classes together or by excluding some classes. Several trees can be obtained, depending on the class grouping at each node. The best classification tree is selected by means of cross-validation.
► linear discriminant function (LDF) [CLAS] (: discriminant function) Linear combination of the predictor variables that provides maximum separation of the classes in discriminant analysis. These linear combinations (their number is min[p, G]) are calculated to maximize the ratio of between-class covariances to within-class covariances subject to the constraint that they be uncorrelated. The linear coefficients, called discriminant weights, for class g are:

wg = S⁻¹ cg

and the constant is

w0g = −0.5 cgᵀ S⁻¹ cg

where S is the common class covariance matrix, and cg denotes the class centroids. The discriminant weights are eigenvectors of the matrix calculated as the ratio of the between- and within-class covariance matrices. The discriminant weights indicate the correlations between the corresponding predictor variable and the discriminant function, thereby measuring the classification power of the corresponding variables. The discriminant scores, i.e. the projections of the observations onto the new axes, are calculated as:

sgi = wgᵀ xi    i = 1, n    g = 1, G

and are plotted on the discriminant plot.
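A compact sketch of the discriminant weights, constant and scores defined above, assuming a pooled (common) class covariance matrix; the data, function name and class sizes are illustrative.

import numpy as np

def lda_scores(X, y):
    # pooled within-class covariance matrix S and class centroids cg
    classes = np.unique(y)
    centroids = {g: X[y == g].mean(axis=0) for g in classes}
    S = sum(np.cov(X[y == g], rowvar=False) * (sum(y == g) - 1) for g in classes)
    S /= (len(y) - len(classes))
    S_inv = np.linalg.inv(S)
    scores = {}
    for g in classes:
        w = S_inv @ centroids[g]                           # discriminant weights
        w0 = -0.5 * centroids[g] @ S_inv @ centroids[g]    # constant term
        scores[g] = X @ w + w0
    return scores

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
s = lda_scores(X, y)
pred = np.where(s[1] > s[0], 1, 0)    # assign to the class with the larger score
print((pred == y).mean())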
► linear discriminant hierarchical clustering (LDHC) [CLUS] → hierarchical clustering (o divisive clustering)
► linear equation [ALGE] Equation with p + 1 variables of the form

x1 a1 + x2 a2 + ... + xp ap = y

where a1, ..., ap are called coefficients of the linear equation. The standardized linear combination is a special case of a linear equation. Often there is a system of n connected linear equations:

xi1 a1 + xi2 a2 + ... + xip ap = yi    i = 1, n

that can also be written as

X a = y

The system is called over-determined if n > p and underdetermined if n < p. The solution has the following properties:
- if n = p and X is nonsingular, a unique solution is a = X⁻¹ y;
- the equation is consistent (i.e. admits at least one solution) if rank(X) = rank(X, y);
- for y = 0, there exists a nontrivial solution (i.e. a ≠ 0) if rank(X) < p;
- the equation XᵀX a = Xᵀy is always consistent.
Linear equations are most commonly solved by orthogonal matrix transformation or Gaussian elimination.
► linear estimator [ESTIM] → estimator
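A brief sketch of solving the system Xa = y from the linear equation entry above: the square nonsingular case, and the over-determined case via the always-consistent least squares (normal equation) solution; matrices and values are illustrative.

import numpy as np

# square, nonsingular case: unique solution a = X^-1 y
X = np.array([[2.0, 1.0], [1.0, 3.0]])
y = np.array([3.0, 5.0])
a = np.linalg.solve(X, y)                         # preferred over forming the inverse explicitly
print(a)

# over-determined case (n > p): solve the normal equations X'X a = X'y
Xo = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
yo = np.array([1.1, 1.9, 3.2, 3.9])
a_ls, *_ = np.linalg.lstsq(Xo, yo, rcond=None)    # least squares solution
print(a_ls)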
► linear learning machine (LLM) [CLAS] Binary classification method, similar to Fisher's discriminant analysis, that separates two classes in the p-dimensional measurement space by a (p − 1)-dimensional hyperplane. The iterative procedure starts with an arbitrary hyperplane, defined by a weight vector w orthogonal to the plane, through a specified origin w0. A training object x is classified by calculating a linear discriminant function

s = w0 + wᵀ x

which gives positive values for objects on one side and negative values for objects on the other side of the hyperplane. The position of the plane is changed by reflection on a misclassified observation as:

w0 (new) = w0 (old) + c

This method, which is rarely used nowadays, has many disadvantages: nonunique solution, slow or no convergence, too simple class boundary, unbounded misclassification risk.
► linear least squares regression [REGR] : ordinary least squares regression
► linear programming [OPTIM] → optimization
► linear regression model [REGR] → regression model
► linear search optimization [OPTIM] Direct search optimization for minimizing a function of p parameters f(p). The step taken in the ith iteration is

pi+1 = pi + si di

where si is the step size and di is the step direction. There are two basic types of linear search optimization. The first type uses a scheme for systematically reducing the length of the known interval that contains the optimal step size, based on a comparison of function values. Fibonacci search, golden section, and bisection belong to this group. The second type approximates the function f(p) around the minimum with a simpler function (e.g. second- or third-order polynomial) for which a minimum is easily obtained. These methods are known as quadratic, cubic, etc. interpolations.
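A small sketch of the first type of linear search, golden section search over a bracketing interval; the objective function and tolerance are illustrative assumptions.

import math

def golden_section(f, a, b, tol=1e-6):
    # shrink the bracketing interval [a, b] by the golden ratio until it is small
    invphi = (math.sqrt(5) - 1) / 2
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c
            c = b - invphi * (b - a)
        else:
            a, c = c, d
            d = a + invphi * (b - a)
    return (a + b) / 2

# illustrative objective: minimum at x = 2
print(golden_section(lambda x: (x - 2.0) ** 2 + 1.0, 0.0, 5.0))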
► linear structural relationship (LISREL) [MULT] Latent variable model solved by maximum likelihood estimation, implemented in a software package. The model consists of two parts: the measurement model specifies how the latent variables are related to the manifest variables, while the structural model specifies the relationship among latent variables. It is assumed that both manifest and latent variables are continuous with zero expected values, and the manifest variables have a joint normal distribution. The latent variables are of two types: endogenous (η) and exogenous (ξ), and are related by the linear structural model:
η = B η + Γ ξ + ζ

B and Γ are regression coefficients representing direct causal effects among η–η and η–ξ. The error term ζ is uncorrelated with ξ. There are two sets of manifest variables y and x corresponding to the two sets of latent variables. The linear measurement models are:

y = Λy η + ε

and

x = Λx ξ + δ

The error terms ε and δ are assumed to be uncorrelated with η, ξ, and ζ, but need not be uncorrelated among themselves. The parameters of LISREL can be collected in eight matrices:
- Λy: coefficients relating latent endogenous variables η to y;
- Λx: coefficients relating latent exogenous variables ξ to x;
- B: coefficients relating latent endogenous variables η to η;
- Γ: coefficients relating latent endogenous to latent exogenous variables, η to ξ;
- Φ: covariance matrix of latent exogenous variables ξ;
- Ψ: covariance matrix of residuals ζ;
- Θε: covariance matrix of errors ε in measuring y;
- Θδ: covariance matrix of errors δ in measuring x.
These parameters, which are estimated by the maximum likelihood method, are of three kinds:
- fixed parameters have preassigned specific values;
- constrained parameters have unknown values that are equal to the values of one or more other parameters;
- free parameters have unknown and unconstrained values to be estimated.
The parameters of LISREL are estimated by fitting the covariance matrix Σ implied by the model to the observed covariance matrix S, minimizing the likelihood function:

min [L] = min [ log|Σ| − log|S| + trace(S Σ⁻¹) − (p + r) ]
The initial estimates for the parameters are obtained on the basis of instrumental variables. The maximum likelihood factor analysis model is a subcase of LISREL, namely when there are no endogenous variables η and y. Another subcase is path analysis.
► linear transformation [PREP] → transformation
► linearly independent vectors [ALGE] → vector
► link function [REGR] → regression model (o generalized linear model)
► loading [FACT] : factor loading
► loading plot [GRAPH] → scatter plot
► locally weighted scatter plot smoother (LOWESS) [REGR] → robust regression
► location [DESC] (: central tendency) A single value that is the most central and typical for representing a set of observations or a distribution. The most commonly used location measures are:

arithmetic mean (: mean)

x̄j = Σi xij / n

barycenter : centroid

center : midrange

centroid (o barycenter)
A p-dimensional vector c of the arithmetic means of the p variables. It can be calculated for a group g from the ng objects belonging to that group, called the group centroid, class centroid or cluster centroid and denoted as cg.

geometric mean

Gj = [ Πi xij ]^(1/n) = (x1j x2j ... xnj)^(1/n)

log(Gj) = Σi log(xij) / n

When it exists, the geometric mean lies between the harmonic mean and the arithmetic mean: H ≤ G ≤ x̄.

harmonic mean

Hj = n / Σi (1 / xij)

mean : arithmetic mean

median
Robust measure of location that is equal to the fiftieth percentile Q(0.5) or, equivalently, the second quartile (Q2). If n is odd, then the median is the middle order statistic; if n is even, then the median is the average of the (n/2)th and the (n/2 + 1)th order statistics. In other words, the median is the middle measurement in a set of data, i.e. the value that divides the total frequency into two halves.

midrange (o center)

MDj = (Uj + Lj) / 2

where Uj and Lj are the maximum (upper) and minimum (lower) values, respectively.

mode
The most frequently occurring value in a set of observations. It can easily be obtained from both nominal and ordinal data. Occasionally, there is more than one mode; in that case, the distribution is called multimodal, as opposed to unimodal.

trimmed mean
Robust measure of location in which the sum is calculated after ordering the n observations, and excluding a fraction, m, containing the smallest and the largest values. For example, the 10%-trimmed mean of an ordered sample is:

x̄j = Σ xij / (n − 2m)    i = m + 1, n − m    m = 0.1 n

weighted mean

x̄j = Σi wi xij / Σi wi

where wi are observation weights.
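A short sketch computing several of the location measures listed above for a single variable, including the 10%-trimmed mean as defined in this entry; the data are illustrative.

import numpy as np

x = np.array([2.1, 2.4, 2.5, 2.7, 2.9, 3.0, 3.1, 3.4, 3.9, 12.0])   # one outlier

mean = x.mean()
median = np.median(x)
midrange = (x.max() + x.min()) / 2
geometric_mean = np.exp(np.log(x).mean())          # requires positive values
harmonic_mean = len(x) / (1 / x).sum()

m = int(0.1 * len(x))                               # 10%-trimmed mean
trimmed = np.sort(x)[m:len(x) - m].mean()

print(mean, median, midrange, geometric_mean, harmonic_mean, trimmed)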
► log likelihood function [PROB] → likelihood
► logarithmic scaling [PREP] → standardization
► logarithmic transformation [PREP] → transformation
► logistic distribution [PROB] → distribution
► logistic growth model [REGR] → regression model
► logistic regression model [REGR] → regression model
► logit transformation [PREP] → transformation
► lognormal distribution [PROB] → distribution
► loss function [MISC] → decision theory
► loss matrix [CLAS] Measure of loss associated with misclassification of objects. The values are arranged in a square matrix with off-diagonal elements Lgg' representing the loss associated with assigning an object to class g' when its actual class is g. The simplest loss matrix has zero for diagonal elements and one for all the off-diagonal elements. The loss matrix is often symmetric. The loss matrix is incorporated in the misclassification risk.
► lot [QUAL] (: batch) Collection of items produced under similar conditions, i.e. of homogeneous origin. The parameters of an item that jointly describe its fitness for use are called quality characteristics. These characteristics are usually measurable and are described by variables in the analysis. The limits between which the value of a quality characteristic must lie, if the item is to be accepted, are called tolerance limits. (Note the difference between tolerance and confidence limits.) Specification limits are permissible limits for the values of a quality characteristic. A quality characteristic that does not conform to specification is called a defect or nonconformity. An item in the lot that does not satisfy one or more of the specifications in quality characteristics, i.e. has one or more defects, is called a defective item or a defective unit or a nonconforming item. The average outgoing quality (AOQ) is the average defective fraction in the outgoing lot:

AOQ = Pa Pd (N − n) / N
where N is the lot size, n is the sample size, Pd is the defective fraction, and Pa is the probability of acceptance. The worst possible value of AOQ is called the average outgoing quality limit (AOQL). The average outgoing quality curve is a plot of AOQ as a function of the defective fraction in the incoming lot. Average sample number (ASN) is the average number of items in a sample from a lot that is a function of the defective fraction in the lot. It is customary to plot the average sample number against the defective lot fraction. Acceptance number is the maximum number of defective items (relative to the sample size) that still allows the acceptance of the inspected lot. Similarly, the minimum number of defective items that leads to the rejection of the lot is called the rejection number. The plot of the sample number against the acceptance number or the rejection number is called the acceptance line or rejection line, respectively. Lot sentencing is a decision made regarding lot disposition: usually either the acceptance or rejection of the lot. There are three approaches: accepting without inspection, 100% inspection, and acceptance sampling. Rectifying inspection is a sampling plan in which the inspection activity affects the average outgoing quality. The defective items are removed or replaced by conforming items, considerably improving the quality of the lot, so the lot must be re-examined.
► lot sentencing [QUAL] → lot
► lot tolerance percent defective (LTPD) [QUAL] → producer's risk
► lot-plot method [QUAL] → acceptance sampling
► lower control limit (LCL) [QUAL] → control chart
► lower quartile [DESC] → quantile
► lurking variable [PREP] → variable

M

► M estimator [ESTIM] → estimator
► M test [TEST] → hypothesis test
► machine learning [MISC] → artificial intelligence
► MacNaughton-Smith clustering [CLUS] → hierarchical clustering (o divisive clustering)
► MacQueen clustering [CLUS] → non-hierarchical clustering (o optimization clustering)
► Mahalanobis distance [GEOM] → distance (o quantitative data)
► Mahalanobis-like distance [GEOM] → distance (o ranked data)
► main effect term [ANOVA] → term in ANOVA
► Malinowski's indicator function (MIF) [FACT] → rank analysis
► Mallows' Cp [MODEL] → goodness of prediction
► Manhattan distance [GEOM] → distance (o quantitative data)
► manifest variable [PREP] → variable
► Mann-Kendall's test [TEST] → hypothesis test
► Mann-Whitney's test [TEST] → hypothesis test
► map [GRAPH] → scatter plot
► mapping [MULT] → data reduction
► marginal distribution [PROB] → random variable
► marginal frequency [DESC] → frequency
► Markov chain [TIME] → stochastic process (o Markov process)
► Markov process [TIME] → stochastic process
► masking [GRAPH] → interactive computer graphics (o animation)
► MASLOC clustering [CLUS] → non-hierarchical clustering (o optimization clustering)
► matrix [ALGE] Rectangular array of numbers arranged in rows and columns. All rows are of equal length, as are all columns. The smaller of the number of rows and the number of columns is called the order of a matrix. The usual notation for a matrix is:

X = [xij]    i = 1, n    j = 1, p

where xij denotes the element of matrix X in row i and column j, n denotes the number of rows, and p the number of columns. The element x11 is called the leading element. An element of a matrix where i = j is called a diagonal element, while an element where i ≠ j is called an off-diagonal element. An element where the column index is one larger than the row index, i.e. j = i + 1, is called a superdiagonal element, while an element with indices satisfying i = j + 1 is called a subdiagonal element.
[Figure: schematic of an n × p matrix indicating rows, columns, and the diagonal, superdiagonal and subdiagonal elements.]
Matrix condition, matrix rank and matrix norm are important characteristics of a matrix. A list of special matrices with practical importance follows.
asymmetric matrix
Square matrix in which, in contrast to a symmetric matrix, row and column indices are not interchangeable:

xij ≠ xji

bidiagonal matrix
Square matrix in which only diagonal elements and elements on the line immediately above or below the diagonal (superdiagonal or subdiagonal) are nonzero. In an upper bidiagonal matrix:

xij = 0    if i > j or i < j − 1

In a lower bidiagonal matrix:

xij = 0    if i < j or i > j + 1
diagonal matrix
Square matrix in which all off-diagonal elements are zero:

xij = 0    if i ≠ j

Hessenberg matrix
Square matrix in which elements are zero either below the subdiagonal or above the superdiagonal. In a lower Hessenberg matrix:

xij = 0    if i < j − 1

In an upper Hessenberg matrix:

xij = 0    if i > j + 1

idempotent matrix
Square matrix X in which

X² = X

For example, the hat matrix is an idempotent matrix.

identity matrix
Diagonal matrix, often denoted as I, in which all diagonal elements are 1:

xij = 1    if i = j    and    xij = 0    if i ≠ j

nilpotent matrix
Square matrix X in which

Xʳ = 0    for some r

nonsingular matrix
Square matrix in which, in contrast to a singular matrix, the determinant is not zero. Only nonsingular matrices can be inverted.

null matrix (: zero matrix)
Square matrix in which all elements are zero:

xij = 0    for all i and j

orthogonal matrix
Square matrix in which

XᵀX = XXᵀ = I

Examples are: Givens transformation matrix, Householder transformation matrix, and matrix of eigenvectors.
positive definite matrix
Square matrix in which

zᵀXz > 0    for all z ≠ 0

positive matrix
Square matrix in which all elements are positive:

xij > 0    for all i and j

positive semi-definite matrix
Square matrix in which

zᵀXz ≥ 0    for all z ≠ 0

singular matrix
Square matrix in which the determinant is zero. In a singular matrix, at least two rows or columns are linearly dependent.

square matrix
Matrix in which the number of rows equals the number of columns: n = p

symmetric matrix
Square matrix in which row and column indices are interchangeable:

xij = xji

triangular matrix
Square matrix in which all elements either below or above the diagonal are zero. In an upper triangular matrix

xij = 0    if i > j

In a lower triangular matrix

xij = 0    if i < j

tridiagonal matrix
Square matrix in which only diagonal elements and elements adjacent to the diagonals (superdiagonals and subdiagonals) are nonzero:

xij = 0    if |i − j| > 1

This matrix is both an upper and a lower Hessenberg matrix.
unit matrix
Square matrix in which all elements are one:

xij = 1    for all i and j

zero matrix : null matrix
► matrix algebra [ALGE] : linear algebra
► matrix condition [ALGE] (: condition of a matrix) Characteristic of a matrix reflecting the sensitivity of the quantities computed from the matrix to small changes in the elements of the matrix. A matrix is called an ill-conditioned matrix, with respect to a problem, if the computed quantities are very sensitive to small changes like numerical precision error. If this is not the case, the matrix is called a well-conditioned matrix. There are various measures of the matrix condition. The most common one is the condition number:

cond(X) = ||X|| ||X⁻¹||

When ||·|| indicates the two-norm, the condition number is the ratio of the largest to the smallest nonzero eigenvalue of the matrix. A large condition number indicates an ill-conditioned matrix.
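A quick numerical check of the condition number defined above using NumPy's standard routines; the two test matrices are illustrative.

import numpy as np

well = np.array([[2.0, 0.1], [0.1, 3.0]])
ill = np.array([[1.0, 1.0], [1.0, 1.0000001]])    # nearly singular

for X in (well, ill):
    c = np.linalg.cond(X, 2)                       # ratio of extreme singular values
    print(c, np.linalg.norm(X, 2) * np.linalg.norm(np.linalg.inv(X), 2))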
► matrix decomposition [ALGE] The expression of a matrix as a product of two or three other matrices that have more structure than the original matrix. Rank-one-update is often used to make the calculation more efficient. A list of the most important methods follows.

bidiagonalization

B = U X Vᵀ

where B is an upper bidiagonal matrix, and U and V are orthogonal Householder matrices. This decomposition is accomplished by a sequence of Householder transformations. First the subdiagonal elements of the first column of X are set to zero by U1, next the p − 2 elements above the superdiagonal of the first row are set to zero by V1, resulting in U1 X V1. Similar Householder operations are applied to the rest of the columns j via Uj and to the rest of the rows i via Vi:

U = (Up Up−1 ... U1)    and    V = (Vp Vp−1 ... V1)

Bidiagonalization is used, for example, as a first step in singular value decomposition.
Cholesky factorization

X = L Lᵀ

where X is a symmetric, positive definite matrix, and L is a lower triangular matrix. The Cholesky decomposition is widely used in solving linear equation systems of the form X b = y.

diagonalization : singular value decomposition

eigendecomposition : spectral decomposition

L-R decomposition : L-U decomposition

L-U decomposition (: L-R decomposition, triangular factorization)

X = L U

where L is a lower triangular matrix and U is an upper triangular matrix. The most common method to calculate the L-U decomposition is Gaussian elimination.

Q-R decomposition

X = Q R

where Q is an orthogonal matrix and R is an upper triangular matrix. This decomposition is equivalent to an orthogonal transformation of X into an upper triangular form, used in solving linear equations and in eigenanalysis.

singular value decomposition (SVD) (: diagonalization)

X = U Λ Vᵀ

where U and V are orthogonal matrices, and Λ is a diagonal matrix. The values of the diagonal elements in Λ are called singular values of X, and the columns of U and V are called left and right singular vectors, respectively. Λ is calculated by first transforming X into a bidiagonal matrix and then eliminating the superdiagonal elements. When n = p and X is a symmetric positive definite matrix, the singular values coincide with the eigenvalues of X. Furthermore U = V, which contains the eigenvectors of X.

spectral decomposition (: eigendecomposition)

X = V Λ Vᵀ

where X is a symmetric matrix, Λ is a diagonal matrix of the eigenvalues of X, and V is an orthogonal matrix containing the eigenvectors.
triangular factorization : L-U decomposition

tridiagonalization

T = U X Uᵀ

where X is a symmetric matrix, T is a tridiagonal matrix, and U is an orthogonal matrix. This decomposition, similar to bidiagonalization, is achieved by a sequence of Householder transformations: U = (U1 ... Up−1). It is used mainly in eigenanalysis.
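A brief numerical sketch verifying several of the decompositions above with standard NumPy routines on an illustrative symmetric positive definite matrix.

import numpy as np

A = np.array([[4.0, 2.0, 0.6], [2.0, 3.0, 0.4], [0.6, 0.4, 2.0]])   # symmetric positive definite

L = np.linalg.cholesky(A)             # Cholesky factorization: A = L L'
Q, R = np.linalg.qr(A)                # Q-R decomposition: A = Q R
U, s, Vt = np.linalg.svd(A)           # singular value decomposition: A = U diag(s) V'
evals, V = np.linalg.eigh(A)          # spectral decomposition: A = V diag(evals) V'

print(np.allclose(L @ L.T, A))
print(np.allclose(Q @ R, A))
print(np.allclose(U @ np.diag(s) @ Vt, A))
print(np.allclose(V @ np.diag(evals) @ V.T, A))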
► matrix norm [ALGE] → norm
► matrix operation [ALGE] The following are the most common operations on one single matrix X, or on two matrices X and Y with matching orders.

addition of matrices

Z(n, p) = X(n, p) + Y(n, p)    zij = xij + yij

Addition can be performed only if X and Y have the same dimensions.

determinant of a matrix
Scalar defined for a square matrix X(p, p) as:

|X| = Σα |α| x1α(1) x2α(2) ... xpα(p)

where the summation is taken over all permutations α of (1, ..., p), and |α| equals +1 or −1, depending on whether α can be written as the product of an even or odd number of transpositions. For example, for p = 2

|X| = x11 x22 − x12 x21

The determinant of a diagonal matrix X is equal to the product of the diagonal elements:

|X| = Πi xii

If |X| ≠ 0, then X is a nonsingular matrix. Important properties of the determinant are:
a) |X Y| = |X| |Y|
b) |Xᵀ| = |X|
c) |c X| = c^p |X|
inverse of a matrix
The inverse of a square matrix X is the unique matrix X⁻¹ satisfying:

X X⁻¹ = X⁻¹ X = I

The inverse exists only if X is a nonsingular matrix, i.e. if |X| ≠ 0. The most efficient way to calculate the inverse is via Cholesky decomposition. The matrix X⁻ is called the generalized inverse matrix of a nonsquare matrix X if it satisfies:

X X⁻ X = X

The generalized inverse always exists, although usually it is not unique. If the following three equations are also satisfied, it is called a Moore-Penrose inverse (or pseudo inverse) and denoted as X⁺:

X⁺ X X⁺ = X⁺    (X X⁺)ᵀ = X X⁺    (X⁺ X)ᵀ = X⁺ X

For example, the Moore-Penrose inverse matrix is calculated in ordinary least squares regression as:

X⁺ = (XᵀX)⁻¹ Xᵀ

The generalized inverse is best obtained by first performing a singular value decomposition of X:

X = U Λ Vᵀ    then    X⁺ = V Λ⁺ Uᵀ

where Λ⁺ contains the reciprocals of the nonzero singular values.
multiplication of matrices

Z(n, q) = X(n, p) Y(p, q)    zij = Σk xik ykj

Multiplication can be performed only if the number of columns in X equals the number of rows in Y.
partitioning of a matrix
scalar multiplication of a matrix

Z(n, p) = c X(n, p)    zij = c xij

subtraction of matrices

Z(n, p) = X(n, p) − Y(n, p)    zij = xij − yij

Subtraction can be performed only if X and Y have the same dimensions.

trace of a matrix
The sum of the diagonal elements of X, denoted as tr(X):

tr(X) = Σi xii

transpose of a matrix
Interchanging row and column indices of a matrix:

Z(p, n) = Xᵀ(n, p)    zji = xij
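A brief numerical illustration of the operations above, including the Moore-Penrose inverse obtained both from the OLS formula and from the singular value decomposition; the matrices are illustrative.

import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 7.0]])    # n = 3, p = 2
Y = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

print(X + Y, X - Y, 2.0 * X)                  # addition, subtraction, scalar multiplication
print(X.T @ X, np.trace(X.T @ X))             # multiplication, trace

# Moore-Penrose inverse: OLS formula and SVD route give the same result
X_plus_ols = np.linalg.inv(X.T @ X) @ X.T
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_plus_svd = Vt.T @ np.diag(1.0 / s) @ U.T
print(np.allclose(X_plus_ols, X_plus_svd), np.allclose(np.linalg.pinv(X), X_plus_svd))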
► matrix plot [GRAPH] → draftsman's plot
► matrix rank [ALGE] (: rank of a matrix) The maximum number of linearly independent vectors (rows or columns) in a matrix X(n, p), denoted as r(X). Consequently, linearly dependent rows or columns reduce the rank of a matrix. A matrix X is called a rank deficient matrix when r(X) < p. The rank has the following properties:
- 0 ≤ r(X) ≤ min(n, p)
- r(X) = r(Xᵀ)
- r(X + Z) ≤ r(X) + r(Z)
- r(X Z) ≤ min[r(X), r(Z)]
- r(XᵀX) = r(X Xᵀ) = r(X)
► matrix transformation [ALGE] Numerical method that transforms a real matrix X(n, p) or a real symmetric matrix X(p, p) into some desirable form by pre- or post-multiplying it by a chosen set of nonsingular matrices. The method is called orthogonal matrix transformation when the multiplier matrices are orthogonal and an orthogonal basis for the column space of X is calculated; otherwise the transformation is nonorthogonal, and is usually performed by elimination methods. Matrix transformations are the basis of several numerical algorithms for solving linear equations and for eigenanalysis.
► maximum likelihood (ML) [PROB] → likelihood
► maximum likelihood clustering [CLUS] Clustering method that estimates the partition on the basis of the maximum likelihood. Given a data matrix X of n objects described by p variables, a parameter vector θ = (π1 ... πG; μ1 ... μG; Σ1 ... ΣG) and another parameter vector γ = (n1 ... nG), the likelihood is:

L(θ, γ) = Πg Πi∈sg πg φ(xi; μg, Σg)

where sg is the set of objects xi belonging to the gth group and ng is the number of objects in sg. Parameters πg, μg, and Σg are the prior cluster probabilities, the cluster centroids, and the within-cluster covariance matrices, respectively. The ML estimates of these quantities are:

πg:  Pg = ng / n

μg:  cg = Σi∈sg xi / ng

Σg:  Sg = Σi∈sg (xi − cg)ᵀ(xi − cg) / ng    g = 1, G
► maximum likelihood estimator [ESTIM] → estimator
► maximum likelihood factor analysis [FACT] Factor extraction method based on maximizing the likelihood function for the normal distribution

L(X | μ, Σ)

The population covariance matrix Σ can be expressed in terms of common and unique factors as

Σ = Λ Λᵀ + U
and fitted to the sample covariance matrix S. The assumption is that both common factors and unique factors are independent and normally distributed with zero means and unit variances; consequently the variables are drawn from a multivariate normal distribution. The common factors are assumed to be orthogonal, and their number M is prespecified. The factor loadings Λ and the unique factors U are calculated in an iterative procedure based on maximizing the likelihood function, which is equal to minimizing:

min { tr[Σ⁻¹S] − ln|Σ⁻¹S| − p }

The minimum is found by nonlinear optimization. Convergence is not always obtained, the optimization algorithm often ends in a local minimum, and sometimes one must face the Heywood case problem. The solution is not unique; two solutions can differ by a rotation.
► maximum likelihood latent structure analysis [MULT] → latent class model
► maximum linkage [CLUS] → hierarchical clustering (o agglomerative clustering)
► maximum scaling [PREP] → standardization
► maxplane rotation [FACT] → factor rotation
► McCulloh-Meeter plot [GRAPH] → scatter plot (o residual plot)
► McNemar's test [TEST] → hypothesis test
► McQuitty's similarity analysis [CLUS] → hierarchical clustering (o agglomerative clustering)
► mean [DESC] → location
► mean absolute deviation (MAD) [DESC] → dispersion
► mean character difference [GEOM] → distance (o quantitative data)
► mean deviation [DESC] → dispersion
► mean function [TIME] → autocovariance function
► mean group covariance matrix [DESC] → covariance matrix
► mean square error (MSE) [ESTIM] Mean squared difference between a true value and the estimated value of a parameter. It has two components: variance and squared bias.

MSE(θ̂) = E[θ̂ − θ]² = V[θ̂] + B²[θ̂] = E[θ̂ − E[θ̂]]² + (E[θ̂] − θ)²

Estimators that calculate estimates with zero bias are called unbiased estimators. Biased estimators increase the bias and decrease the variance component of MSE, trying to find its minimum at optimal complexity.
[Figure: bias, variance and MSE plotted against model complexity; the MSE is minimal at the optimal complexity.]
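A small simulation sketch of the decomposition above: the bias and variance of a deliberately biased (shrunken) mean estimator are estimated over repeated samples and shown to add up to the directly estimated MSE; entirely illustrative.

import numpy as np

rng = np.random.default_rng(0)
theta = 2.0                                   # true parameter (a mean)
estimates = []
for _ in range(5000):
    x = rng.normal(theta, 1.0, size=20)
    estimates.append(0.8 * x.mean())          # deliberately biased (shrunken) estimator
est = np.array(estimates)

mse = np.mean((est - theta) ** 2)
variance = est.var()
bias2 = (est.mean() - theta) ** 2
print(mse, variance + bias2)                  # the two agree: MSE = V + B^2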
► mean square error (MSE) [MODEL] → goodness of fit
► mean squares in ANOVA (MS) [ANOVA] Column in the analysis of variance table containing the ratio of the sum of squares to the number of degrees of freedom of the corresponding term. It estimates the variance component of the term and is used to test the significance of the term. The mean square of the error term is an unbiased estimate of the error variance.
► mean trigonometric deviation [DESC] → dispersion
► measure of distortion [CLUS] → assessment of clustering
► median [DESC] → location
► median absolute deviation around the median (MADM) [DESC] → dispersion
► median linkage [CLUS] → hierarchical clustering (o agglomerative clustering)
► median test [TEST] → hypothesis test
► membership function [MISC] → fuzzy set theory
► metameric transformation [PREP] → transformation
► metameter [PREP] → transformation (o metameric transformation)
► metric data [PREP] → data
► metric distance [GEOM] → distance
► metric multidimensional scaling [MULT] → multidimensional scaling
► metric scale [PREP] → scale
► midrange [DESC] → location
► military standard table (MIL-STD) [QUAL] Military standard for acceptance sampling that is widely used in industry. It is a collection of sampling plans. The most popular one is MIL-STD 105D, which contains standards for single, double and multiple sampling for attributes. The primary focus is the acceptable quality level. There are three inspection levels: normal, tightened and reduced. The sample size is determined by the lot size and the inspection level. MIL-STD 414 contains sampling plans for variable sampling. They also control the acceptable quality level. There are five inspection levels. It is assumed that the quality characteristic is normally distributed.
► minimal spanning tree (MST) [MISC] → graph theory
► minimal spanning tree clustering [CLUS] → non-hierarchical clustering (o graph theoretical clustering)
► minimax strategy [MISC] → game theory
► minimum linkage [CLUS] → hierarchical clustering (o agglomerative clustering)
► Minkowski distance [GEOM] → distance (o quantitative data)
► minres [FACT] → principal factor analysis
► misclassification [CLAS] → classification
► misclassification matrix [CLAS] → classification
► misclassification risk (MR) [CLAS] → classification
► missing value [PREP] Absent element in the data matrix. It is important to distinguish between a real missing value (a value potentially available but not measured), a don't care value
(the measured value is not relevant), and a meaningless value (the measurement is not possible or not allowed). There are several techniques for dealing with missing values. The simplest solution, applicable only in the case of few missing values, is to delete the object (row) or the variable (column) containing the missing value. A more reasonable procedure is to fill in the missing value by some estimated value obtained from the same variable. This can be: the variable mean calculated from all objects with nonmissing values; the variable mean calculated from a subset of objects, e.g. belonging to the same class; or a random value of the normal distribution within the extremes of the variable. While the variable mean produces a flattening of the information content of the variable, the random value introduces spurious noise. Missing values can also be estimated using multivariate models. Principal component, K nearest neighbors and regression models are the most popular ones.
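A compact sketch of two of the simple fill-in strategies mentioned above, the overall variable mean and the class-conditional variable mean; the small data matrix and class labels are illustrative.

import numpy as np

X = np.array([[1.0, 2.0], [np.nan, 2.4], [1.2, np.nan], [3.0, 5.1], [3.3, np.nan]])
y = np.array([0, 0, 0, 1, 1])                 # class labels

# overall variable mean
X_mean = X.copy()
col_means = np.nanmean(X_mean, axis=0)
X_mean[np.isnan(X_mean)] = np.take(col_means, np.where(np.isnan(X_mean))[1])

# class-conditional variable mean
X_class = X.copy()
for g in np.unique(y):
    rows = (y == g)
    class_means = np.nanmean(X_class[rows], axis=0)
    block = X_class[rows]
    block[np.isnan(block)] = np.take(class_means, np.where(np.isnan(block))[1])
    X_class[rows] = block

print(X_mean)
print(X_class)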
► mixed data [PREP] → data
► mixed effect model [ANOVA] → term in ANOVA
► mixture design [EXDE] → design
► moat [CLUS] Measure of external isolation of a cluster defined for hierarchical agglomerative clustering. It is the difference between the similarity level at which the cluster was formed and the similarity level at which the cluster was agglomerated into a larger cluster. Clusters in complete linkage usually have larger moats than clusters in single linkage.
► mode [DESC] → location
► mode analysis [CLUS] → non-hierarchical clustering (o density clustering)
► model [MODEL] Mathematical equation describing a causal relationship between several variables. The responses (or dependent variables) in a model may be either quantitative or qualitative. Examples of the first group are the regression model, the factor analysis model, and the ANOVA model, while classification models, for example, belong to the second group. The value of the highest power of a predictor variable is called
the order of a model. A subset of predictor variables that contains almost the same amount of information about the response as the complete set is called an adequate subset. The number of independent pieces of information needed to estimate a model is called the model degrees of freedom. A model consists of two parts: a systematic part described by the equation, and the model error. This division is also reflected in the division of the total sum of squares. The calculation of optimal parameters of a model from data is called model fitting. Besides fitting the data, a model is also used for prediction. Once a model is obtained it must be evaluated on the basis of goodness of fit and goodness of prediction criteria. This step is called model validation. Often the problem is to find the optimal model among several potential models. Such a procedure is called model selection.
additive model
Model in which the predictors have an additive effect on the response.

biased model
Statistical model in which the parameters are calculated by biased estimators. The goal is to minimize the model error via bias-variance trade-off. In these models the bias is not zero, but in exchange, the variance component of the squared model error is smaller than in a corresponding unbiased model. PCR, PLS, and ridge regression are examples of biased regression models; RDA, DASCO, SIMCA are biased classification models.

causal model
Model concerned with the estimation of the parameters in a system of simultaneous equations relating dependent and independent variables. The independent variables are viewed as causes and dependent variables are effects. The dependent variables may affect each other and be affected by the independent variables; these latter are not, however, affected by the dependent variables. The cause-effect relationship cannot be proved by statistical methods, it is postulated outside of statistics. Causal models can be solved by LISREL or by path analysis and are frequently described by path diagrams.

deterministic model
Model that, in contrast to a stochastic model, does not contain a random element.

hierarchical model : nested model

nested model (: hierarchical model)
Set of models in which each one is a special case of a more general model obtained by dropping some terms (usually by setting a parameter to zero). For example, the model

y = b0 + b1 x

is nested within

y = b0 + b1 x + b2 x²

There are two extreme approaches to finding the best nested model:
- the top-down approach starts with the largest model of the family (also called the saturated model) and then drops terms by backward elimination;
- the bottom-up approach starts with the simplest model and then includes terms one at a time from a list of the possible terms, by forward elimination.
It is important not only to take account of the best fit but also the trade-off between fit and complexity of the model. Hierarchical models can be compared on the basis of criteria such as adjusted R², Mallows' Cp, PRESS, the likelihood ratio and information criteria (e.g. the Akaike information criterion). In contrast to nested models, nonnested models are a group of heterogeneous models. For example, y = b0 + b1 ln(x) + b2(1/x) is nonnested within the above models. Maximum likelihood estimators and information criteria are particularly useful when comparing nonnested models.
parsimonious model
Model with the fewest parameters among several satisfactory models.

soft model
A term used to denote the soft part of the modeling approach, characterized by the use of latent variables, principal components or factors calculated from the data set and therefore directly describing the structure of the data. This is opposite to a hard model, in which a priori ideas of functional connections (physical, chemical, biological, etc. mathematical equations) are used and models are constructed from these.
stochastic model
Model that contains random elements.
► model I [ANOVA] → term in ANOVA
► model II [ANOVA] → term in ANOVA
► model degrees of freedom (df) [MODEL] The number of independent pieces of information necessary to estimate the parameters of a statistical model. For example, in a linear regression model it is calculated as the trace of the hat matrix: tr[H]. In OLS with p parameters: tr[H] = p. The total degrees of freedom (number of objects n) minus the model degrees of freedom is called the error degrees of freedom or residual degrees of freedom.
For example, in a linear regression model it is calculated as n − tr[H]. The error degrees of freedom of an OLS model with p parameters is n − p.
► model error [MODEL] (: error term) Part of the variance that cannot be described by a model, denoted e or ei. It is calculated as the difference between observed and calculated response:

e = y − ŷ    or    ei = yi − ŷi

The standard deviation of ei, denoted σ, is also often called model error. The estimated model error s is an important characteristic of a model and the basis of several goodness of fit statistics.
► model fitting [MODEL] Procedure of calculating the optimal parameter values of a model from observed data. When the functional form of the model is specified in analytical form (e.g. linear or polynomial), the procedure is called curve fitting. In contrast, the form of the model is sometimes defined only by weak constraints (e.g. smooth) and the model obtained is stored in digital form. Overfitting means increasing the complexity of a model, i.e. fitting model terms that make little or no contribution. Increasing the number of parameters and the model degrees of freedom beyond the level supported by the available data causes high variance in the estimated quantities. In case of a small data set, in particular, one should be careful about fitting excessively complex models (e.g. nonlinear models). Underfitting is the opposite phenomenon. Underspecified models (e.g. linear instead of nonlinear, or with terms excluded) result in bias in the estimates.
► model selection [MODEL] Selection of the optimal model from a set of candidate models. The criterion for optimality has to be prespecified. One usually is interested in finding the best predictive model, so the model selection is often based on a goodness of prediction criterion. Important examples are biased regression and classification. The range of candidate models can be defined by the number of predictor variables included in the model (e.g. variable subset selection, stepwise linear discriminant analysis), by the number of components (e.g. PCR, PLS, SIMCA), or by the value of a shrinkage parameter (e.g. ridge regression, RDA).
► model sum of squares (MSS) [MODEL] → goodness of fit
► model validation [MODEL] (: validation) Statistical procedure for validating a statistical model with respect to a prespecified criterion, most often to assess the goodness of fit and goodness of prediction of a model. A list of various model validation techniques follows.
bootstrap
Computer intensive model validation technique that gives a nonparametric estimate of the statistical error of a model in terms of its bias and variance. This procedure mimics the process of drawing many samples of equal size from a population in order to calculate a confidence interval for the estimates. The data set of n observations is not considered as a sample from a population, but as the entire population itself, from which samples of size n, called bootstrap samples, are drawn with replacement. It is achieved by assigning a number to each observation of the data set and then generating the random samples by matching a string of random numbers to the numbers that correspond to the observations. The estimate calculated from the entire data set is denoted t, while the estimate calculated from the bth bootstrap sample is denoted tb. The mean value of tb is

t̄ = Σb tb / B

where B is the number of bootstrap samples. The bootstrap estimate of the bias B(t) and the variance V(t) of statistic t is

B(t) = t̄ − t

V(t) = Σb (tb − t̄)² / (B − 1)
cross-validation (CV)
Model validation technique used to estimate the predictive power of a statistical model. This resampling procedure predicts an observation from a model calculated without that observation, so that the predicted response is independent of the observed response. The predictive residual sum of squares, PRESS, is one of the best goodness of prediction measures. In linear estimators, e.g. OLS, PRESS can be calculated from ordinary residuals, while in nonlinear estimators, e.g. PLS, the predicted observations must literally be left out of the model calculation. Cross-validation, which repeats the model calculation n times, each time leaving out a different observation and predicting it from a model fitted to the other n − 1 observations, is called leave-one-out cross-validation (LOO). Cross-validation in which n' = n/G observations are left out is called G-fold cross-validation. In this case the model calculation is repeated G times, each time leaving out n' different observations. Because the perturbation of the model is greater than in the leave-one-out procedure, the G-fold goodness of prediction estimate is usually less optimistic than the LOO estimate.
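A minimal sketch of leave-one-out cross-validation for an ordinary least squares model, accumulating PRESS; the simulated data are illustrative.

import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(30), rng.uniform(0, 10, 30)])
y = 1.5 + 0.8 * X[:, 1] + rng.normal(0, 1, 30)

press = 0.0
for i in range(len(y)):
    keep = np.arange(len(y)) != i
    b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)   # model without object i
    y_pred = X[i] @ b                                        # predict the left-out object
    press += (y[i] - y_pred) ** 2
print(press)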
jackknife
Model validation technique that gives a nonparametric estimate of the statistical error of an estimator in terms of its bias and variance. The procedure is based on sequentially deleting observations from the sample and recomputing the statistic; the statistic t calculated without the ith observation is denoted t(i). The mean of these statistics calculated from the truncated data sets is

t(·) = Σi t(i) / n

The jackknife estimate of the bias B(t) and the variance V(t) of statistic t is:

B(t) = (n − 1)(t(·) − t)

V(t) = (n − 1)/n Σi (t(i) − t(·))²

For example, the jackknife shows that the mean is an unbiased estimate:

B(x̄) = (n − 1)(x̄(·) − x̄) = 0

V(x̄) = Σi (xi − x̄)² / [n(n − 1)]

Similarly, the bias and variance of the variance can be estimated. The bias and variance of more complex estimators, like regression parameters, eigenvectors, discriminant scores, etc. can be calculated in a similar fashion, obtaining a numerical estimate for the statistical error. The jackknife is similar to cross-validation in that in both procedures observations are omitted one at a time and estimators are calculated repeatedly on truncated data sets. However, the goal in the two techniques is different.
resampling (: sample reuse)
Model validation procedure that repeatedly calculates an estimator from a reweighted version of the original data set in order to estimate the statistical error associated with the estimator. With this technique the goodness of prediction of a model can be calculated from a single data set. Instead of splitting the available data into a training set, to calculate the model, and an evaluation set, to evaluate it, the evaluation set is created in a repeated subsampling. Cross-validation, bootstrap and jackknife belong to this group.

resubstitution
Model validation based on the same observations used for calculating the model. Such validation methods can be used to calculate goodness of fit measures, but not goodness of prediction measures. For example, in regression models RSS or R², in classification the error rate or the misclassification risk are calculated via resubstitution.

sample reuse : resampling

training-evaluation set split
In case of many observations, the calculated model can be evaluated without reusing the observations. The whole data set is split into two parts. One part, called the training set, is used to calculate the model, and the other part, called the evaluation set, is used to evaluate the model. A key point in this model validation method is how to partition the data set. The sizes of the two sets are often selected to be equal. A random split is usually a good choice, except if there is some trend or regularity in the observations (e.g. time series).
► modified control chart [QUAL] → control chart
► moment [PROB] The expected value of a power of a variate x with probability distribution function f(x):

μr(x) = ∫ xʳ f(x) dx

where the integral is taken from −∞ to ∞. The central moment is the moment calculated about the mean:

μr = ∫ (x − μ)ʳ f(x) dx
The absolute moment is defined as

μr(abs) = ∫ |x|ʳ f(x) dx

The most important moments are the first moment, which is the mean

μ = ∫ x f(x) dx

and the second central moment, which is the variance of the variate

σ² = ∫ (x − μ)² f(x) dx

The skewness of a distribution is μ3/σ³ and the kurtosis of a distribution is μ4/σ⁴.
► momentum term [MISC] → neural network
► Monte Carlo sampling [PROB] → sampling
► Monte Carlo simulation [PROB] : simulation
► Mood-Brown's test [TEST] → hypothesis test
► Mood's test [TEST] → hypothesis test
► Moore-Penrose inverse [ALGE] → matrix operation (o inverse of a matrix)
► monothetic clustering [CLUS] → cluster analysis
► monotone admissibility [CLUS] → assessment of clustering (o admissibility properties)
► Moses' test [TEST] → hypothesis test
► most powerful test [TEST] → hypothesis testing
► moving average (MA) [TIME] The series of values

mq = Σi wi xq+i / Σi wi    i = 1, k    q = 0, n − k

where x1, ..., xn are a set of values from a time series, k is a span parameter (k < n) and w1, ..., wk are a set of weights.
If all the weights are equal to 1/k, the moving average is called simple and can be constructed by dividing the moving total (sum of k consecutive elements) by k:

Σi=1,k xi / k    Σi=2,k+1 xi / k    Σi=3,k+2 xi / k    etc.

The set of moving averages constitutes the moving average model.
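A small sketch of the weighted and the simple moving average defined above; the series, span and weights are illustrative.

import numpy as np

def moving_average(x, weights):
    # weighted moving average over a window of length k = len(weights)
    k, w = len(weights), np.asarray(weights, dtype=float)
    return np.array([np.dot(w, x[q:q + k]) / w.sum() for q in range(len(x) - k + 1)])

x = np.array([3.0, 4.0, 6.0, 5.0, 7.0, 9.0, 8.0, 10.0])
print(moving_average(x, [1, 1, 1]))        # simple moving average, k = 3
print(moving_average(x, [1, 2, 1]))        # weighted moving average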
► moving average chart [QUAL] → control chart
► moving average model (MA) [TIME] → time series model
► moving total [TIME] → moving average
► multiblock PLS [REGR] → partial least squares regression
► multicollinearity [MULT] : collinearity
b multicriteria decision making (MCDM) [QUAL] Techniques for finding the best settings of process variables that optimize more than one criterion. These techniques can be divided into two groups: the generating techniques use no prior information to determine the relative importance of the various criteria, while in the preference techniques the relative importance of the criteria are given a priori. The most commonly used MCDM techniques follow.
desirability function Several responses are merged into one final criterion to be optimized. The first step is to define the desired values for each response. Each measured response is transformed into a measure of desirability d_r, where 0 ≤ d_r ≤ 1. The overall desirability D is calculated as the geometric mean:

    D = (d₁ d₂ ... d_R)^(1/R)

The following scale was suggested for desirability:

    Value        Quality
    1.00-0.80    excellent
    0.80-0.63    good
    0.63-0.40    poor
    0.40-0.30    borderline
    0.30-0.00    unacceptable

When the response is categorical, or should be in a specified interval or above a specified level to be acceptable, only d_r = 1 and d_r = 0 are assigned. When the response Y_r is continuous, it is transformed into d_r by a one-sided transformation:

    d_r = exp[−exp(−Y_r)]

or by a two-sided transformation.
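A sketch of the calculation just described, with made-up response values for three criteria of a single experiment; the one-sided transformation and the geometric-mean combination follow the formulas above.

```python
import numpy as np

def one_sided_desirability(y):
    """Transform a scaled continuous response y into a desirability in (0, 1)."""
    return np.exp(-np.exp(-np.asarray(y, dtype=float)))

y = np.array([1.5, 0.2, 2.3])            # hypothetical scaled responses
d = one_sided_desirability(y)            # individual desirabilities d_r
D = d.prod() ** (1.0 / len(d))           # overall desirability: geometric mean
print(d, D)
```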
outranking method Ranks all possible parameter value combinations. Methods belonging to this group are: ELECTRE, ORESTE, and PROMETHEE. PROMETHEE ranks the parameter combinations using a preference function that gives a binary output comparing two parameter combinations. overlay plot Projects several bivariate contour plots of response surfaces onto one single plot. Each contour plot represents a different criterion. Minimum and maximum boundaries for the acceptable criterion in each contour plot can be compared visually on the aggregate plot and the best process variable settings selected.
Pareto optimality criterion Selects a set of noninferior, so-called Pareto-optimal points, that are superior to other points in the process variable space. The nonselected points are inferior to the Pareto-optimal points in at least one criterion. The positions of the Pareto-optimal points are often investigated on principal component plots and biplots.
utility function An overall optimality criterion is calculated as a linear combination of the K different criteria f_k, k = 1, K. The optimum of such a utility function, given certain weights, always lies in the set of Pareto-optimal points.
multidimensional scaling (MDS) [MULT] Mapping method to construct a configuration of n points in a p'-dimensional space from a matrix containing interpoint distances or similarities measured with error in the p-dimensional space. Starting from a distance matrix D, the objective is to find the locations of the data points x₁, ..., xₙ in the p'-dimensional space such that their interpoint distances d' in the projected space are similar in some sense to the original distances d ∈ D. Usually p' is restricted to 1, 2, or 3. Multidimensional scaling is a dimension reduction technique that provides a one-, two-, or three-dimensional picture that conserves most of the structure of the original configuration. There are several solutions to this problem, all are indeterminate with respect to translation, rotation and reflection. The solution is a non-metric multidimensional scaling (NMDS) if it is based only on the rank order of the pairwise distances, otherwise it is called metric multidimensional scaling. The most popular solution minimizes a stress function measuring the discrepancy between the original and the projected distances.
The minimum indicating the optimal configuration is usually found by the steepest descent method. b multigraph [MISC] + graph theory
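Relating to the multidimensional scaling entry above, the sketch below runs a metric and a non-metric MDS on a precomputed distance matrix with scikit-learn; the data are simulated only for illustration.

```python
import numpy as np
from sklearn.manifold import MDS
from scipy.spatial.distance import squareform, pdist

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 6))                 # 20 objects in a 6-dimensional space
D = squareform(pdist(X))                     # interpoint distance matrix

metric_mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
Y = metric_mds.fit_transform(D)              # 2-dimensional metric configuration

nmds = MDS(n_components=2, metric=False, dissimilarity="precomputed", random_state=0)
Y_nm = nmds.fit_transform(D)                 # non-metric (rank-order) solution
print(metric_mds.stress_, nmds.stress_)      # stress of the two configurations
```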
multinomial distribution [PROB] -+ distribution b
multinomial variable [PREP] + variable b
b multiple comparison [ANOVA] + contrast b multiple correlation [DESC] + correlation
multiple correlation coefficient [MODEL] + goodness of fit b
b
multiple least squares regression [REGR]
: ordinary least squares regression
multiple regression model [REGR] + regression model b
b multiple sampling [QUAL] + acceptance sampling b
multiplication of matrices [ALGE]
+ matrix operation
multistate variable [PREP] + variable b
multivariate [MULT] This term refers to objects described by several variables (multivariate data), to statistical parameters and distributions of such data, and to models and methods applied to such data. In contrast, the term univariate refers to objects described by only one variable (univariate data), to statistical parameters and distributions of such a variable, and to models and methods applied to one single variable. One-variable-at-a-time (OVAT) is a term which refers to a statistical method that tries to find the optimal solution for a multivariate problem considering the variation in only one variable at a time, while keeping all other variables at a fixed level. Such a limited approach loses information about variable interaction, synergism and correlation. Despite this drawback, such methods are used due to their simplicity and ease of control and interpretability.
multivariate [PROB] + random variable b
b multivariate adaptive regression splines (MARS) [REGR] Nonparametric, nonlinear regression model based on splines. The MARS model has the form:
    ŷ = b₀ + Σⱼ fⱼ(xⱼ) + Σⱼₖ fⱼₖ(xⱼ, xₖ) + Σⱼₖₗ fⱼₖₗ(xⱼ, xₖ, xₗ) + ...
The first term contains all functions that involve a single variable, the second term contains all functions that involve two variables, the third term has functions with three variables, etc. A univariate spline has the form

    ŷᵢ = b₀ + Σₘ bₘ Bₘ(xᵢ)        i = 1, n

where the Bₘ are called spline basis functions and have the form:

    Bₘ(xᵢ) = [sₘ(xᵢ − tₘ)]₊^q        m = 1, M

Values tₘ are called knot locations that define the predictor subregions; sₘ equals either +1 or −1. The + sign indicates that the function is evaluated only for positive values. Index q is the power of the fitted spline; q = 1 indicates a linear spline, while q = 3 indicates a cubic spline. MARS is a multivariate extension of the above univariate model:

    ŷᵢ = b₀ + Σₘ bₘ Πⱼ Bⱼₘ(xᵢⱼ)
Each multivariate basis function is a product of j = 1, J univariate basis functions. The parameter J can be different for each multivariate basis function. Once the basis functions are calculated the regression coefficients bₘ are estimated by the least squares procedure. The above multivariate splines are adaptive splines in MARS. This means that the knot locations are not fixed, but optimized on the basis of the training set. The advantage of the MARS model is its capability to depict different relationships in the various predictor subregions via local variable subset selection, and to include both additive and interactive terms. As in many nonlinear and biased regression methods, a crucial point is to determine the optimal complexity of the model. In MARS the complexity is determined by q, the degree of the polynomials fitted, by M, the number of terms, and by J, the order of the multivariate basis functions. These parameters are optimized by cross-validation to obtain the best predictive model.
multivariate analysis [MULT] Statistical analysis performed on multivariate data, i.e. on objects described by more than one variable. Multivariate data can be collected in one or more data matrices. Multivariate methods can be distinguished according to the number of data matrices they deal with. Display methods, factor analysis, cluster analysis and principal coordinate analysis explore the structure of one single matrix. Canonical correlation analysis and Procrustes analysis examine the relationship between two matrices. Latent variable models connect two or more matrices. Multivariate analysis that examines the relationship among n objects described by p variables is called Q-analysis. The opposite is R-analysis when the relationship
among p variables determined by n observations is of interest. For example, clustering objects or extracting latent variables is Q-analysis, while clustering variables or calculating correlation among variables is R-analysis.
multivariate analysis of variance (MANOVA) [ANOVA] Multivariate generalization of the analysis of variance, studying group differences in location in a multidimensional measurement space. The response is a vector that is assumed to arise from multivariate normal distributions with different means but with the same covariance matrix. The goal is to verify the differences among the population locations and to estimate the treatment effects. The one-way MANOVA model with I levels and K observations per level partitions the total scatter matrix T into between-scatter matrix B and within-scatter matrix W:

    T = B + W

where

    B = K Σᵢ (x̄ᵢ − x̄)(x̄ᵢ − x̄)ᵀ        W = Σᵢ Σₖ (xᵢₖ − x̄ᵢ)(xᵢₖ − x̄ᵢ)ᵀ

The test for location differences involves generalized variances. The null hypothesis of equal population means is rejected if the ratio (Wilks' Λ)

    Λ = |W| / |T|

is too small.
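A sketch of the scatter-matrix decomposition and Wilks' Λ described above, computed with numpy; the two-group data are simulated only for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
# hypothetical data: I = 2 groups, K = 30 observations each, p = 3 variables
groups = [rng.normal(loc=m, size=(30, 3)) for m in ([0, 0, 0], [1, 0.5, 0])]
X = np.vstack(groups)
grand_mean = X.mean(axis=0)

T = (X - grand_mean).T @ (X - grand_mean)            # total scatter matrix
W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
B = T - W                                            # between-scatter matrix

wilks_lambda = np.linalg.det(W) / np.linalg.det(T)   # small values reject H0
print(wilks_lambda)
```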
multivariate calibration [REGR] + calibration b
multivariate data [PREP] + data b
multivariate dispersion [DESC] The dispersion of multivariate data about its location is measured, for example, by the covariance matrix. However, sometimes it is convenient to assess the multivariate dispersion by a single number. Two common measures are: the generalized variance, which is the determinant of the covariance matrix, |S|, and the total variation, which is the trace of the covariance matrix, tr[S]. In both cases, a large value indicates wide dispersion and a low value represents a tight concentration of the data about the location. For example, these quantities are optimized in non-hierarchical cluster analysis. Generalized variance plays an important role in maximum likelihood estimation; total variation is calculated in principal component analysis.
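A short numeric sketch of these two single-number dispersion measures; the data are simulated and not from the text.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.multivariate_normal([0, 0, 0],
                            [[2, .5, 0], [.5, 1, .3], [0, .3, 4]], size=500)

S = np.cov(X, rowvar=False)                 # sample covariance matrix
generalized_variance = np.linalg.det(S)     # |S|
total_variation = np.trace(S)               # tr[S]
print(generalized_variance, total_variation)
```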
multivariate distribution [PROB]
+ random variable b multivariate image analysis (MIA) [MULT] + image analysis b
multivariate least squares regression ( M U ) [REGR]
: ordinary least squares regression b multivariate normal distribution [PROB] + distribution
multivariate regression model [REGR] + regression model b
b mutual information [MISC] + information theory
np-chart [QUAL] → control chart (◦ attribute control chart)
b narrow-band process [TIME] + stochastic process
nearest centroid sorting [CLUS] + non-hierarchical clustering ( 0 optimization clustering) b
b nearest centrotype sorting [CLUS] + non-hierarchical clustering (n optimization Clustering)
b
nearest means classification (NMC) [CLAS]
: centroid classijication b nearest neighbor density estimator [ESTIM] + estimator (0 density estimator)
nearest neighbor linkage [CLUS] + hierarchical clustering (0 agglomerative clustering) b
b negative binomial distribution [PROB] + distribution
b negative exponential distribution [PROB] + distribution
nested design [EXDE] + design b
nested effect term [ANOVA] + term in ANOVA b nested model [ANOVA] + analysis of variance
nested model [MODEL] + model b
nested sampling [PROB] + sampling b
network [MISC] → graph theory
neural network (NN) [MISC] (: artificial neural network) New, rapidly developing field of pattern recognition on the basis of mimicking the function of the biological neural system. It is a parallel machine that uses a large number of simple processors, called neurons, with a high degree of connectivity. A NN is a nonlinear model, and an adaptive system that learns by adjusting the strength of connections between neurons.
A simple NN model consists of three layers, each composed of neurons: the input layer distributes the input to the processing layer, the output layer produces the output, and the processing or hidden layer (sometimes more than one) does the calculation. The neurons in the input layer correspond to the predictor variables, while the neurons in the output layer represent the response variables. Each neuron in a layer is fully connected to the neurons in the next layer. Connections between neurons within the same layer are prohibited. The hidden layer has no connection to the outside world; it calculates only intermediate values. The node-node connection having a weight associated with it is called a synapse. These weights are automatically adapted in the training process to obtain an optimal setting. Each neuron has a transfer function that translates input to output. There are different types of networks. In the feedforward network the signals are propagated from the input layer to the output layer. The input xᵢ of a neuron is the weighted output from the connected neurons in the previous layer. The output yᵢ of a neuron i depends on its input xᵢ and on its transfer function f(xᵢ). A popular transfer function is the sigmoid:

    yᵢ = f(xᵢ) = 1 / (1 + exp(−(xᵢ − tᵢ)))

where tᵢ is a shift parameter. Good performance of a neural network depends on how well it was trained. The training is done by supervised learning, i.e. iteratively, using a training set with known output values. The most popular technique is called a backpropagation network. In the nth iteration the weight of the connection from neuron i to neuron j is changed as:

    Δwᵢⱼ(n) = η δⱼ yᵢ + α Δwᵢⱼ(n − 1)

where η is the learning rate or step size parameter and α is the momentum term. If the learning rate is too low, the convergence of the weights to an optimal value is too slow. On the other hand, if the learning rate is too high, the system may oscillate. The momentum term is used to damp the oscillation. Function δⱼ is the error for node j. If node j is in the output layer and has the desired output dⱼ, then the error is:

    δⱼ = yⱼ (1 − yⱼ) (dⱼ − yⱼ)

If node j is in the hidden layer, then

    δⱼ = yⱼ (1 − yⱼ) Σₖ δₖ wⱼₖ

where k is indexing the neurons in the next layer.
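A minimal sketch of backpropagation for a single-hidden-layer network with sigmoid transfer functions, written with numpy to mirror the update rules above; the layer sizes, data and learning parameters are made up for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(5)
x = rng.normal(size=4)                    # one training object with 4 predictors
d = np.array([0.0, 1.0])                  # desired outputs for 2 response neurons

W1 = rng.normal(scale=0.5, size=(4, 3))   # input -> hidden weights (synapses)
W2 = rng.normal(scale=0.5, size=(3, 2))   # hidden -> output weights
eta, alpha = 0.5, 0.9                     # learning rate and momentum term
dW1_prev = np.zeros_like(W1)
dW2_prev = np.zeros_like(W2)

for n in range(100):
    h = sigmoid(x @ W1)                              # hidden layer outputs
    y = sigmoid(h @ W2)                              # output layer outputs
    delta_out = y * (1 - y) * (d - y)                # error for output nodes
    delta_hid = h * (1 - h) * (W2 @ delta_out)       # error propagated to hidden nodes
    dW2 = eta * np.outer(h, delta_out) + alpha * dW2_prev
    dW1 = eta * np.outer(x, delta_hid) + alpha * dW1_prev
    W2 += dW2; W1 += dW1
    dW2_prev, dW1_prev = dW2, dW1

print(y)   # the outputs approach the desired values 0 and 1
```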
neuron [MISC] + neural network b
b Newman-Keuls' test [TEST] -+ hypothesis test
Newton-Raphson optimization [OPTIM] Gradient optimization using both the gradient vector (first derivative) and the Hessian matrix (second derivative). The scalar valued function f(p), where p is the vector of parameters, is approximated at p₀ by the Taylor expansion:

    f(p) ≈ z(p) = f(p₀) + (p − p₀)ᵀ g(p₀) + 0.5 (p − p₀)ᵀ H(p₀) (p − p₀)

where g is the gradient vector and H is the Hessian matrix evaluated at p₀. In each step i of the iterative procedure the minimum of f(p) is approximated by the minimum of z(p):

    pᵢ₊₁ = pᵢ − sᵢ H⁻¹(pᵢ) g(pᵢ)

where the direction is the ratio of the first and second derivatives H⁻¹g and the step size sᵢ is determined by a linear search optimization. This procedure converges rapidly when pᵢ is close to the minimum. Its disadvantage is that the Hessian matrix and its inverse must be evaluated in each step. Also, if the parameter vector is not close to the minimum, then H may become negative definite and convergence cannot be reached. When the function f(p) is of a special quadratic form, as in the nonlinear least squares regression problem, the Gauss-Newton optimization offers a simplified computation.
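A sketch of the Newton-Raphson iteration for a simple two-parameter function, with the gradient and Hessian written out by hand; the test function, starting point and unit step size are chosen only for illustration.

```python
import numpy as np

def f(p):
    # smooth convex test function of two parameters (minimum at p = (1, 2))
    return np.exp(p[0] - 1.0) - p[0] + (p[1] - 2.0) ** 2

def gradient(p):
    return np.array([np.exp(p[0] - 1.0) - 1.0, 2.0 * (p[1] - 2.0)])

def hessian(p):
    return np.array([[np.exp(p[0] - 1.0), 0.0], [0.0, 2.0]])

p = np.array([4.0, -3.0])                             # initial guess p0
for i in range(50):
    step = np.linalg.solve(hessian(p), gradient(p))   # H^-1 g without forming the inverse
    p = p - step                                      # step size s_i = 1
    if np.linalg.norm(step) < 1e-10:                  # convergence criterion
        break

print(p, f(p))   # converges to the minimum at p = (1, 2)
```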
nilpotent matrix [ALGE] + matrix b
b
node [MISC] graph theory
b
noise variable [PREP] variable
b
nominal scale [PREP] scale
b no-model error rate [CLAS] + classification (0 error rate)
noncentral composite design [EXDE] + design (0 composite design) b
b nonconforming item [QUAL] + lot b
nonconformity [QUAL] lot
-+
b
non-error rate (NER) [CLAS] ( 0 error rate)
+ classification
b non-hierarchical clustering [CLUS] ( .-partitioning clustering) Clustering that produces a division of objects into a certain number of clusters. This clustering results in one single partition, as opposed to hierarchical clustering, which produces a hierarchy of partitions. The number of clusters is either given a priori or determined by the clustering method. The clusters obtained are often represented by their centrotypes or centroids. Non-hierarchical clustering methods can be grouped as: density clustering, graph theoretical clustering, and optimization clustering.
density clustering Non-hierarchical clustering that seeks regions of high density (modes) in the data to define clusters. In contrast to most optimization clustering methods, density clustering produces clusters of a wide range of shapes, not only spherical ones. One of the most popular methods is Wishart's mode analysis, which is closely related to single linkage clustering. For each object mode analysis calculates its distances from all other objects, then selects and averages the distances from its K nearest neighbors. Such an average distance for an object from a dense area is small, while outliers that have few close neighbors have a large average distance.
Another method based on the nearest neighbor density estimation is the JarvisPatrick clustering. This is based on an estimate of the density around an object by counting the number of neighbors K that are within a preset radius R of an object. Two objects are assigned to the same cluster if they are on each other's nearest neighbor list (length fixed a priori) and if they have at least a certain number (fixed a priori) of common nearest neighbors. The Coomans-Massart clustering method, also called CLUPOT clustering, uses a multivariate Gaussian kernel in estimating the potential function. The object with the highest cumulative potential value is selected as the center of the first cluster. Next, all members of this first cluster are selected by exploring the neighborhood of the cluster center. After the first cluster and its members are defined and set aside, a new object with the highest cumulative potential is selected to be the center of the next cluster and members of this cluster are selected. The procedure continues until all objects are clustered.
graph theoretical clustering Non-hierarchical clustering that views objects (variables) as nodes of a graph and applies graph theory to obtain clusters of those nodes. The best known such method is the minimal spanning tree clustering. The tree is built stepwise such that, in each step, the link with the smallest distance is added that does not form a cycle in the path. This process is the same as the single linkage algorithm. The final partition with a given number of clusters is obtained by minimizing the distances associated with the links, i.e. cutting the longest links.
optimization clustering Non-hierarchical clustering that seeks a partition of objects into G clusters optimizing a predefined criterion. These methods, also called hill climbing clustering, are based on a relocation algorithm. They differ from each other in the optimization criterion. For a given partition the total scatter matrix T can be partitioned into a within-group scatter matrix W and a between-group scatter matrix B: T = W + B, where W is the sum of all within-group scatter matrices: W = Σ_g W_g. The eigenvalues of BW⁻¹ are λⱼ, j = 1, M where M = min[p, G]. For univariate data the optimal partition minimizes W and maximizes B. For multivariate data similar optimization is achieved by minimizing or maximizing one of the following criteria:
◦ error sum of squares The most popular criterion to minimize is the error sum of squares, i.e. the sum of squared distances between each object and its own cluster centroid or centrotype. This criterion is equivalent to minimizing the trace of W:

    min[tr(W)]
Methods belonging to this group differ from each other in their cluster representation,in how they find the initial partition and in how they reach the optimum. The final partition is obtained in an iterative procedure that relocates an object, i.e. puts it into another cluster if that cluster’s centrotype or centroid is closer. Methods in which clusters are represented by centrotypes are called nearest centrotype sorting methods, e.g. K-median clustering and MASLOC. G objects are selected from a data set to be the centrotypes of G clusters such that the sum of distances between an object and its centrotype is a minimum. Leader clustering selects the centrotypes iteratively as objects which lie at a distance greater than some preset threshold value from the existing centrotypes. Its severe drawback is the strong dependence on the order of the objects. Methods representing clusters by their centroids are called nearest centroid sorting methods. K-means clustering (also called MacQueen clustering) selects K centroids (randomly or defined by the first K objects) and recalculates them each time after an object is relocated. Forgy clustering is similar to the K-means method, except that it updates the cluster centroids only after all objects have been checked for potential relocation. Jancey clustering is similar to Forgy clustering, except that the centroids are updated by reflecting the old centroids through the new centroids of the clusters. K-Weber clustering is another variation of the K-means method. Ball and Hall clustering takes the overall centroid as the first cluster centroid. In each step the objects that lie at a distance greater than some preset threshold value from the existing centroids are selected as additional initial centroids. Once G cluster centroids have been obtained, the objects are assigned to the cluster of the closest centroid. ISODATA clustering is the most elaborate of the nearest centroid clustering methods. It obtains the final partition not only by relocation of objects, but also by lumping and splitting clusters according to several user-defined threshold parameters. o cluster radius K-center clustering is a nearest centrotype sorting method that minimizes the cluster radiuses, i.e. minimizes the maximum distance between an object and its centrotype. In the covering clustering method the maximum cluster radius is fixed and the goal is to minimize the number of clusters under the radius constraint. o cluster diameter Hansen-Delattre clustering minimizes the maximum cluster diameter
◦ intra-cluster distances Schrader clustering minimizes the sum of intra-cluster distances.
◦ determinant of W Friedman-Rubin clustering minimizes the determinant of the within-group scatter matrix, min[|W|]. It is equivalent to maximizing |T|/|W|, or to minimizing the Wilks' lambda statistic min[|W|/|T|].
◦ trace of BW⁻¹ Another criterion suggested by Friedman and Rubin is to maximize Hotelling's trace: max[tr(BW⁻¹)] = max[Σⱼ λⱼ].
◦ largest eigenvalue of BW⁻¹ Maximize Roy's greatest root criterion max[λ₁].
nonlinear estimator [ESTIM] → estimator
nonlinear iterative partial least squares (NIPALS) [ALGE] → eigenanalysis (◦ power method)
nonlinear mapping (NLM) [MULT] Mapping method, similar to multidimensional scaling, that calculates a two- or three-dimensional configuration of high-dimensional objects. It tries to preserve relative distances between points in the low-dimensional display space so that they are as similar as possible to the distances in the original high-dimensional space. After calculating the distance matrix in the original space, an initial (usually random)
configuration is chosen in the display space. A mapping error is calculated from the two distance matrices (calculated in the original and in the display spaces). Coordinates of objects in the display space are iteratively modified so that the mapping error is minimized. Any monotone distance measure can be used for which the derivative of the mapping error exists, e.g. Euclidean, Manhattan, or Minkowski distances. In order to avoid local optima it has been suggested that the optimization be started from several random configurations. Principal components projection can also be used as a starting configuration. There are several mapping error formulas in the literature, the most popular one is:
    E = [1 / Σᵢ<ⱼ dᵢⱼ] Σᵢ<ⱼ (dᵢⱼ − d′ᵢⱼ)² / dᵢⱼ
where d′ᵢⱼ are distances in the display space, while dᵢⱼ are distances in the original space. The various algorithms also differ in the minimization procedure; the most popular is the steepest descent method. There are some drawbacks to NLM: new test points cannot be placed on a previously calculated map; computation is time consuming; the solution is often a local minimum.
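A sketch of the mapping error E above, computed with numpy/scipy for an original configuration and a candidate two-dimensional display configuration (both made up here); this is the quantity a nonlinear mapping algorithm would minimize.

```python
import numpy as np
from scipy.spatial.distance import pdist

def mapping_error(X_high, X_low):
    """Mapping error between original and display configurations (formula above)."""
    d = pdist(X_high)            # distances d_ij in the original space
    d_low = pdist(X_low)         # distances d'_ij in the display space
    return np.sum((d - d_low) ** 2 / d) / d.sum()

rng = np.random.default_rng(6)
X = rng.normal(size=(15, 5))     # objects in the original 5-dimensional space
Y = rng.normal(size=(15, 2))     # an initial random display configuration
print(mapping_error(X, Y))       # to be minimized, e.g. by steepest descent
```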
nonlinear partial least squares regression (NLPLS) [REGR] + partial least squares regression b
nonlinear programing [OPTIM] + optimization b
b nonlinear regression model [REGR] + regression model b non-metric data [PREP] + data
non-metric multidimensional scaling (NMDS) [MULT] + multidimensional scaling b
non-metric scale [PREP] + scale b
b nonparametric classification [CLAS] + parametric classijication
nonparametric estimator [ESTIM] + estimator b
b nonprobabilistic classification [CLAS] -+ probabilistic classif cation b nonsingular matrix [ALGE] + matrix
norm [ALGE] Measure of the size of a vector or a matrix, denoted as ||·||. It is a function that assigns a scalar to a vector or a matrix.
matrix norm Measure of the size of a matrix X. The most frequently used matrix norms are:
◦ Frobenius norm

    ||X||_F = [Σᵢ Σⱼ xᵢⱼ²]^(1/2)

◦ p-norm

    ||X||_p = sup ||Xv||_p / ||v||_p        v ≠ 0

where v is a nonzero vector. When p = 2, this norm is called the two-norm, and is equal to the largest singular value of X.
vector norm Measure of the size of a vector x. The most frequently used vector norm is the p-norm:

    ||x||_p = [Σᵢ |xᵢ|^p]^(1/p)

When p = 2 this norm is called the Euclidean norm; when p = ∞ the norm reduces to

    ||x||_∞ = maxᵢ |xᵢ|
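A quick numerical illustration of these norms with numpy (the vector and matrix values are arbitrary):

```python
import numpy as np

x = np.array([3.0, -4.0, 12.0])
print(np.linalg.norm(x, 2))        # Euclidean norm (p = 2): 13.0
print(np.linalg.norm(x, np.inf))   # infinity norm: max |x_i| = 12.0

X = np.array([[1.0, 2.0], [3.0, 4.0]])
print(np.linalg.norm(X, "fro"))    # Frobenius norm
print(np.linalg.norm(X, 2))        # two-norm = largest singular value
print(np.linalg.svd(X, compute_uv=False)[0])   # the same value, via the SVD
```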
b normal distribution [PROB] -+ distribution b normal equation [ALGE] A set of simultaneous equations obtained in calculating a least squares estimator. For example the normal equations for the coefficient and intercept of a single
regression y = a + bx are:

    Σᵢ yᵢ = n a + b Σᵢ xᵢ
    Σᵢ xᵢyᵢ = a Σᵢ xᵢ + b Σᵢ xᵢ²

Solving these two equations gives the least squares estimates

    b = [n Σᵢ xᵢyᵢ − (Σᵢ xᵢ)(Σᵢ yᵢ)] / [n Σᵢ xᵢ² − (Σᵢ xᵢ)²]        a = [Σᵢ yᵢ − b Σᵢ xᵢ] / n
b normal kernel [ESTIM] + kernel b
normal probability plot [GRAPH]
+ quantile plot b normal process [TIME] + stochasticprocess b normal residual plot [GRAPH] + quantile plot b normality assumption [PROB] The assumption underlying many statistical models and tests, that requires a variable, e.g. the model error, to follow a normal distribution. For example, in ordinary least squares regression and discriminant analysis the error is assumed to be normally distributed with zero mean and equal variance, and to be uncorrelated. The most often used techniques for verifying the normality assumption are D’Agostino’s test and the normal probability plot. b normalized Hamming coefficient [GEOM] + distance (o binalydata) b notched box plot [GRAPH] + box plot b null hypothesis [TEST] + hypothesis testing b
null matrix [ALGE]
+ matrix
number of factors [FACT] + rank analysis b
numerical data [PREP] data
numerical error [OPTIM] Error due to numerical approximations in calculating a number from mathematical functions or in representation of the number itself. When iterative procedures are used, error propagation can also occur, i.e. at a given stage of the calculation part of the error derives from the error at a previous stage. The propagation of experimental and numerical errors can be examined by repeating the calculation with slightly perturbed data. Several types of numerical errors can limit the accuracy of the results:
discretization error Error due to substituting integrals with finite sums.
roundoff error Error due to the finite representation of the digits of a number x. Digital computers store the digits of a number as a finite sequence of discrete physical quantities. For example, in floating-point notation, the number is represented by a finite number t of nonzero digits before the decimal point (mantissa) and a finite number e which indicates the position of the decimal point with respect to the mantissa (exponent):

    x = t · 10^e

The numbers t and e (together with the base of the exponent, 10 or 2) determine the subset M of real numbers R, called machine numbers, M ⊂ R, which can be represented exactly in a given machine. Thus, any number x ∉ M is approximated by a machine number rd(x) ∈ M satisfying

    |x − rd(x)| ≤ |x − r|        for all r ∈ M

The quantity |x − rd(x)| is the round-off error. This type of error can obviously occur not only in representing the final results, but also during intermediate calculations.
truncation error Error due to substituting infinite series with a finite number of terms of the series. b
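A tiny illustration of round-off error in double-precision floating point (a sketch, unrelated to any specific example in the text):

```python
import numpy as np

print(0.1 + 0.2 == 0.3)          # False: neither 0.1 nor 0.2 is a machine number
print(abs((0.1 + 0.2) - 0.3))    # the accumulated round-off error, about 5.6e-17
print(np.finfo(float).eps)       # machine precision for 64-bit floats, about 2.2e-16
```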
numerical optimization [OPTIM]
: optimization b
numerical taxonomy [CLUS]
: cluster analysis
0 b object [PREP] (; data point, item, observation, unit) The basic unit in data analysis, e.g. an individual, an animal, a molecule, a food sample. Each object is described by one or more measurements, called data. A collection of objects constitutes a data set. Each object corresponds to a row of the data matrix. The ratio between the number of objects n and the number of variables p is called the object-variable ratio n/p. If this ratio is smaller than one the data set is an underdetermined system, while a data set with a ratio higher than one is a well-determined system. In the case of an underdetermined system one must apply biased methods to mitigate the high variance of estimates calculated from such a data set.
b
objective function [OPTIM] optimization
b object-variable interaction [FACT] + correspondencefactor analysis b
object-variable ratio [PREP] object
b oblimax rotation [FACT] + factor rotation b
oblimin rotation [FACT] fuctor rotation
oblique rotation [FACT] + factor rotation b
b
observation [PREP]
: object b
off-diagonal element [ALGE]
+ matrix
b
off-line quality control [QUAL] → Taguchi method
offset [REGR] + regression coeficient b
b one-sided test [TEST] + hypothesis testing
one-variable-at-a-time (OVAT) [MULT] + multivariate b
one-way analysis of variance [ANOVA] The simplest analysis of variance model, containing one single effect. The total variance has two components: variance due to the effect, and a random component. The assumption is that there are I normal populations (where I is the number of levels the effect can assume) with different means but common variances, from which samples of size K are drawn. In this model the total variance of the data is partitioned into two parts: the sum of squares of differences between treatment means and the grand mean (often called the between sum of squares); and the sum of squares of differences between observations and their own treatment means (often called the within sum of squares or error sum of squares):

    Σᵢ Σₖ (xᵢₖ − x̄)² = K Σᵢ (x̄ᵢ − x̄)² + Σᵢ Σₖ (xᵢₖ − x̄ᵢ)²        i = 1, I;  k = 1, K;  n = I K

The first term on the right measures the effect of treatments, the second term is due to random error. The total sum of squares has n − 1 degrees of freedom, the between sum of squares has I − 1 degrees of freedom and the within sum of squares has n − I degrees of freedom. The one-way ANOVA model is used, for example, in regression diagnostics to partition the total variance into variance due to the regression model and variance due to random error. In this case the number of levels is substituted by the number of parameters of the regression model.
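A sketch of the one-way decomposition on simulated data (level means and sample sizes are made up), checked against the F test in scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# I = 3 levels, K = 10 observations per level, common variance
levels = [rng.normal(loc=m, scale=1.0, size=10) for m in (0.0, 0.5, 1.5)]
x = np.concatenate(levels)
grand_mean = x.mean()

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in levels)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in levels)
ss_total = ((x - grand_mean) ** 2).sum()
print(np.isclose(ss_total, ss_between + ss_within))    # the partition holds

F = (ss_between / (3 - 1)) / (ss_within / (len(x) - 3))
print(F, stats.f_oneway(*levels).statistic)             # same F ratio
```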
on-line quality control [QUAL] + Taguchi method b
b open data [PREP] + data (o closed data)
operating characteristic [TEST] + hypothesis testing b
operating characteristic curve (OCC) [QUAL] Plot of the type II error against the process quality. It indicates the probability of accepting a lot if it contains a certain percentage of defective items estimated from a sample. The probability of accepting the lot with no defective items is one. As the number of defective items increases the probability of acceptance decreases
asymptotically to zero. The shape of the OC curve depends on the sampling plan and on the distribution assumed. b operations research (OR) [MISC] Application of mathematical tools based on objective and quantitative criteria, where necessary with given constraints, to system organization, decision making and system control in order to provide the best solutions. Techniques most commonly used in OR are: graph theory, game theory, linear programming, optimization, simulation.
optimal design [EXDE] + design b
b optimization [OPTIM] (: numerical optimization) Finding the optimal value (minimum or maximum) of a numerical function f, called the objective function, with respect to a set of parameters p, f ( P I , . . . , pp). If the values that the parameters can take on are constrained, the procedure is called a constrained optimization.
(figure: objective function of a parameter p, showing global and local maxima and minima)
At stationary points of a function, i.e. at minimum (global or local), maximum (global or local) or saddle points, the first (partial) derivatives are zero:

    ∂f/∂p₁ = ... = ∂f/∂p_p = 0

At a minimum the second derivative matrix (the Hessian matrix), with elements ∂²f/∂pⱼ∂pₖ, is positive definite.
Optimization procedures are iterative, i.e. in each step i the solution, the estimated values of the parameters, represents an improved approximation of the true parameter. Whether a procedure converges to a global or local optimum depends on the initial guess about the parameter values. Various convergence criteria can be used to decide when a procedure reaches the optimum. In each iteration step i a new set of parameters is calculated as

    pᵢ₊₁ = pᵢ + sᵢ dᵢ

where sᵢ is the step size and dᵢ (a p-dimensional vector) specifies the step direction to be taken in moving from pᵢ to pᵢ₊₁. The optimization methods differ from each other in the way they determine dᵢ and sᵢ. In direct search optimization the choice of the step direction and step size depends only on the values of the function, while in gradient optimization they are calculated from derivatives. The problem of optimization is often referred to as mathematical programming. If both the objective function and the constraints are linear functions of the unknown parameters, the optimization is called linear programming, otherwise it is called nonlinear programming. An important problem in optimization, one which must be dealt with, is numerical error.
optimization clustering [CLUS] → non-hierarchical clustering
b
order of a matrix [ALGEI
+ matrix b order of a model [MODEL] + model b order statistic [DESCI + rank b
ordinal scale [PREP]
→ scale
ordinary least squares regression (OLS) [REGR] (: least squares regression, linear least squares regression, multiple least squares regression, multivariate least squares regression) The most popular regression method, in which the model is linear in both the parameters and the predictors

    y = X b + e

and the regression coefficients b are calculated with the least squares estimator solving the normal equations:

    b̂ = (XᵀX)⁻¹ Xᵀ y

These coefficients are estimated by minimizing the residual sum of squares or, equivalently, by maximizing the correlation between the linear combination of the predictors and the response: max_b [corr²(y, Xb)]. OLS is BLUE, i.e. it is an unbiased linear estimator that has the smallest variance among all unbiased linear estimators if the normality assumption on the residuals holds and if the true relationship is linear. If the normality assumption is violated the least squares estimator is not necessarily the optimal unbiased one. Generalized least squares regression can be applied to deal with heteroscedasticity and correlated errors. The estimated regression coefficients are unbiased; their variance is a function of s, the residual standard deviation, and of λₘ, the mth eigenvalue of (XᵀX). This method calculates an unbiased model; however, it is suitable only for well-determined systems and is very sensitive to outliers. The estimated response and any statistic related to it is scale invariant. It is a linear estimator, therefore the cross-validated residuals can be calculated as:

    eᵢ(cv) = eᵢ / (1 − hᵢᵢ)

where hᵢᵢ are the diagonals of the hat matrix.
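A sketch of the OLS estimator, hat matrix and cross-validated residuals in matrix form with numpy; the data are simulated only for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # predictors with intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)       # least squares estimator (X'X)^-1 X'y
H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix
e = y - X @ b                               # ordinary residuals
e_cv = e / (1.0 - np.diag(H))               # cross-validated (leave-one-out) residuals
print(b)
print((e_cv ** 2).sum())                    # PRESS statistic
```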
ordinary residual [REGR] + residual b
b
ordinary residual plot [GRAPH]
+ scatter plot (O residual plot) b orthoblique rotation [FACT] + factor rotation
orthogonal design [EXDE] + design b
b
orthogonal matrix [ALGE] matrix
--+
orthogonal matrix transformation [ALGE] (: orthogonalization) Transformation of a square matrix X into an upper triangular matrix R via multiplication with an orthogonal matrix Q:

    R = Qᵀ X

The above transformation can also be viewed as a Q-R decomposition of X:

    X = Q R

Each method expresses the columns of X in terms of a new orthonormal basis, contained in the columns of Q. Orthogonal matrix transformations, for example, afford an easier solution to the least squares regression problem and provide a numerically more stable solution. The orthogonal matrix Q can be calculated in several ways.
Givens transformation Orthogonal matrix transformation that brings X to an upper triangular form through a series of orthogonal transformations, i.e. the transformation matrix Q is a product of orthogonal matrices Gᵢⱼ. In contrast to the Householder transformation, each step sets to zero only one sub-diagonal element, i.e. Gᵢⱼ operates on column j of X, annihilating element i in that column. The matrix Gᵢⱼ differs from the identity matrix in only four elements: these are the sine and cosine of a rotation angle chosen to annihilate the target element. Although the Householder transformation is preferred for general use, because fewer computations are required and slightly better results are obtained, the Givens transformation is recommended for large, very sparse (many elements are zero) X matrices.
Gram-Schmidt transformation Orthogonal matrix transformation that brings X to an upper triangular form. In contrast to the Householder transformation, which does not calculate Q explicitly, but only the matrices Hⱼ, here in each step one calculates a new column of Q, orthogonal to the previous ones.
Householder transformation The most important orthogonal matrix transformation that brings X to an upper triangular form through a series of orthogonal transformations, i.e. the transformation matrix Q is a product of p orthogonal matrices Hⱼ. The matrix Hⱼ is symmetric,
orthogonal and operates on column j of X, annihilating all sub-diagonal elements:

    Hⱼ = I − 2 v vᵀ / (vᵀ v)
This transformation is used in many algorithms, e.g. in bidiagonalization or tridiagonalization of a matrix before performing an eigenanalysis.
orthogonal rotation [FACT] → factor rotation
orthogonal vector transformation [ALGE] Let G be an orthogonal matrix of rank p and nⱼ an orthonormal basis of p vectors. Then
    gⱼ = G nⱼ

forms a new orthonormal basis. Any vector x can be represented in terms of both the old and the new basis as

    x = Σⱼ xⱼ nⱼ = Σⱼ x′ⱼ gⱼ

where xⱼ indicates the old coordinates and x′ⱼ the new transformed coordinates.
orthogonal vectors [ALGE] → vector
orthogonalization [ALGE]
: orthogonal matrix transformation
orthomax rotation [FACT] + factor rotation b
orthonormal basis [ALGE] A set of p vectors that are orthogonal vectors and have their norm equal to one:

    nⱼᵀ nₖ = 0  if  j ≠ k        and        ||nⱼ|| = 1        j = 1, p

Without imposing the orthogonality and normality restrictions, any set of linearly independent vectors n₁, ..., n_p, such that each vector in the space can be written as a linear combination of n₁, ..., n_p, is simply called a basis. The minimum number of vectors needed to form a basis is equal to the dimension of the vector space. A p-dimensional vector x can be represented as

    x = Σⱼ xⱼ nⱼ

where xⱼ is called the jth coordinate. The orthonormal basis is used, for example, in orthogonal vector transformation.
outlier [DESC] Observation that is separated from the bulk of the data. Outliers in the sample may distort statistics calculated from such a sample. Outliers are often observations from a population different from the sampled one. Their effect on an estimator is measured by the breakdown point and the influence curve. Outliers must be detected and tested to determine whether they should be discarded before modeling. Influence analysis detects outliers in the predictor space, called leverage points. Robust estimators are often used to mitigate the distorting effect of potential outliers.
b
output layer [MISC] neural network
b
-+
outranking method [QUAL] multicriteria decision making
b
overfitting [MODEL]
---, model fitting
overlay plot [QUAL] + multicriteria decision making b
b
p-chart [QUAL] → control chart (◦ attribute control chart)
p-norm [ALGE] -+ norm b
parameter design [QUAL] Design for investigating the response of a process which is independent of or at least insensitive to different sources of variation. This design is different from the experimental design, in which different sources of variation are investigated. The goal of the parameter design is to model a response by making it independent from different sources of variation rather than controlling those sources. b
b
parameter estimation [ESTIM] estimation
parametric classification [CLAS] Classification where the classification rule is expressed as a function of a relatively small number of parameters of the class density functions, which are usually assumed to be normal (e.g. DASCO, LDA, QDA, RDA, SIMCA, UNEQ). Its opposite is nonparametric classification, where no particular form of class density function is assumed and it is estimated directly from the training set (e.g. ALLOC, CART, KNN).
parametric estimator [ESTIM] → estimator
Pareto analysis [QUAL] + qualitycost b
b
Pareto chart [QUALI quality cost
-+ b
Pareto distribution [PROB]
+ distribution b
Pareto optimality criterion [QUAL] multicriteria decision making
-+
Parks distance [GEOMI -+ distance ( 0 mixed type data) b
b
parsimonious model [MODEL] model
-+
b
partial confounding [EXDE]
+ confounding b
partial correlation [DESC] correlation
-+
partial factorial design [EXDE] -+ design b
b partial least squares regression (PLS)[REGR] Biased regression method that relates a set of predictor variables X to a set of response variables Y. A least squares regression is performed on a set of uncorrelated, so called latent variables T that are a standardized linear combinations
of the original predictor variables. PLS models the response(s) as

    Y = T B Qᵀ + E

where T contains the latent variables, B is a diagonal matrix containing the least squares coefficients of the latent variables, Q contains the weights of the responses and E is the error matrix. The latent variables are calculated one at a time, maximizing their covariance:

    max [corr²(uₘ, tₘ) var(uₘ) var(tₘ)]        m = 1, M

where

    tₘ = X wₘ        and        uₘ = Y qₘ

The connections among the various matrices of the PLS algorithm are shown in the following scheme. A PLS component m is calculated as:

    1. Initialize:            uₘ = y₁ (a column of Y)
    2. X weight:              wₘ = Xᵀ uₘ                    scale to length 1
    3. X latent variable:     tₘ = X wₘ
    4. Y weight:              qₘ = Yᵀ tₘ                    scale to length 1
    5. U latent variable:     uₘ* = Y qₘ
    6. Check convergence:     if ||uₘ* − uₘ|| ≤ ε then go to 7, else set uₘ = uₘ* and go to 2
    7. Inner relation:        bₘ = uₘᵀ tₘ / (tₘᵀ tₘ)
    8. X loadings:            pₘ = Xᵀ tₘ / (tₘᵀ tₘ)
    9. X residuals:           X = X − tₘ pₘᵀ
    10. Y residuals:          Y = Y − bₘ tₘ qₘᵀ
The bias-variance trade-off is controlled by the number of latent variables M: the more latent variables that are included, the larger is the variance and the smaller is the bias of the estimated regression coefficients. As the number of latent variables increases (max[M] = p), the PLS model converges to the OLS model. The optimal number of components must be determined by a model selection criterion, e.g. cross-validation.
A modification of the PLS regression that permits the calculation of the latent variables as a linear combination of a subset of the predictor variables is called intermediate least squares regression (ILS). A special case of ILS is the forward selection method, i.e. when a latent variable consists of only one predictor variable. An extension of the two-block PLS is the multiblock PLS model, in which several predictor and response blocks are connected by regressing their latent variables on each other according to a prespecified path. There are several nonlinear extensions of the linear PLS model, nonlinearizing the inner relationship, i.e. assuming a nonlinear function among the latent variables:

    uₘ = f(tₘ) + e

The quadratic partial least squares regression (QPLS) fits a second-order polynomial: f(tₘ) = a₀ + a₁ tₘ + a₂ tₘ². The nonlinear partial least squares regression (NLPLS) applies the smoother to estimate the above nonlinear functions. The spline partial least squares regression (SPLS) uses a nonadaptive bivariate regression spline to approximate f. PLS is used extensively in the context of experimental design. The generating optimal linear PLS estimation (GOLPE) method performs a variable selection on the basis of D-optimal design in the loading space to obtain the best predictive model. The computer aided response surface optimization (CARSO) method uses PLS to fit a linear or quadratic response surface.
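A sketch of a two-component PLS fit with scikit-learn on simulated data; in practice the number of latent variables would be chosen by cross-validation, as noted above.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(9)
X = rng.normal(size=(40, 6))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.3, size=40)   # hypothetical response

pls = PLSRegression(n_components=2)       # M = 2 latent variables
pls.fit(X, y)
print(pls.x_scores_.shape)                # latent variables T (40, 2)
print(pls.coef_.shape)                    # regression coefficients in original X units
print(pls.score(X, y))                    # R^2 of the fit (resubstitution, not prediction)
```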
b partial pivoting [ALGE] + Gaussian elimination
b
partitioning clustering [CLUS]
: non-hierarchical clustering b partitioning of a matrix [ALGE] + matrix operation b Pascal distribution [PROB] + distribution
b path [MISC] + graph theory
b path analysis [MULT] Method for solving a causal model, i.e. studying patterns of causation among a set of variables, in which certain variables are viewed as a cause, others as an effect.
This is a special case of LISREL. The analysis is based on the following assumptions: - relationships are linear and additive; - all error terms are uncorrelated with each other; - models are recursive, only one-way causal flow is allowed; - endogenous variables are measured on an interval or ratio scale; - manifest variables are measured without error; - the model is properly specified, with all causal relationships included. Each equation is treated separately, i.e. errors in estimating the parameters of one equation do not affect the estimation of parameters of other equations. The parameters of the model are usually estimated by least squares. Path coefficients, calculated from the regression coefficients, indicate the effect of a causal variable on an effect variable. A path diagram is often used to enhance the analysis graphically. b path diagram [MULT] Graphical representation of the relationship among variables in a causal model. The diagram shown below corresponds to the following equations:
    η₁ = β₁ η₂ + γ₁ ξ₁ + ζ₁
    η₂ = β₂ η₁ + ζ₂
    x₁ = λ₁ ξ₁ + δ₁
    x₂ = λ₂ ξ₁ + δ₂
    y₁ = λ₃ η₁ + ε₁
    y₂ = λ₄ η₂ + ε₂
    y₃ = λ₅ η₂ + ε₃

(figure: path diagram with latent exogenous variable ξ₁, latent endogenous variables η₁, η₂ and manifest variables x₁, x₂, y₁, y₂, y₃)
The conventional notation is the following:
- x: manifest exogenous variable
- y: manifest endogenous variable
- ξ: latent exogenous variable
- η: latent endogenous variable
- β: effect of endogenous variable on endogenous variable
- γ: effect of exogenous variable on endogenous variable
- ζ: error in η
- δ: error in measuring x
- ε: error in measuring y
- square: manifest variable
- circle: latent variable
- one-way arrow: one-way direct causal effect
- two-way arrow: potential correlation
Models (e.g. the above example) that postulate reciprocal causation between endogenous variables (indicated by two-way arrows) are called non-recursive, while models postulating one-way causal flows (indicated by one-way arrows) are called recursive.
pattern recognition (PARC) [MULT] Process for developing a decision rule on the basis of a statistical sampling, and using this rule to assign a class label to objects either from the same sample (recognition) or from a future sample (prediction). Two branches of pattern recognition are supervised pattern recognition (also called supervised learning), usually referring to classification, and unsupervised pattern recognition (also called unsupervised learning), usually referring to cluster analysis. The difference is that in a classification problem the classes are predetermined and a training set of labeled (classified) objects representing the classes is available, whereas it is the task of cluster analysis to define or to search for the classes on the basis of unlabeled objects. b
pattern recognition by independent multicategory analysis (PRIMA)[CLAS] + centroid classification b
pattern space [GEOM] : hyper space
b
b Pearson distance [GEOM] + distance (o quantitative data)
Pearson’s correlation coefficient [DESC] + correlation b
b Pearson’s first index [DESC] + skewness
b Pearson’s second index [DESC] + skewness
b
percent data [PREP] (0 closed data)
+ data
b percentage scale [PREP] + scale ( 0 ratioscale)
b
percentile [DESC]
+ quantile b
periodogram [GRAPH]
Graphical summary of a time series on the basis of superposition of sinusoid waves of various frequencies.
The periodogram of x(t) is the plot of P(ω) against ω, where 0 ≤ ω ≤ π and

    P(ω) = (2/n) [ (Σₜ xₜ cos ωt)² + (Σₜ xₜ sin ωt)² ]

This plot helps to detect strictly cyclic patterns in the data.
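A sketch of a periodogram computed with scipy.signal for a noisy sinusoid (data made up); note that scipy reports frequency in cycles per sampling interval rather than angular frequency ω.

```python
import numpy as np
from scipy.signal import periodogram

rng = np.random.default_rng(10)
t = np.arange(500)
x = np.sin(2 * np.pi * 0.05 * t) + rng.normal(scale=0.5, size=t.size)  # cycle every 20 points

freq, power = periodogram(x)
print(freq[np.argmax(power)])   # peak close to 0.05, revealing the cyclic pattern
```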
pictogram [GRAPH]
Similar to bar plot, graphical display of the frequency distribution of a categorical variable. Instead of vertical bars, however, the relative frequencies of the values taken by the categorical variable are represented by the size of graphical objects related to the problem, e.g. humans, bottles, animals, houses, etc.
t pie chart [GRAPH] Graphical display of the frequency distribution of a categorical variable. It is a circular figure with slices of varying size reflecting the relative frequencies of the values (A, B, C, D, E ) taken by the categorical variable. It is customary to arrange the values of the variable from the one most frequently occurring to the one least frequently occurring.
t
Pillai’s trace test [FACT]
+ rank analysis
Pitman’s test [TEST] + hypothesis test b
F pivot [ALGE] + Gaussian elimination b pixel [MULT] + image analysis b
Plackett-Burman design [EXDE]
+ design b platykurtic [DESC] + kurtosis b
plot [EXDE]
+ design matrix b
plot [GRAPH]
: scatter plot
b point estimator [ESTIM] + estimator
point proportion admissibility [CLUS] + assessment of clustering (0 admissibility properties) b
b
Poisson distribution [PROB]
+ distribution b
Poisson process [TIME] counting process)
+ stochastic process (
b polar-wedge diagram [GRAPH] + histogram b
polynomial regression model [REGR]
+ regression model (0 nonlinear regression model) b
polythetic clustering [CLUS]
+ cluster analysis
pooled covariance matrix [DESC] + covariance matrix b
b population [PROB] Finite or infinite set of individuals (objects, items). A population implicitly contains all the useful information for calculating the true values of the population parameters. The affinity is a measure of the distance between two populations. A sample is an (ideally representative) subset of a population that is collected in order to estimate the properties of the underlying population. A sample is called a random sample if it has been drawn in such a way that all individuals had the same chance to be chosen. The number of individuals in a sample is called the sample size. The sample is obtained by some selection process, called sampling, or by simulation. The translation of the estimated properties, predictions, and decisions from sample to population is called inference. b positive definite matrix [ALGE] + matrix b
-+
positive matrix [ALGE] matrix
b positive semi-definite matrix [ALGE] --+ matrix
post-processing [PREP] + preprocessing posterior class probability [CLAS] + Bayes’ rule
posterior probability [PROB] + probability b
potential function classifier [CLAS] (: kernel class$er) Nonparametric classification method which calculates the classification rule on the basis of potential density functions. Around each object a multivariate density function (kernel) is assumed and each class density function is calculated as the sum of the densities of the objects in that class.
An important step is choosing the width of the kernel density functions. This determines the smoothness of the global class density functions. An object is assigned to the class with the highest density at that point. One of the most popular potential function classifiers is ALLOC, which uses a Gaussian kernel density estimator.
b power curve [TEST] + hypothesis testing
t power function distribution [PROB] + distribution t
power method [ALGE] eigenanalysis
-+
t power of a test [TEST] -+ hypothesis testing
precedence test [TEST] -3 hypothesis test t
t predicted value -+ prediction t
[MODEL]
prediction [MODEL]
Process for calculating the value of a variable on the basis of a statistical model and observed data. For example calculating a response value from a regression model. Such a value is called a predicted value. t
predictive residual [REGR] residual
-+
t
predictive residual plot [GRAPH] scatter plot (0 residual plot)
-+
t
predictive residual sum of squares (PRESS) [MODEL] goodness of prediction
-+
t
predictive squared error (PSE) [MODEL] goodness of prediction
-+
t
predictor variable [PREP] variable
-3
t Pregibon plot [GRAPH] + scatter plot ( 0 residual plot)
t
preprocessing [PREP]
Operation performed on the data set before a statistical model is calculated. Preprocessing can be based on a priori knowledge about the data, on assumptions underlying the statistical model, or on practical experience. Common preprocessing
techniques are the missing value treatment, a priori deletion of obvious outlier obiects, a priori deletion of variables considered not relevant to the problem, transformation and standardization of the variables, splitting the data set into training and evaluation sets, and assigning weights. As opposed to preprocessing, the term post-processing is used for operations performed on the data set or on the results obtained after a statistical model has been calculated. Examples are factor rotation, post-pruning techniques in simplifying classification trees, outlier deletion and variable selection based on the modeling results. b
prevention cost [QUALI
+ qualitycost b principal component (PC) [FACT] + common factor
b principal component analysis (PCA) [FACT] (: component analysis, Karhunen-Loeve expansion) Factor extraction method in factor analysis, sometimes referred to as eigenanalysis. PCA, in contrast to other factor extraction methods, models the p variables in X as linear combinations of the common factors %(n,M), here called principal components (PC), without a unique factor:
The sum of the variance of all the p components equals the variance of the original variables. The principal components are also linear combinations of the variables:
The components are calculated according to the maximum variance criterion, i.e. each successive component is an orthogonal linear combination of the original variables such that it covers the maximum of the variance not accounted for by the previous components. The properties of the principal components are:
    E[tₘ] = 0        V[tₘ] = λₘ        C[tₘ, tₘ′] = 0   (m ≠ m′)

    Σₘ V[tₘ] = tr(S)        Πₘ V[tₘ] = |S|
where E denotes expected value, V denotes variance, C denotes covariance, tr denotes trace and |·| denotes determinant. The factor loadings L(p,M) of the components are the eigenvectors of the covariance matrix S scaled by the square root of the corresponding eigenvalues. Because the solution is scale dependent, the variables are usually autoscaled before the analysis, i.e. the components are extracted from the correlation matrix. The projection of p-dimensional objects onto the first few (usually two) principal component axes is called the principal component projection (PCP) or Karhunen-Loeve projection. The graphical display of such a projection is called a principal component plot. PCA is the basis, for example, of principal component regression and of some classification methods (SIMCA, DASCO, RDA, CLASSY).
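A sketch of principal component extraction by eigenanalysis of the correlation matrix (autoscaled variables), with the variance properties from the entry checked numerically; the data are simulated only for illustration.

```python
import numpy as np

rng = np.random.default_rng(11)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))   # correlated simulated data
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)          # autoscaling

R = np.corrcoef(Z, rowvar=False)                  # correlation matrix
eigval, eigvec = np.linalg.eigh(R)                # eigenanalysis
order = np.argsort(eigval)[::-1]                  # sort by decreasing variance
eigval, eigvec = eigval[order], eigvec[:, order]

T = Z @ eigvec                                    # principal components (scores)
print(T.var(axis=0, ddof=1))                      # component variances = eigenvalues
print(eigval.sum(), np.trace(R))                  # total variation is conserved
```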
principal component plot [GRAPH]
+ scatter plot
principal component projection (PCP) [FACT] + principal component analysis b
b principal component regression (PCR) [REGR] Biased regression method that models the response variable as a linear combination of the principal components
    y = X L_H L_Hᵀ b + e

where L_H is the eigenvector matrix in the high variance subspace of (XᵀX). The principal components are orthogonal linear combinations of the predictor variables with the eigenvectors of the correlation or covariance matrix as coefficients. They are calculated with the criterion of maximum variance:

    max [var(X lₘ)]        m = 1, M

The estimated PCR regression coefficients are:

    b̂ = L_H a = L_H Λ_H⁻¹ L_Hᵀ Xᵀ y

where a is the vector of least squares regression coefficients on the principal components, and Λ_H contains the eigenvalues in the high variance subspace of (XᵀX). The bias-variance trade-off is controlled by the number of principal components M: the more components that are included, the larger the variance and the smaller
the bias of the estimated regression coefficients. As the number of components increases (max[M] = p), the PCR model converges to the OLS model. The optimal number of components must be estimated by a model selection criterion, e.g. cross-validation. The bias and variance of the estimated regression coefficients are:

    bias(b̂) = −L_L L_Lᵀ β        V(b̂) = s² L_H Λ_H⁻¹ L_Hᵀ

where L_L and L_H are the eigenvector matrices of (XᵀX) in the low and high variance subspaces, respectively; s is the residual standard deviation and β is the true regression coefficient vector.
principal coordinate analysis [MULT] Calculation of p coordinates of n objects in a p-dimensional Euclidean space from inter-object distances d_st:
4t
+
s,t=l,n
Xij
i=l,n
j=l,p
The values Xij are calculated in a two-stage procedure. First the distance matrix
D is transformed into a symmetric matrix A, such that A =X X ~
et
= C x s j xtj j
with S
t
The elements astcan be calculated as = -0.5(&
-&
-;it
+d)
with
-
4 = ;it= -n1 C s
1 ;i= Xst n2 s t The second stage is the diagonalization of matrix A: 4t
7y
A=VAV~
where V is the eigenvector matrix and A contains the eigenvalues on the diagonal. Finally, the matrix X is calculated as:
The axes of the configuration obtained by principal coordinate analysis correspond to the principal component axes. In fact, the same solution can be obtained by carrying out a principal component analysis of the covariance
matrix, when it is assumed that the centroid of the objects lies at the origin of the coordinates.
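A minimal sketch of the two-stage procedure above (double centring of the squared distances, then diagonalization); the function name and the toy distance matrix are illustrative only:

```python
import numpy as np

def principal_coordinates(D, n_axes=2):
    """Coordinates of n objects from a matrix of inter-object distances."""
    D2 = D ** 2
    # double centring of the squared distances gives A = X X^T
    A = -0.5 * (D2 - D2.mean(axis=1, keepdims=True)
                   - D2.mean(axis=0, keepdims=True) + D2.mean())
    eigval, eigvec = np.linalg.eigh(A)
    order = np.argsort(eigval)[::-1][:n_axes]
    return eigvec[:, order] * np.sqrt(np.clip(eigval[order], 0, None))

# usage sketch: distances between four points on a line
pts = [0.0, 1.0, 3.0, 6.0]
D = np.abs(np.subtract.outer(pts, pts))
coords = principal_coordinates(D, n_axes=1)
```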
b principal factor analysis (PFA) [FACT] Factor extraction method which performs principal component analysis of the reduced correlation matrix. The rank of this matrix is always lower than the rank of the observed correlation matrix. There are estimated communalities on the diagonal, instead of ones, that are most commonly calculated as the squared multiple correlation coefficient:

h_j² = 1 - 1/r^jj      j = 1, p

where r^jj is the jth diagonal element of the inverse correlation matrix. Iterative principal factor analysis, also called minres, minimizes the residual correlation matrix according to the least squares criterion:

min_L [(R - I) - (L L^T - diag(L L^T))]
The factor loadings and the unique factors are calculated in an iterative procedure, where M, the number of factors, is prespecified, the common factors are assumed to be orthogonal, and the communalities are restricted to be between zero and one.

b principal fraction [EXDE] → fraction

b prior class probability [CLAS] The probability of the occurrence of class g, based on outside knowledge, without considering any data. When such information is not available, there are two common ways to estimate the prior probability: either the priors are taken to be equal or they are taken to be proportional to the number of objects of the class in the training set:
P_g = 1/G      or      P_g = n_g / n

b prior probability [PROB] → probability
b probabilistic classification [CLAS] Classification in which not only simple class assignments but also posterior class probabilities are calculated and the objects are assigned to the class of highest posterior probability. The posterior class probabilities of an object indicate its degrees of class membership. In contrast, when only the class assignments of objects
are given without any probabilistic information, the method is called categorical classification, or nonprobabilistic classification.

b probability [PROB] Positive number between zero and one that measures the chance that an event will occur: low numbers for rare events, high numbers for common or sure events. More formally, the probability P(A) of an event A is the ratio of the number of times it occurs to the total number of trials. The joint probability measures the chance that two or more events will occur concurrently, e.g. P(A ∩ B) denotes the probability of A and B occurring together. The probability of an event B given that another event A occurs, called the conditional probability, is defined as:

P(B | A) = P(A ∩ B) / P(A)
The probability assigned to an event before any trial is called the prior probability, in contrast to the posterior probability, which is the probability assigned to the event after some trial. The prior and posterior probabilities are related according to Bayes' theorem.
b probability density function (pdf) [PROB] → random variable

b probability function [PROB] → random variable

b probability limits [QUAL] → control chart

b probability plot [GRAPH] → quantile plot

b probability theory [PROB] Branch of mathematics dealing with probability, random variable, distribution, population, likelihood.

b probit transformation [PREP] → transformation
b process capability ratio [QUAL] Measure of the ability of the process to manufacture a product that meets certain specifications. It is defined for two-sided specification as:
pcr = (USL - LSL) / 6σ

and for one-sided specification as:

pcr = (USL - μ) / 3σ      or      pcr = (μ - LSL) / 3σ

where USL and LSL are the upper and lower specification limits, respectively, μ is the process mean and σ is the process standard deviation.
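A minimal sketch of the two-sided ratio above, estimating σ from sample data; the specification limits and variable names are illustrative:

```python
import numpy as np

def process_capability(x, lsl, usl):
    """Two-sided process capability ratio estimated from sample data."""
    sigma = x.std(ddof=1)
    return (usl - lsl) / (6 * sigma)

x = np.random.default_rng(2).normal(loc=10.0, scale=0.1, size=200)
print(process_capability(x, lsl=9.7, usl=10.3))   # ~1 means the process just fits the specification
```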
b process variable [PREP] → variable

b Procrustes analysis [MULT] Method for comparing two configurations A and B, each defined by a data matrix. For example, to compare different scaling methods, different judges' descriptions, different measures of pairwise dissimilarity, etc. This method can also be used to reveal objects whose relative positions vary considerably in different configurations, to assess the robustness of the results. The measure of similarity between the two configurations is the sum of the squared distances between corresponding points in the two configurations:
Δ²(X_A, X_B) = Σ_i Σ_j (x_ij^A - x_ij^B)² = tr[(X_A - X_B)^T (X_A - X_B)]
There are three kinds of transformations used to match the two configurations: translation of the origin, rotation and/or reflection, and uniform dilation.
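A sketch of matching one configuration to another with the three transformations listed above, using the SVD-based orthogonal Procrustes solution; function names and the test configurations are illustrative, not the handbook's notation:

```python
import numpy as np

def procrustes_distance(A, B):
    """Match B to A by translation, rotation/reflection and uniform dilation,
    then return the sum of squared distances between corresponding points."""
    A0 = A - A.mean(axis=0)                 # translation of the origin
    B0 = B - B.mean(axis=0)
    U, s, Vt = np.linalg.svd(B0.T @ A0)
    R = U @ Vt                              # rotation and/or reflection
    scale = s.sum() / np.trace(B0.T @ B0)   # uniform dilation
    B_fit = scale * B0 @ R
    return np.sum((A0 - B_fit) ** 2)

A = np.random.default_rng(3).normal(size=(10, 2))
B = 2.0 * A @ np.array([[0.0, -1.0], [1.0, 0.0]]) + 5.0   # rotated, scaled, shifted copy
print(procrustes_distance(A, B))                          # ~0: same configuration
```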
b Procrustes transformation [FACT] → factor rotation
b producer's risk [QUAL] The probability of rejecting a good lot, i.e. the type I error in hypothesis testing. The acceptable quality level (AQL) is a measure of good quality associated with the producer's risk, defined as the maximum percentage of defective items in a lot that is still accepted by the producer. Expressed as a percentage per unit of time, the measure is called the acceptable reliability level. Similarly, the consumer's risk is the probability of accepting a bad lot, i.e. the type II error in hypothesis testing. The lot tolerance percent defective (LTPD) (also called rejectable quality level or limiting quality level) is a measure of bad quality associated with the consumer's risk, defined as the maximum percentage of defective items in a lot that is still acceptable to the consumer.
b product moment correlation coefficient [DESC] → correlation

b profile [PREP] → standardization

b profile symbol [GRAPH] → graphical symbol

b projection method [MULT] → data reduction
b projection pursuit (PP) [MULT] Mapping method to find revealing low- (two- or three-) dimensional representations of high-dimensional data. The intention is to discover views of a multivariate data set that exhibit nonlinear effects, like clustering and concentrations near nonlinear manifolds, that are not captured by the linear correlation structure. A numerical index, called the projection index, is assigned to every projection. The projection index characterizes the amount of structure present in the projection by measuring the nonnormality of the distribution of the data points. This index is maximized via numerical optimization with respect to the parameters defining the projections. The analysis finds many projections with local maxima of the projection index by repeatedly invoking the optimization procedure, each time removing the solutions previously found. Interactive computer graphics is often used in combination with this method for visual pattern recognition.
b projection pursuit regression (PPR) [REGR] (: smooth multiple additive regression technique)

b promax rotation [FACT] → factor rotation

b proportional data [PREP] → data (closed data)

b proportional scale [PREP] → scale

b prototype classification [CLAS] → centroid classification

b proximity measure [GEOM] → distance

b pseudo distance [GEOM] → distance

b pseudo inverse [ALGE] (inverse of a matrix) → matrix operation

b Q-analysis [MULT] → multivariate analysis
b Q-R decomposition [ALGE] → matrix decomposition

b quadratic discriminant analysis (QDA) [CLAS] → discriminant analysis
b quadratic estimator [ESTIM] → estimator

b quadratic form [ALGE] Defined for a p-dimensional vector x as:

Q(x) = x^T A x = Σ_s Σ_t a_st x_s x_t = a_11 x_1² + ... + a_pp x_p² + 2 a_12 x_1 x_2 + ... + 2 a_{p-1,p} x_{p-1} x_p
where A(p,p) is a symmetric matrix. Q(x) is called a positive definite quadratic form if Q(x) > 0, and a semi-positive definite quadratic form if Q(x) ≥ 0. For example, the Mahalanobis distance is a quadratic form, where A is often the inverse of the covariance matrix and x is the vector of the coordinate differences. Canonical analysis converts a quadratic form into a weighted sum of squares form without cross-product terms, called the canonical form. The procedure consists of centering and of rotating the axes in order to decorrelate them.
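A small sketch of a quadratic form, using the squared Mahalanobis distance as the example mentioned above; the data and helper name are illustrative:

```python
import numpy as np

def quadratic_form(A, x):
    """Q(x) = x^T A x for a symmetric matrix A."""
    return float(x @ A @ x)

# squared Mahalanobis distance as a quadratic form (A = inverse covariance matrix)
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X[0] - X.mean(axis=0)
d2 = quadratic_form(S_inv, diff)      # distance of the first object from the centroid
```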
b quadratic partial least squares regression (QPLS) [REGR] → partial least squares regression

b quadratic regression model [REGR] → regression model
b quadratic score [CLAS] → classification

b qualitative data [PREP] → data

b qualitative variable [PREP] → variable

b quality characteristic [QUAL] → lot

b quality control (QC) [QUAL] (: statistical quality control) Statistical analysis of data collected to measure the quality characteristics of the product, compare them with established standards and take action to ensure the required product quality. The primary goal is the systematic reduction of variability in the quality characteristics. A process can be characterized by the operating characteristic curve and the process capability ratio, monitored by a control chart, and investigated by parameter design. The quality of a lot can be estimated by acceptance sampling; the type I error is measured by the producer's risk. Defects of different severity are dealt with in a demerit system. Multicriteria decision making optimizes more than one quality criterion. The cost of producing conforming products is quantified by the quality cost. The Taguchi method is a novel approach to quality control.
b quality control chart [QUAL] : control chart

b quality cost [QUAL] Cost associated with producing, identifying, avoiding, or repairing products that do not conform to the standards. There are four categories of quality cost. The prevention cost is associated with efforts in design and manufacturing to prevent nonconformance. The appraisal cost is associated with measuring and evaluating products, components, and materials to ensure conformance. The internal failure cost is associated with nonconforming products, components, and materials discovered prior to their delivery to the customers. The external failure cost is associated with nonconforming products delivered to customers. Pareto analysis is a quality-cost analysis aiming at cost reduction by identifying quality costs by category, by product, or by type of defect. A Pareto distribution of percentage defective items is calculated and plotted as a bar plot, called a Pareto chart, e.g. by department, by machine, by operator, etc.
Conventionally, the bars are arranged in decreasing order. Usually, a few items on the left represent a substantial amount of the total.
b quantal variable [PREP] → variable
b quantile [DESC] (: fractile) A value within the range of a variable which divides the data set into two groups, such that the fraction of the observations specified by the quantile falls below it and the complementary fraction falls above it. For example, the quantile Q(0.8) indicates a value of a variable for which 80% of the observations lie below and 20% lie above. The most often used quantiles are:
quartiles divide the range into quarters: Q(0.25), Q(0.5), Q(0.75). The jth quartile is the j(n + 1)/4th ordered value.

deciles divide the range into tenths: Q(0.1), Q(0.2), ..., Q(0.9). The jth decile is the j(n + 1)/10th ordered value.

percentiles divide the range into hundredths: Q(0.01), Q(0.02), ..., Q(0.99). The jth percentile is the j(n + 1)/100th ordered value.
[Diagram: the three quartiles Q1 (lower quartile), Q2 (median) and Q3 (upper quartile), with the interquartile range spanning Q1 to Q3.]
The three quartiles are important in measuring scale and location:
- the first quartile Q1 = Q(0.25) is called the lower quartile
- the second quartile Q2 = Q(0.5) is the median, a robust measure of location
- the third quartile Q3 = Q(0.75) is called the upper quartile
The distance between the upper and the lower quartiles is called the interquartile range (Q3 - Q1), which indicates the spread of the bulk of the data. The half-interquartile range 0.5 (Q3 - Q1), also called the quartile deviation, is a robust measure of dispersion. Quantiles are used, for example, in quantile plots and in robust estimators.
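A minimal sketch of the quantile definition above, interpolating the j(n + 1)th ordered value; the helper name is hypothetical:

```python
import numpy as np

def quantile(x, q):
    """Q(q) by linear interpolation of the j(n + 1)-th ordered value."""
    xs = np.sort(x)
    pos = q * (len(xs) + 1) - 1          # zero-based fractional position
    lo = int(np.clip(np.floor(pos), 0, len(xs) - 1))
    hi = int(np.clip(lo + 1, 0, len(xs) - 1))
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

x = np.array([3.5, 5.0, 1.3, 3.5, 4.0, 1.2, 3.5, 6.7, 3.5])
q1, median, q3 = quantile(x, 0.25), quantile(x, 0.5), quantile(x, 0.75)
iqr = q3 - q1                             # interquartile range
```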
b quantile function [PROB] → random variable

b quantile plot [GRAPH] Plot of the quantiles of a distribution against the corresponding fractions. The scale of the horizontal axis (the fractions of data) ranges between 0 and 1; the scale of the vertical axis is the scale of the variable. Except for the labeling of the horizontal axis, this plot is the same as a plot of x_i vs. i when the x_i values are ordered.
The plot of the quantiles of two distributions against each other is called a quantile-quantile plot. If both sets of quantiles come from empirical distributions, it is called an empirical quantile-quantile plot. If the number of observations is equal for the two empirical distributions, the empirical quantile-quantile plot is simply a plot of one sorted variable against the other. Data points lying along a straight line indicate distributions of similar shape. The intercept of the line indicates the difference in location, the slope shows the difference in scale. When the quantiles on the horizontal axis come from a theoretical distribution the plot is called a theoretical quantile-quantile plot, or more commonly, a probability plot. When the normal distribution is used, this plot is called a normal probability plot. This plot is very effective for testing the assumption of normality of a variable. Departure from a straight line indicates nonnormality; opposite curvature at the two ends indicates long or short tails; a convex or concave curvature is related to asymmetry. The normal residual plot is a theoretical quantile-quantile plot of the quantiles of residuals (obtained, for example, from a regression model and usually standardized or Studentized) against the corresponding quantiles from a normal distribution. Deviation from the normal straight line can result either from nonnormal residuals or from model misspecification, e.g. a nonlinear relationship described by a linear model.
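A sketch of the pairs behind a normal probability plot as described above; the plotting positions (i - 0.5)/n and the function name are illustrative choices, not the handbook's prescription:

```python
import numpy as np
from statistics import NormalDist

def normal_probability_pairs(x):
    """(theoretical normal quantile, ordered data value) pairs for a probability plot."""
    xs = np.sort(x)
    n = len(xs)
    fractions = (np.arange(1, n + 1) - 0.5) / n          # plotting positions
    theo = np.array([NormalDist().inv_cdf(f) for f in fractions])
    return theo, xs

x = np.random.default_rng(5).normal(loc=2.0, scale=3.0, size=100)
theo, xs = normal_probability_pairs(x)
slope, intercept = np.polyfit(theo, xs, 1)    # ~3 (scale) and ~2 (location) for normal data
```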
b quantile-quantile plot [GRAPH] → quantile plot

b quantitative data [PREP] → data

b quantitative variable [PREP] → variable

b quartile [DESC] → quantile

b quartile coefficient of skewness [DESC] → skewness

b quartile deviation [DESC] → dispersion

b quartimax rotation [FACT] → factor rotation

b quartimin rotation [FACT] → factor rotation

b quasi-Newton optimization [OPTIM] : variable metric optimization

b Quenouille's test [TEST] → hypothesis test

R

b R² [MODEL] → goodness of fit

b R-analysis [MULT] → multivariate analysis

b R-chart [QUAL] → control chart (variable control chart)

b R estimator [ESTIM] → estimator

b Rajski's coherence coefficient [GEOM] → distance (ranked data)

b Rajski's distance [GEOM] → distance (ranked data)

b random effect [ANOVA] → term in ANOVA

b random effect model [ANOVA] → term in ANOVA

b random factor [EXDE] → factor

b random number [PROB] → random variable

b random process [TIME] : stochastic process

b random sample [PROB] → population

b random sampling [PROB] → sampling

b random series [PROB] → simulation

b random variable [PREP] → variable
b random variable [PROB] Function that maps events into a set of values. The variate, denoted by X, is a generalization of the random variable. It also has probabilistic properties, but is defined without reference to a particular type of probabilistic experiment. A variate is the set of all random variables that follow a particular probabilistic law. A multivariate is a vector of variates. A random number x associated with a given variate is a number indicating a realization of a random variable belonging to that variate, e.g. X = x. The probability of a variate taking on a value less than or equal to x is denoted P[X ≤ x]. The set of all random numbers that a variate can take is called the range of the variate. The set of all values that P[...] can take is called the probability domain of the variate. There are several functions associated with a variate. The distribution function or cumulative distribution function (cdf) of a variate, denoted F(x), maps the range of the variate into the probability domain of the variate:
F(x) = P[X ≤ x] = α      0 ≤ α ≤ 1

F(x) is the probability that the variate takes a value less than or equal to x. The function F(x) is nondecreasing in x and takes the value one at the maximum of x. The inverse distribution function or quantile function of a variate, denoted G(α), maps the probability domain into the range of the variate:

G(α) = x = G(F(x))

G(α) is the random value x such that the probability that the variate takes a value less than or equal to x is α.
The survival function or reliability function of a variate, denoted S(x), is the complement of F(x), i.e. it is the probability that the variate takes a value greater than x:

S(x) = P[X > x] = 1 - α = 1 - F(x)

The inverse survival function of a variate, denoted Z(α), is the random number that is exceeded by the variate with probability α:

Z(α) = x = Z(S(x)) = G(1 - α)
The probability density function (pdf) or density function or frequency function of a variate, denoted f(x), is the first derivative of the distribution function F(x) with respect to x:

f(x) = dF(x)/dx

Its integral over the interval x_L to x_U is equal to the probability that the variate takes a value in that interval:

∫_{x_L}^{x_U} f(x) dx = P[x_L < X ≤ x_U]

If the variate is discrete, the function f(x) is called the probability function (pf). It is the probability that the variate takes on the value x:

f(x) = P[X = x]

The hazard function or failure function of a variate, denoted h(x), is the ratio of the probability density function to the survival function at x:

h(x) = f(x) / S(x)
Moments are also important functions of a variate. The following terminology may characterize a distribution:
asymmetric distribution Distribution without a central value μ such that f(x - μ) = f(μ - x).
asymptotic distribution The form of a distribution as the sample size tends to infinity. The form of the distribution is called an asymptotic normal distribution when it approaches the normal distribution as the sample size tends to infinity.

bell-shaped distribution Distribution in which the density function has a shape similar to the contour of a bell. Bell-shaped distributions are symmetric. For example, the normal distribution is a bell-shaped distribution.
bimodal distribution Distribution with a density function having two modes, i.e. two maxima. It is often the result of mixing two unimodal distributions.
conditional distribution Distribution of a set of variates at fixed values of another set of variates.

contaminated distribution Mixture of normal distributions with identical locations but different dispersions. Contamination, i.e. observations from a normal distribution with a larger dispersion than the base distribution, causes a heavy (long) tail in the base distribution.

continuous distribution Distribution that is a function of one or more variates measured on a proportional or ratio scale.

discrete distribution Distribution that is a function of one or more variates measured on a nominal or ordinal scale.

empirical distribution Distribution that is a function of a variate describing a sample.

J-shaped distribution Distribution in which the density function has a shape similar to a letter J or a reversed J. J-shaped distributions are skewed.
joint distribution Distribution of two or more variates (especially used for two variates).

marginal distribution Unconditional distribution of a single variate in a multivariate distribution. It does not depend on the values taken by the other variates.
multivariate distribution Joint distribution of several variates.

symmetric distribution Distribution with central value μ, such that f(x - μ) = f(μ - x).

theoretical distribution Distribution that is a function of a variate describing a population.

U-shaped distribution Distribution in which the density function has a shape similar to a letter U.
univariate distribution Distribution of a single variate.

b random walk [TIME] Stochastic process defined as:
x(t) = x(t - 1) + a(t)

where a(t) is an i.i.d. random variable with zero mean and variance σ_a². The mean function of x(t) is μ(t) = 0. Because its autocovariance function γ(t) = t σ_a² increases linearly with time, the random walk is nonstationary. Its autocorrelation function is

ρ(t, t - k) = √[(t - k) / t]
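A small simulation sketch of the process defined above; the path length, seed handling and helper name are illustrative:

```python
import numpy as np

def random_walk(n_steps, sigma=1.0, rng=None):
    """Simulate x(t) = x(t-1) + a(t) with i.i.d. zero-mean increments a(t)."""
    rng = rng or np.random.default_rng()
    return np.cumsum(rng.normal(scale=sigma, size=n_steps))

# the variance at time t grows roughly like t*sigma^2 (nonstationarity), checked over many paths
paths = np.array([random_walk(200, rng=np.random.default_rng(i)) for i in range(500)])
print(paths[:, 49].var(), paths[:, 199].var())   # roughly 50 and 200
```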
b randomization [EXDE] Random assignment of runs to treatments and random ordering of the runs. Randomization is either complete, i.e. runs are randomized along the whole design matrix, or runs are randomized within a block. Randomization is performed to eliminate unforeseen bias and to cancel correlation between adjacent runs. Randomization helps to ensure that the experimental error is an independently distributed random variable.
b randomized block design [ANOVA] Analysis of variance model in which the total sum of squares is composed of the sum of squares due to effects, the sum of squares due to blocking and the sum of squares due to random error. The simplest randomized block design model has one effect A with I levels and one blocking variable B with J levels:
y_ij = μ + A_i + B_j + e_ij      i = 1, I      j = 1, J      n = I·J
There is one observation per treatment in each block. Because the only randomization of treatments is within blocks, the blocks represent a restriction on randomization. This model is additive, i.e. there is no interaction between the effects and blocks. The treatment and block effects are defined as deviations from the grand mean, so that

Σ_i A_i = 0      Σ_j B_j = 0
When the effects and blocks are fixed, the expected mean squares are:

EMS_A = σ² + J Σ_i A_i² / (I - 1)
The model with one effect A and two blocking variables B and C is called a Latin square:

y_ijk = μ + A_i + B_j + C_k + e_ijk
This model is also completely additive, i.e. there is no interaction term between the effect and the two blocking variables.

b randomized block design [EXDE] → design
b randomized design [EXDE] → design
b range [DESC] → dispersion

b range chart [QUAL] → control chart (variable control chart)

b range scaling [PREP] → standardization
b rank [DESC] Ordinal number indicating the position of an object with respect to other objects when they are ordered according to some criterion, such as values assumed by a variable. When ranking objects it may happen that some of them have indistinguishable positions, i.e. they have a tied rank. In this case, the usual solution is to assign equal rank to all objects in the tied group, calculated as the average of ranks assigned ignoring the tie. For example:
x:        3.5  5.0  1.3  3.5  4.0  1.2  3.5  6.7  3.5
rank(x):  4.5  8    2    4.5  7    1    4.5  9    4.5
A statistic calculated on ordered data, i.e. on data with observations arranged in ascending order, is called a rank order statistic or order statistic. Examples are: median, interquartile range, rank correlation.
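A minimal sketch of the tied-rank rule described above (tied observations receive the average of the ranks assigned ignoring the tie); the function name is hypothetical:

```python
import numpy as np

def average_ranks(x):
    """Ranks starting at 1; tied observations get the average of their ranks."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x))
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and x[order[j + 1]] == x[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2 + 1     # average rank of the tied group
        i = j + 1
    return ranks

print(average_ranks([3.5, 5.0, 1.3, 3.5, 4.0, 1.2, 3.5, 6.7, 3.5]))
# [4.5 8.  2.  4.5 7.  1.  4.5 9.  4.5]
```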
b rank analysis [FACT] Collection of techniques for determining the number of factors, i.e. the complexity of a data matrix.

average eigenvalue criterion A factor is significant if its eigenvalue is greater than the average eigenvalue. When variables are autoscaled, i.e. the correlation matrix is factored, this criterion is the eigenvalue-one criterion.

double cross-validation (dcv) Determines M based on an optimization of the predictive ability. It calculates the ratio

PRESS(M) / RSS(M - 1)

with

RSS(M - 1) = Σ_i Σ_j (x_ij - x̂_ij)²      PRESS(M) = Σ_i Σ_j (x_ij - x̂_ij\ij)²

where x̂_ij
is reproduced from an (M - 1)-component model calculated from the complete data matrix, while x̂_ij\ij is reproduced from an M-component model calculated from a data matrix in which elements were deleted diagonally, including the ijth element. The above ratio is compared with either Q = 1 or with a more conservative empirical function
Q = √[ (p - M)(n - M - 1) / ((p - M - 1)(n - M)) ]

A ratio less than Q indicates the optimal M.
eigenvalue-one criterion A factor is significant if its eigenvalue is greater than one. It is the average eigenvalue criterion on autoscaled data.
eigenvalue threshold criterion A factor is significant if its eigenvalue is greater than a specified threshold value. It is a generalization of the average eigenvalue criterion.
fixed percentage of explained variance The number of factors M is determined such that the factor model explains a prespecified fixed percentage of the total variance Σ_m λ_m.
Horn’s method A modification of the average eigenvalue criterion suggesting that the threshold value should not be uniformly the average eigenvalue, but should decrease with increasing rank of factors. The individual thresholds are calculated as eigenvalues from a data matrix of the same size as the matrix analyzed, but filled with random values from a normal distribution with the same variance as the original variables.
Hotelling-Lawley trace test Determines M based on testing the distribution of
imbedded error function Under the assumption that the error is random and identically distributed in the data matrix, the eigenvalues associated with the residual error should be approximately
equal, i.e.

λ_{M+1} = λ_{M+2} = ... = λ_p      m = M + 1, p

In this case the imbedded error can be written as
If the above assumption holds, this function decreases for m = 1, M and increases for m = M + 1, p. The optimal number of factors M is indicated by the minimum of IE. In practice, when the error is not truly random or identically distributed, IE keeps decreasing even at m > M, but at a much slower rate, so the curve of IE vs. m flattens at M.
indicator function : Malinowski's indicator function

likelihood ratio criterion Determines M based on testing the distribution of
Malinowski's indicator function (MIF) (: indicator function) An empirical function of the real error RE or residual standard deviation RSD:

MIF = RE / (p - M)² = RSD / (p - M)² = √[ Σ_{m=M+1,p} λ_m / (n (p - M)) ] / (p - M)²
It shows a minimum when the true number of factors M is reached. This indicator function appears to be more sensitive than the imbedded error function.
Pillai's trace test Determines M based on testing the distribution of

A_P = Σ_{m=1,M} λ_m / (1 + λ_m)
Roy's greatest root test Determines M based on testing the distribution of A_R = λ_1.
scree test Determines M based on the phenomenon that the residual variance levels off when the proper number of common factors is obtained. This leveling off is assessed visually on the scree plot, which is the residual variance, or simply the eigenvalues plotted against the number of common factors.
Wilks' Λ test Determines M based on testing the distribution of

Λ_W = Π_{m=1,M} 1 / (1 + λ_m)
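A small sketch of two of the simpler criteria in the list above (average eigenvalue / eigenvalue-one on autoscaled data, and the cumulative explained variance); the simulated two-factor data set is illustrative only:

```python
import numpy as np

def n_significant_factors(X):
    """Average-eigenvalue (eigenvalue-one) criterion on autoscaled data."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    eigval = np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False))[::-1]
    explained = np.cumsum(eigval) / eigval.sum()          # fixed-percentage criterion
    return int(np.sum(eigval > eigval.mean())), explained

rng = np.random.default_rng(6)
scores = rng.normal(size=(100, 2))                        # two underlying factors
X = scores @ rng.normal(size=(2, 8)) + 0.1 * rng.normal(size=(100, 8))
M, explained = n_significant_factors(X)                   # M is typically 2 here
```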
b rank correlation [DESC] → correlation

b rank deficient matrix [ALGE] → matrix rank

b rank distance [GEOM] → distance (ranked data)

b rank of a matrix [ALGE] : matrix rank

b rank order statistic [DESC] → rank

b rank test [TEST] → hypothesis testing

b ranking variable [PREP] → variable

b rankit transformation [PREP] → transformation
b rank-one update [ALGE] Matrix decomposition of X' calculated from the matrix decomposition of X, where matrix X' can be obtained from X by: adding a rank-one matrix to X; appending a row or column to X; or deleting a row or column from X. In such situations it is more efficient to update the decomposition of X instead of starting the calculation from the beginning. The most straightforward updating is that of the Q-R decomposition.
b ratio scale [PREP] → scale

b Rayleigh distribution [PROB] → distribution
b Rayleigh's test [TEST] → hypothesis test

b real error (RE) [MODEL] → goodness of fit

b real error (RE) [FACT] → error terms in factor analysis

b reciprocal regression model [REGR] → regression model

b reciprocal transformation [PREP] → transformation

b rectangular distribution [PROB] → distribution

b rectangular kernel [ESTIM] → kernel

b rectifying inspection [QUAL] → lot

b recursive residual [REGR] → residual

b reduced correlation matrix [FACT] → principal factor analysis

b reduced model [ANOVA] → analysis of variance

b reduced variable [PREP] → variable

b reflected design [EXDE] → design
b regression analysis [REGR] Collection of statistical methods using a mathematical equation to model the relationship among measured or observed quantities. The goal of this analysis is
twofold: modeling and predicting. The relationship is described in algebraic form as

y = f(x) + e
where x denotes the predictor variable(s), y the response variable(s), f(x) is the systematic part, and e is the random element of the response, also called the model error or residual. To calculate a regression model one must select the structural form f(x) (the most common is the linear regression model), the probabilistic model for the error term (the most common is to assume normality), and the estimator for the regression coefficients (the most common is least squares). Regression analysis is not merely the estimation of model parameters. It also includes the calculation of goodness of fit and goodness of prediction statistics and regression diagnostics, i.e. influence analysis and residual analysis. Besides the well-known ordinary least squares regression model, biased regression, nonlinear regression, and robust regression models are also important. Calibration and the standard addition method are two special applications of regression.
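A minimal sketch of the most common choices mentioned above (linear form, least squares estimator); the simulated data and function name are illustrative:

```python
import numpy as np

def least_squares_fit(X, y):
    """Ordinary least squares estimate of the coefficients in a linear model y = f(x) + e."""
    X1 = np.column_stack([np.ones(len(X)), X])       # add intercept column
    b, *_ = np.linalg.lstsq(X1, y, rcond=None)
    residuals = y - X1 @ b
    return b, residuals

rng = np.random.default_rng(7)
X = rng.normal(size=(40, 2))
y = 1.0 + X @ np.array([2.0, -0.5]) + 0.1 * rng.normal(size=40)
b, e = least_squares_fit(X, y)        # b ~ [1.0, 2.0, -0.5]
```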
b regression coefficient [REGR] (: regression parameter) Coefficient of a predictor or a function of predictors in a regression model. In a linear regression model it is the partial derivative of the response variable with respect to a predictor variable:

b_j = ∂ŷ/∂x_j
It indicates the importance of the corresponding predictor variable in the model. Although the least squares estimator is the most popular method for calculating the regression coefficients, generalized least squares, biased estimators, and robust estimators are also often applied. The variance inflation factor measures the effect of an ill-conditioned predictor matrix on the coefficient estimates. The standardized regression coefficient is the regression coefficient divided by the ratio of the standard deviation of the response to the standard deviation of the corresponding predictor variable, i.e. it is the regression coefficient for autoscaled variables. The constant term in a regression model is called the intercept or offset. It can be considered as the regression coefficient of a dummy predictor variable with all elements being set to one. b regression curve [REGR] Graphical representation of a regression model. For a univariate regression it can be drawn on a plane of the predictor and response variables. If the relationship is linear the regression curve is called a regression line. For multiple regression the model is represented as a regression surface, also called the response surface.
b regression diagnostics [REGR] (: diagnostics) Collection of techniques used to detect and assess the quality and reliability of a regression model. The goal of diagnostics is twofold: recognition of important phenomena due to outliers rather than the bulk of the data, and suggestion of appropriate remedies to find a better regression model. Regression diagnostics is performed to narrow the gap between theoretical assumptions and observed data. In contrast to robust regression, which solves this problem by dampening the effect of outliers, regression diagnostics identifies the outliers and deals with them directly. It looks for model misspecification, departure from the normality assumption and from homoscedasticity of the residuals, collinearity in the predictor variables and influential observations. Residual analysis, comprising numerical and graphical analysis of the ordinary and various derived residuals, is one of the most important parts of regression diagnostics. A collection of statistics, known as influence analysis, helps to reveal the effect of individual observations on the regression model. Assessment of collinearity and its harmful effect on the regression estimates is another task of regression diagnostics.
b regression equation [REGR] : regression model

b regression function [REGR] : regression model

b regression line [REGR] → regression curve

b regression model [REGR] (: regression equation, regression function) Mathematical (usually algebraic) equation, also called structural model, to describe the relationship among predictor and response variables. The graphical representation of a regression model is called a regression curve. A list of regression models of particular interest follows.
exponential regression model Intrinsically linear regression model of the form

y = exp[b^T x] e

Taking the natural logarithm of both sides gives a linear model:

ln(y) = b^T x + ln(e)

first-order regression model Regression model in which the exponents of all variables are 1, e.g. y = b_0 + b_1 x_1 + b_2 x_2. In such a model the following relationship holds:

∂y/∂x_j = b_j
Geometrically this model is a p-dimensional hyperplane, e.g. a straight line if p = 1, a plane if p = 2, etc.
generalized linear model (GLM) Regression model describing the relationship between a response variable and a set of predictor variables. The probability distribution of the response, specified in terms of predictors, must belong to an exponential family, e.g. normal, Poisson, etc. This restriction rules out discretization and various mathematical transformations of the response. The predictor variables are connected to the response only through a linear combination:

η = b_0 + b_1 x_1 + b_2 x_2 + ... + b_p x_p

The expected value of each response y can be expressed as some known function of this linear combination:

E[y] = g(η)
where g is called the link function. Examples are: ANOVA (response of normal distribution and identity link function), OLS (response of normal distribution and identity link function), log-linear models (response of Poisson distribution and the inverse of the link function is the natural logarithm), logit and probit analysis (response of binomial distribution and logit or probit function).
Gompertz growth model Growth model that assumes a linear relationship between the relative growth rate and time. It has a double exponential form:

y = a exp[-b exp(-k x)]

This model has an S-shaped curve which is not symmetrical about its inflection point. The parameter a is the limiting growth; when x = 0, y = a e^(-b).
growth model Nonlinear regression model which describes growth as a function of an increasing independent variable, usually time. In general, growth models are mechanistic, i.e. the model is defined by solving differential equations that represent an assumption about the type of growth. Examples are: logistic, Gompertz, Richards and Weibull growth models.

intrinsically linear regression model Nonlinear regression model that can be transformed into linear form; e.g. logistic model, reciprocal model.
intrinsically nonlinear regression model Nonlinear regression model that cannot be transformed into linear form.

linear regression model Regression model in which the response variable is a linear function of the regression coefficients, i.e. ∂y/∂b_j is not a function of b_j. Examples of linear regression are ordinary least squares regression, ridge regression, stepwise regression, principal components regression, partial least squares regression.

logistic growth model Growth model assuming that the growth rate is proportional to the product of the present size and the future amount of growth:

y = a / (1 + b exp[-k x])

When x = 0 the starting growth value is y = a/(1 + b). The parameter a is called the limiting growth. It is the value that y approaches as x tends to infinity. The values b and k are always positive. The plot of y vs. x has an S shape; the slope of the curve is always positive. (A small numerical sketch of this model is given after this list.)

logistic regression model Nonlinear regression model which describes the probability P of a binary response y of the form:

P = 1 / (1 + exp[-(b_0 + b_1 x)])

This function has an S shape that approaches one asymptotically. It can be linearized by applying the logit function:

ln[P / (1 - P)] = b_0 + b_1 x
multiple regression model : multivariate regression model

multivariate regression model (: multiple regression model) Regression model in which the response is a function of more than one predictor variable. The simplest one is of linear form

y = b_0 + x^T b + e
nonlinear regression model Regression model in which the response variable is a nonlinear function of the regression coefficients. In contrast, a regression in which the response variable is a linear function of the parameters but contains powers or cross-products of the predictor variables is called a polynomial regression model, or second-, third-
(etc.) order regression. Nonlinear regression models can be divided into two groups: parametric and nonparametric. The first group consists of models that have a specific parameterized form arising from the scientific field of the data; these models are suggested by the theory of the subject, e.g. growth models. The second group contains more flexible models in which the form of nonlinearity is not prespecified but estimated from the data. Examples are: ACE, SMART, MARS, nonlinear PLS. In nonparametric methods smoothers and splines are used extensively for the approximation of functions. The least squares estimator is the most popular one for calculating the parameters and functions. Because the parameters enter nonlinearly into the criterion to be minimized, nonlinear optimization techniques must be employed, e.g. the Gauss-Newton method.
quadratic regression model : second-order regression model

reciprocal regression model Intrinsically linear regression model of the form:

y = 1 / (b^T x + e)

Taking reciprocals of both sides gives a linear model:

1/y = b^T x + e

Richards growth model Variation of the logistic growth model including an additional parameter:

y = a / (1 + b exp[-k x])^(1/d)

second-order regression model (: quadratic regression model) Model in which at least one regression coefficient is a linear function of the corresponding predictor, i.e.

∂y/∂x_j = k x_j

This model contains predictors raised to the second power x_j² and/or product terms x_j x_k.
single regression model : univariate regression model

univariate regression model (: single regression model) Regression model in which the response is a function of only one predictor variable. The simplest is the linear form

y = b_0 + b_1 x + e
Weibull growth model Growth model of the form:

y = a - b exp(-c x^d)

The starting growth value at x = 0 is y = a - b. The limiting growth parameter is a.

zero-order regression model Regression model containing only a constant, i.e. it is independent of the predictors:

y = b_0 + e
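The small numerical sketch of the logistic growth model referred to in the list above; parameter values are illustrative only:

```python
import numpy as np

def logistic_growth(x, a, b, k):
    """Logistic growth curve y = a / (1 + b*exp(-k*x)); a is the limiting growth."""
    return a / (1.0 + b * np.exp(-k * x))

x = np.linspace(0, 10, 6)
y = logistic_growth(x, a=100.0, b=9.0, k=1.0)
# y starts at a/(1+b) = 10 and approaches the limiting growth a = 100 as x grows
print(np.round(y, 1))
```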
b regression parameter [REGR] : regression coefficient

b regression surface [REGR] → regression curve

b regressor [PREP] → variable

b regular simplex [OPTIM] → simplex

b regularized discriminant analysis (RDA) [CLAS] Parametric classification method, like SIMCA and DASCO, that is an extension of quadratic discriminant analysis. It seeks a biased estimate of the class covariance matrices in order to reduce their variance. RDA has two biasing schemes: the class covariance matrices are pulled towards a common covariance matrix:
S_g(λ) = (1 - λ) S_g + λ S

and shrunk towards a multiple (the average of the eigenvalues) of the identity matrix I:

S_g(λ, γ) = (1 - γ) S_g(λ) + γ (tr[S_g(λ)] / p) I
The first biasing is controlled by the parameter λ and the second is regulated by the shrinkage parameter γ. Both range between 0 and 1; their values are chosen based on cross-validated misclassification risk. If λ = 0 and γ = 0, RDA is equal to quadratic discriminant analysis, whereas if λ = 1 and γ = 0, RDA is equal to linear discriminant analysis. The case of λ = 1 and γ = 1 corresponds to nearest
means classification and the case of λ = 0 and γ = 1 gives weighted nearest means classification.
[Diagram: RDA models in the (λ, γ) plane, spanning QDA, LDA, nearest means classification (NMC) and weighted nearest means classification (WNMC), with ridge-like intermediate models in between.]
Holding γ = 0 and varying λ produces models lying between QDA and LDA; holding λ = 0 and increasing γ, RDA attempts to unbias the sample-based eigenvalue estimates; holding λ = 1 and increasing γ gives rise to a ridge regression analog for LDA.
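A minimal sketch of the two biasing steps described above applied to one class covariance matrix; the pooled matrix used here is a stand-in and all names are illustrative:

```python
import numpy as np

def rda_covariance(S_g, S_pooled, lam, gamma):
    """Regularized class covariance: blend with the pooled matrix, then shrink
    towards a multiple (the average eigenvalue) of the identity."""
    S_lam = (1 - lam) * S_g + lam * S_pooled
    p = S_lam.shape[0]
    return (1 - gamma) * S_lam + gamma * (np.trace(S_lam) / p) * np.eye(p)

rng = np.random.default_rng(8)
X_g = rng.normal(size=(15, 4))                       # a small class: unstable covariance
S_g = np.cov(X_g, rowvar=False)
S_pooled = np.eye(4)                                  # stand-in for the pooled covariance
S_reg = rda_covariance(S_g, S_pooled, lam=0.5, gamma=0.3)
```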
b rejectable quality level [QUAL] → producer's risk

b rejection line [QUAL] → lot

b rejection number [QUAL] → lot

b rejection region [TEST] → hypothesis testing

b relative efficiency [ESTIM] → efficiency

b relative frequency [DESC] → frequency
b relative standard deviation [DESC] → dispersion

b reliability function [PROB] → random variable

b reliability score [CLAS] → classification
b relocation algorithm [CLUS] Basic algorithm of non-hierarchical optimization clustering, consisting of the following steps:
1. Set an initial partition of n objects into G clusters.
2. Select a criterion to optimize.
3. Calculate the centroid (centrotype) of each cluster.
For i = 1, n
4. Assign object i to another cluster if that improves the optimization criterion.
5. If object i was relocated, recalculate the centroids (centrotypes) of its old and new clusters.
End
6. If no relocation occurred, stop; otherwise go to step 3.
A variation of the above algorithm omits step 5, i.e. the cluster centroids are recalculated only after a full cycle.
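A compact sketch of the simpler variant above (centroids recalculated only after a full cycle), with the within-cluster sum of squares as the criterion; all names and the test data are illustrative:

```python
import numpy as np

def relocation_clustering(X, G, n_cycles=20, seed=0):
    """Relocation clustering: reassign each object to its nearest centroid per cycle."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, G, size=len(X))            # step 1: initial partition
    for _ in range(n_cycles):
        centroids = np.array([X[labels == g].mean(axis=0) for g in range(G)])   # step 3
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)                   # step 4: relocate objects
        if np.array_equal(new_labels, labels):           # step 6: stop when stable
            break
        labels = new_labels
    return labels

X = np.vstack([np.random.default_rng(9).normal(0, 1, (20, 2)),
               np.random.default_rng(10).normal(5, 1, (20, 2))])
print(relocation_clustering(X, G=2))
```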
b renewal process [TIME] → stochastic process (counting process)
b replication [EXDE] Repetition of runs with the same treatment. Replicates are identical rows in the design matrix. They are assumed to be independent observations. Replication makes it possible to estimate the experimental error (precision) and to obtain a more precise estimate of the effect of a factor.
b reproduced correlation matrix [FACT] → factor loading

b resampling [MODEL] → model validation
b residual [REGR] Quantity remaining after some other quantity has been subtracted; for example the part of a variable unexplained by a statistical model. In regression, the part of the
response variable not described by the regression model. This part is due to random variation or model misspecification, as opposed to the systematic part described by the model. The residual is calculated as the difference between the observed and the predicted value of the response variable: e = y - ŷ. In ordinary least squares the residuals are assumed to be uncorrelated and to follow a normal distribution with zero mean and equal variance. Residuals, investigated in residual analysis, play an important part in regression diagnostics. A list of various residuals follows.
adjusted residual Residual adjusted for the effect of the predictor values, assuring equal variance:

e_i^adj = e_i / √(1 - h_ii)      i = 1, n

where h_ii is the ith diagonal element of the hat matrix.
Anscombe residual Transformed residual, informative in the case of a response with nonnormal distribution:

e_i^A = f(y_i) - f(ŷ_i)      i = 1, n
where f is a function that transforms y into a normally distributed variable.
cross-validated residual (: deletion residual, predictive residual) Residual of a model fitted to data with the ith observation excluded:

e_i\i = y_i - ŷ_i\i      i = 1, n

where ŷ_i\i denotes the predicted response value of the ith observation from a model calculated without the ith observation. Measures of goodness of prediction are calculated on the basis of this residual. In the case of linear estimators (e.g. OLS or ridge regression) the cross-validated residuals can be calculated from the ordinary residuals as:

e_i\i = e_i / (1 - h_ii)

where h_ii is the ith diagonal element of the hat matrix. In the case of nonlinear estimators (e.g. PLS) the cross-validated residuals must be calculated by the leave-one-out technique.
deletion residual : cross-validated residual

externally Studentized residual : Studentized residual

internally Studentized residual : standardized residual

jackknifed residual : Studentized residual
ordinary residual Residual of a model fitted to all the observations:

e_i = y_i - ŷ_i      i = 1, n

Measures of goodness of fit are calculated on the basis of this residual.

predictive residual : cross-validated residual

recursive residual Residual of a time series model in which b_{i-1} is calculated using only the first i - 1 observations:

e_i = y_i - x_i^T b_{i-1}

standardized residual (: internally Studentized residual) Residual of a model fitted to all the observations, scaled by its standard error estimated from all the observations:

r_i = e_i / (s √(1 - h_ii))      i = 1, n

with

s² = Σ_i e_i² / (n - p)

This residual is scale invariant and has a t-like distribution. This scaling assures homoscedasticity and unit variance.

Studentized residual (: externally Studentized residual, jackknifed residual) Residual standardized so as to have the same precision along the observations. This standardization eliminates the effect of the location of the objects:

t_i = e_i / (s_\i √(1 - h_ii))      i = 1, n
where s_\i is the standard error estimated without the ith observation. It can be calculated as

s_\i = √[ ((n - p) s² - e_i² / (1 - h_ii)) / (n - p - 1) ]
This residual has zero mean and unit variance, it is scale invariant, and is more appropriate for detecting violations of assumptions in a regression model.
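A minimal sketch computing the main residual types listed above for an OLS model via the hat matrix; the simulated data and helper name are illustrative:

```python
import numpy as np

def residual_diagnostics(X, y):
    """Ordinary, standardized, Studentized and cross-validated residuals for OLS."""
    X1 = np.column_stack([np.ones(len(X)), X])
    H = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T          # hat matrix
    h = np.diag(H)
    e = y - H @ y                                      # ordinary residuals
    n, p = X1.shape
    s2 = (e ** 2).sum() / (n - p)
    r = e / np.sqrt(s2 * (1 - h))                      # standardized (internally Studentized)
    s2_del = ((n - p) * s2 - e ** 2 / (1 - h)) / (n - p - 1)
    t = e / np.sqrt(s2_del * (1 - h))                  # Studentized (externally)
    e_cv = e / (1 - h)                                 # cross-validated (deletion) residuals
    return e, r, t, e_cv

rng = np.random.default_rng(11)
X = rng.normal(size=(30, 2))
y = 1 + X @ np.array([1.5, -2.0]) + rng.normal(scale=0.3, size=30)
e, r, t, e_cv = residual_diagnostics(X, y)
```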
b residual analysis [REGR] Part of regression diagnostics, based on examining residuals from a regression model via numerical and/or graphical techniques. The goal is to infer any incorrect assumptions concerning the model error (e.g. homoscedasticity or the assumption of normality). The most popular plots are residual scatter plots and residual quantile plots.
b residual correlation matrix [FACT] → factor loading

b residual degrees of freedom [MODEL] → model degrees of freedom

b residual mean square (RMS) [MODEL] → goodness of fit

b residual plot [GRAPH] → scatter plot

b residual standard deviation (RSD) [MODEL] → goodness of fit

b residual standard error (RSE) [MODEL] → goodness of fit

b residual sum of squares (RSS) [MODEL] → goodness of fit

b residual variance [MODEL] → goodness of fit

b resistant estimator [ESTIM] → estimator

b resolution [EXDE] → confounding

b response curve [GRAPH] → scatter plot
b response surface [EXDE] Mathematical function that describes the response as a function of the factors. It can be visualized and plotted only if there are no more than two factors. Response surfaces are often described by polynomial approximations.
If only terms up to degree one are included (i.e. only main effects), the function is called a first-order response surface, which is a hyperplane. The corresponding design is a first-order design. A second-order response surface also includes interactions and squared factors. This surface is curved possibly with minima, maxima, ridge, and saddle points. The corresponding design is called a second-order design.
Response surface methodology (RSM) is a collection of statistical techniques that, by design and analysis of an experiment, seeks to relate the response (output) of a system to factors (input) that have an effect on the response. RSM is used for predicting responses at given factor levels, choosing factor levels to obtain a specified response, and finding factor levels to obtain the optimal response.
b response surface methodology (RSM) [EXDE] → response surface

b response variable [PREP] → variable

b restricted model [ANOVA] → analysis of variance

b resubstitution [MODEL] → model validation

b Richards growth model [REGR] → regression model
b ridge regression (RR) [REGR] Biased regression method based on the assumption that large regression coefficients are likely to be spurious; therefore it shrinks them toward zero by adding a small constant γ to each diagonal element of the correlation matrix. The constant γ is called the shrinkage parameter. Increasing γ increases the bias in the regression coefficient estimates but it also decreases their variance. The selection of the optimal γ on the basis of plotting the regression coefficients as a function of γ is called the ridge trace. The value of γ is increased until stability is indicated in all regression coefficients. Stability does not mean convergence, but rather a lack of rapid change of the coefficients as a function of γ. A better strategy for estimating the optimal γ is based on a goodness of prediction criterion. The regression coefficient estimates are
b̂ = (X^T X + γ I)^{-1} X^T y

They are calculated by minimizing

min_b [ Σ_i (y_i - x_i^T b)² + γ b^T b ]

or equivalently by maximizing

max_b [ corr²(y, X b) var(X b) / (var(X b) + γ) ]      subject to b^T b = 1
The bias and variance of the estimated regression coefficients are:

B²(b̂) = γ² β^T (X^T X + γ I)^{-2} β

V(b̂) = s² tr[(X^T X + γ I)^{-1} X^T X (X^T X + γ I)^{-1}] = s² Σ_j λ_j / (λ_j + γ)²
where β is the true regression coefficient vector, s is the error standard deviation and λ_j are the eigenvalues of the correlation matrix. As the ridge estimator is a linear estimator, the cross-validated residuals can easily be calculated as

e_i\i = e_i / (1 - h_ii(γ))
where h_ii(γ) is a diagonal element of the ridge hat matrix

H(γ) = X (X^T X + γ I)^{-1} X^T

with X having centered predictor variables. The degrees of freedom of a ridge regression model is given by the trace of H(γ).
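A compact sketch of the estimator, hat matrix and cross-validated residuals described above, on centred data; the simulated data and function name are illustrative:

```python
import numpy as np

def ridge_fit(X, y, gamma):
    """Ridge estimate, cross-validated residuals and effective degrees of freedom."""
    Xc = X - X.mean(axis=0)                     # centred predictors
    yc = y - y.mean()
    p = Xc.shape[1]
    A_inv = np.linalg.inv(Xc.T @ Xc + gamma * np.eye(p))
    b = A_inv @ Xc.T @ yc
    H = Xc @ A_inv @ Xc.T                        # ridge hat matrix H(gamma)
    e = yc - H @ yc
    e_cv = e / (1 - np.diag(H))                  # leave-one-out residuals for a linear estimator
    dof = np.trace(H)                            # effective degrees of freedom
    return b, e_cv, dof

rng = np.random.default_rng(12)
X = rng.normal(size=(40, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -1.0]) + 0.2 * rng.normal(size=40)
b, e_cv, dof = ridge_fit(X, y, gamma=1.0)
print((e_cv ** 2).mean())                        # a goodness of prediction criterion for choosing gamma
```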
b ridge trace [REGR] → ridge regression

b risk function [MISC] → decision theory
b robust estimator [ESTIM] → estimator

b robust locally weighted regression [REGR] → robust regression

b robust regression [REGR] Regression method that is insensitive to small deviations from the distributional assumptions. It downweights the influence of observations with large residuals, hence safeguarding against outliers in the response. The least squares estimator can be made robust by iteratively reweighting the residuals in inverse proportion to their magnitude. For example, the biweight can be applied in an iterative procedure in which the weights are calculated as a function of the residuals. The indicator function offers another solution to mitigate the effect of outliers. Robust regression estimators can also be defined by minimizing functions other than the sum of squares of the residuals. The following robust methods are the most popular.
bounded influence regression (: GM estimator) Robust regression that limits the influence of outliers in a regression model by applying some weighting function. The derivative of the minimization criterion is

Σ_i w(x_i) ψ(e_i / s) x_i = 0

where s is an estimate of the error standard deviation, w denotes some weight function of the predictor vectors and ψ is an indicator function. Unfortunately the breakdown point of bounded influence regression decreases with an increasing number of predictor variables.
GM estimator : bounded influence regression

iteratively reweighted least squares regression (IRWLS) Robust regression that estimates the regression coefficients by minimizing the criterion:

min_b [ Σ_i w_i r_i² ]

where w_i weights the squared standardized residuals r_i² according to their magnitude. The weights w_i are calculated simultaneously with the estimate of the error standard deviation in an iterative fashion.
L1 regression : least absolute residual regression

least absolute residual regression (: L1 regression) Robust regression that estimates the regression coefficients by minimizing the sum of absolute residuals, not the sum of squared residuals:

min_b [ Σ_i |e_i| ]

Although this method is thought to be more robust than the least squares method, its breakdown point is still no better than 0%.
least median squares regression (LMS) Robust regression that estimates the regression coefficients by minimizing the median instead of the sum of the squared residuals:

min_b [ med_i e_i² ]
This method is very robust with respect to outliers in the response; its breakdown point is 50%.
least trimmed squares regression (LTS) Robust regression that estimates the regression coefficients by minimizing the criterion:

min_b [ Σ_{i=1,n-m} (e²)_{i:n} ]

where (e²)_{i:n} are the ordered squared residuals. In the summation the m largest squared residuals are omitted. When m is around n/2 the breakdown point of this estimator is 50%.
locally weighted scatter plot smoother (LOWESS) : robust locally weighted regression

robust locally weighted regression (: locally weighted scatter plot smoother) Robust regression in which, at each observation x_i, a weight vector w_k(x_i) is calculated and a weighted least squares criterion is minimized. This weight vector is centered at x_i and scaled such that it becomes zero at points further away than a specified nearest neighbor. The size of the neighborhood, i.e. the fraction of the observations with nonzero weights, is a parameter to be optimized. After the initial model and residuals have been calculated, the weight vectors are modified on the basis of the size of the residuals: observations with large residuals receive small weights and observations with small residuals receive large weights. This correction is performed by using the biweight.
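A minimal sketch of the iteratively reweighted least squares idea from the list above, using Tukey's biweight for the weights; the tuning constant, scale estimate and names are illustrative choices:

```python
import numpy as np

def irwls_biweight(X, y, n_iter=20, c=4.685):
    """Robust fit: reweight observations by the biweight of their scaled residuals."""
    X1 = np.column_stack([np.ones(len(X)), X])
    w = np.ones(len(y))
    for _ in range(n_iter):
        W = np.diag(w)
        b = np.linalg.solve(X1.T @ W @ X1, X1.T @ W @ y)    # weighted least squares
        e = y - X1 @ b
        s = np.median(np.abs(e)) / 0.6745                    # robust scale estimate
        u = e / (c * s)
        w = np.where(np.abs(u) < 1, (1 - u ** 2) ** 2, 0.0)  # biweight: gross outliers get weight 0
    return b

rng = np.random.default_rng(13)
X = rng.normal(size=(50, 1))
y = 2.0 + 3.0 * X[:, 0] + 0.2 * rng.normal(size=50)
y[:5] += 10.0                                                # a few response outliers
print(irwls_biweight(X, y))                                   # close to [2, 3] despite the outliers
```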
b robustness [ESTIM] → estimator (robust estimator)

b Rogers-Tanimoto coefficient [GEOM] → distance (binary data)

b root mean square (RMS) [DESC] → dispersion

b root mean square deviation (RMSD) [DESC] → dispersion

b root mean square error (RMSE) [MODEL] → goodness of fit

b rootogram [GRAPH] → histogram

b Roquemore design [EXDE] → design

b Rosenbaum's test [TEST] → hypothesis test

b rotatable design [EXDE] → design

b rotation [FACT] : factor rotation
b rotation matrix [ALGE] (: transformation matrix) An orthogonal matrix M that brings a matrix X into another matrix X', preserving the length of its vectors:

X' = X M

Due to the orthogonality of M, the following properties hold:

M^T = M^{-1}      M^T M = I      |M| = ±1
b round-off error [OPTIM] → numerical error

b row vector [ALGE] → vector

b Roy's greatest root test [FACT] → rank analysis

b run [EXDE] → design matrix

b Russel-Rao coefficient [GEOM] → distance (binary data)

S

b S-chart [QUAL] → control chart (variable control chart)

b S statistic [MODEL] → goodness of fit

b sample [PROB] → population

b sample reuse [MODEL] → model validation

b sample size [PROB] → population

b sampling [PROB] The process of drawing a subset, called a sample, from a population in order to estimate the properties of the population. Mathematically, simulation performs the sample drawing process. There are several sampling strategies to choose from:
cluster sampling (: nested sampling) Sampling from selected, restricted parts of the population. Examples are subsampling and two-stage sampling where essentially more than one sample is collected from selected parts of the population.
Monte Carlo sampling Sampling on the basis of mathematical experiments, involving random numbers, in which mathematical constraints (e.g. distribution parameters) are imposed.

nested sampling : cluster sampling

random sampling Sampling by dividing the population into equal parts and selecting from them using a random procedure. Such a process ensures that an unbiased sample is obtained.

stratified sampling Sampling from a population that is heterogeneous but composed of k clearly distinguishable homogeneous sub-populations (strata) with known relative frequencies. In such a case, k individuals are selected, one from each stratum.

systematic sampling Sampling at regular intervals. It is appropriate only for a homogeneous population. On the other hand, systematic sampling results in biased estimates if the investigated property changes regularly with the sampling interval.
b sampling plan [QUAL] → acceptance sampling

b saturated design [EXDE] → design

b saturated model [MODEL] → model (nested model)

b scalar [ALGE] Quantity that, in contrast to a vector, has only magnitude, but no direction. A scalar has the same value in each coordinate system, i.e. a scalar is scale invariant.

b scalar multiplication of a matrix [ALGE] → matrix operation

b scale [DESC] : dispersion

b scale [PREP] Variables can be characterized according to the scale on which they are defined. A qualitative variable is measured on a nominal or ordinal scale, also called a non-metric scale, while a quantitative variable is defined on a proportional or ratio
scale, also called a metric scale. The scale of measurement greatly affects the choice of model and estimator.
arithmetic scale : ratio scale

interval scale : proportional scale

nominal scale The mathematically weakest scale for qualitative variables, where the coded information is only a name (category or label) without any order relation. The number of possible categories of a nominal variable is usually finite, although it is possible to have a countably infinite number. On a nominal scale the only admitted operations among the categories are = and ≠. For example, color, taste, or chemical categories are measured on a nominal scale. When a variable has only two categories (presence/absence of a property, male/female, no/yes) it is called a binary variable and is usually coded as 0/1, -/+ or 1/2. Nominal variables with several categories are often coded in several binary variables, each corresponding to a category.

ordinal scale Stronger than a nominal scale for qualitative variables, where the categories can be arranged in order. The number of possible categories of an ordinal variable is usually finite, although it is possible to have a countably infinite number. On an ordinal scale total ordering (<, >) or partial ordering (≤, ≥) operations are defined among the categories. Thus, an ordinal variable indicates an ordering or ranking of measurements, with relative rather than quantitative differences. Variables that are originally on a proportional or ratio scale can be reduced to ranks measured on an ordinal scale.

proportional scale (: interval scale) Scale for quantitative variables; discrete or continuous. The starting point of the scale is not well-defined, so only the difference between two values is meaningful, not the ratio. Besides the operations =, ≠, <, >, ≤, ≥ of the weaker ordinal scale, - and + are also defined. For example, temperature is measured on a proportional scale.

ratio scale (: arithmetic scale) Stronger than a proportional scale for quantitative variables, where the starting point of the scale is unambiguously defined. All operations are exactly defined; both the difference and the ratio between two values are meaningful. For example, length, weight, or volume are measured on a ratio scale. A ratio scale usually has a unit associated with it. Unitless ratio scales are the frequency count scale and the percentage scale. The former measures counts, the latter percentages of the total.
► scale invariance [ESTIM]
Characteristic of an estimator or a model stating that it does not change with a change in the scale of the data. Examples are: ordinary least squares regression, discriminant analysis. Many estimators, however, result in different estimates depending on the scale of the data, i.e. they are scale variant. Examples are: PCR, PLS, SIMCA, RDA, KNN, and most of the cluster analysis models.
► scaling [PREP] → standardization

► scatter [DESC] : dispersion

► scatter diagram [GRAPH] : scatter plot

► scatter matrix [DESC] (: dispersion matrix)
Square matrix T that describes the dispersion of multivariate data around the mean. Its elements are:
$t_{jk} = \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$

In the case of centered variables the scatter matrix equals the information matrix, defined as $X^T X$. The scatter matrix is related to the covariance matrix S as:

$S = T/n$ or $S = T/(n-1)$

the latter being for unbiased estimates. The scatter matrix T can be decomposed into the within-group scatter matrix W and the between-group scatter matrix B:

$T = W + B$

Several multivariate methods are based on optimizing or testing such a decomposition, e.g. MANOVA, discriminant analysis, non-hierarchical cluster analysis.

► scatter plot [GRAPH] (: scatter diagram, plot)
Cartesian plot in which the coordinates are either original variables or quantities (statistics) derived from the data. In contrast to the quantile-quantile plot, the
variables are paired, i.e. the corresponding values are measured on the same object. The scatter plot is an efficient way to represent the relationship between two or three variables. Two-dimensional scatter plots are the most common. Additional variables can be represented on a static two-dimensional plot by adding color, size or shape to the data points. A more efficient way to create a three-dimensional scatter plot is by using interactive computer graphics.
biplot Graphical display of a data matrix by means of markers for its rows and for its columns such that inner products of these markers represent the corresponding matrix elements. The most popular biplot is the principal component biplot, in which the row markers are principal component scores and the column markers are the eigenvectors scaled with the corresponding eigenvalues.
[Figure: principal component biplot; objects (scores) and variables (scaled eigenvectors) displayed together in the PC1-PC2 plane.]
It is a joint display of rows and columns, in contrast to most other scatter plots, which display only rows or only columns of a data matrix. The distances of the column points from the origin are roughly proportional to the standard deviations of the variables. Cosines of angles between two vectors drawn from the origin to two column points represent the correlation between the two variables. Distances between row points represent their dissimilarities. While distance between a row and a column point (i.e. between an object and a variable) is not directly interpretable, the relative positions of objects with respect to variables, and vice versa, are the essence of the biplot.
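A minimal sketch of a principal component biplot with NumPy and Matplotlib, in the spirit of the description above (row markers are scores, column markers are scaled loading vectors); the toy data matrix is invented, and the scaling by singular values is one of several common conventions.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))           # toy data matrix (objects x variables)
Xc = X - X.mean(axis=0)                # column centering

# singular value decomposition of the centered matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s                         # row markers (principal component scores)
loadings = Vt.T * s                    # column markers, scaled by the singular values

fig, ax = plt.subplots()
ax.scatter(scores[:, 0], scores[:, 1], label="objects (rows)")
for j in range(loadings.shape[0]):
    ax.arrow(0, 0, loadings[j, 0], loadings[j, 1], head_width=0.05)
    ax.text(loadings[j, 0], loadings[j, 1], f"x{j + 1}")
ax.set_xlabel("PC1"); ax.set_ylabel("PC2")
plt.show()
```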
contour plot (: density map) Two-dimensional graphical representation of a response surface, i.e. the projection of a response variable onto a plane of the predictor variables or factors, by means of isoresponse curves.
[Figure: contour plot; isoresponse curves drawn in the plane of the two factors x1 and x2.]
The levels of the two factors or of two linear combinations of factors are indicated on the horizontal and vertical axes, whereas the values of the response are indicated by contour lines drawn in the plane of the two factors. These contour lines are projections of the outlines of cross-sections of the response surface parallel to the factor plane.

Coomans' plot
Graphical display of the goodness of class separation, often used in SIMCA. Residuals of objects fitted to one class are plotted against residuals of the same objects fitted to another class. On the two-dimensional plot the separability of the classes can be assessed only pairwise.
[Figure: Coomans' plot. Residuals from the model of class B are plotted against residuals from the model of class A; the plot regions correspond to objects classified as A, as B, as both A and B, or as neither A nor B.]
density map : contour plot

discriminant plot
Two- (or three-) dimensional scatter plot of discriminant scores, usually of the first two (or three) linear discriminant functions. This plot gives information about class separability.
latent variable plot
Two-dimensional scatter plot in which the predictor latent variable is plotted against the response latent variable, calculated in the same component. This plot is similar to the principal component plot, except that the axes are not simply eigenvectors of a matrix but eigenvectors of complex aggregates of the predictor and the response matrices. This plot serves as a diagnostic tool in PLS modeling: it reveals outlier and high leverage observations, and any nonlinear relationship between predictors and responses.

loading plot
Display of variables on a two- (or three-) dimensional scatter plot as their projections on new axes calculated by a linear combination of the original variables. The most common loading plot projects the variables on the principal component axes.
[Figure: loading plot of variables x1-x7 in the PC1-PC2 plane; the coordinate of variable 5 along PC1 is its loading l51. Variables far from the origin along an axis have a high loading in that component; variables near the origin are unimportant in either component.]
map
Two- (or three-) dimensional scatter plot of multidimensional objects in which the coordinates are nonlinear combinations of the original variables. The two- (or three-) dimensional configuration is calculated by some mapping technique, like multidimensional scaling, nonlinear mapping, or projection pursuit.

principal component plot (: score plot)
Scatter plot of the scores (projections) of the observations, usually on the first and second principal component axes. This plot is a linear projection of the observations onto a two-dimensional subspace that conserves most of the variance.
[Figure: principal component plot (PC1 vs. PC2) of observations from three object groups, marked by different symbols.]
If the first two components explain a high percentage of the variance, then the distribution of the observations in the two dimensions is a fair approximation of the distribution in the original measurement space. This plot is one of the most popular graphical tools in exploratory data analysis. Clusters, outliers and trends can be discovered by visualizing the multivariate distribution.

residual plot
Two-dimensional scatter plot of residuals from a regression model. This plot plays an important role in residual analysis. It is used to verify the normality and homoscedasticity assumptions on the residuals, to identify outliers and to check the linearity of the relationship. The ideal plot shows a horizontal band of points with constant vertical scatter from left to right.
[Figure: residual plot; residuals plotted against observation ID form a horizontal band.]
Depending on which residual is assigned to the vertical axis (e ordinary, e_cv cross-validated, r standardized, t Studentized) and what is plotted on the horizontal
axis (x predictor, y response, ŷ estimated response, h hat diagonal, ID observation id), one can obtain several different residual plots:
○ McCulloh-Meeter plot: ln[h/n(1 − h)] vs. ln[e²]
○ ordinary residual plot: ID vs. e, x vs. e, y vs. e, ŷ vs. e
○ predictive residual plot: ID vs. e_cv, y vs. e_cv, e vs. e_cv
○ Pregibon plot: h vs. e²
○ standardized residual plot: ID vs. r, y vs. r
○ Studentized residual plot: ID vs. t, y vs. t
○ Williams plot: h vs. e_cv
response curve
System response plotted as a function of one factor. It is a one-dimensional response surface.

score plot : principal component plot

time series plot
Graphical representation of a time series: x(t) plotted against the corresponding time intervals t. The successive points are often connected for better visualization.
► scatter plot matrix [GRAPH] : draftsman's plot

► Scheffé's test [TEST] → hypothesis test

► Schrader clustering [CLUS] → non-hierarchical clustering (○ optimization clustering)

► score [FACT] → common factor

► score plot [GRAPH] → scatter plot

► scree plot [GRAPH] → rank analysis (○ scree test)

► scree test [FACT] → rank analysis
► second kind model [ANOVA] → term in ANOVA

► second-order design [EXDE] → design

► second-order regression model [REGR] → regression model

► seed point [CLUS] → cluster

► semi-interquartile range [DESC] → dispersion

► sensitivity curve (SC) [ESTIM] → influence curve

► sensitivity of classification [CLAS] → classification

► sequential design [EXDE] → design
► sequential sampling [QUAL] → acceptance sampling

► sequential variable selection [REGR] → variable subset selection

► Shannon entropy [MISC] → information theory

► Shannon-Wiener index [MISC] → information theory

► shape difference coefficient [GEOM] → distance (○ quantitative data)

► Shapiro-Wilk's test [TEST] → hypothesis test

► sharpness of classification [CLAS] → classification

► Shewhart chart [QUAL] → control chart

► shot-noise process [TIME] → stochastic process

► shrinkage parameter [REGR] → ridge regression

► Siegel-Tukey's test [TEST] → hypothesis test

► sign test [TEST] → hypothesis testing

► significance level [TEST] → hypothesis testing
► similarity index [GEOM]
Measure associated with a pair of objects that depicts the extent to which the two objects are similar. The more similar the two objects are, the larger the similarity index. A similarity index $s_{st}$ calculated for objects s and t has the following properties:
$0 \le s_{st} \le 1 \qquad s_{ss} = 1 \qquad s_{st} = s_{ts}$
Pairwise similarity indices calculated from a data matrix X(n, p) can be arranged in a symmetric matrix S(n, n), called the similarity matrix. Rows and columns represent objects; the diagonal elements $s_{ss}$ are equal to one. A similarity index $s_{st}$ can be calculated from the dissimilarity index $d_{st}$ as:

$s_{st} = 1/(1 + d_{st})$

or

$s_{st} = 1 - d_{st}/d_{max}$

Pairwise dissimilarity indices calculated from a data matrix X(n, p) can be arranged in a symmetric matrix D(n, n), called the dissimilarity matrix. Rows and columns represent objects; the diagonal elements $d_{ss}$ are equal to zero. The most often used dissimilarity indices are distances.
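A small NumPy sketch of the second conversion above, turning a Euclidean distance matrix into a similarity matrix by scaling with the largest observed distance; the toy data are invented.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 3))                # toy data matrix X(n, p)

# pairwise Euclidean distances (dissimilarity matrix D)
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))

S = 1.0 - D / D.max()                      # similarity matrix: s_st = 1 - d_st / d_max

assert np.allclose(np.diag(S), 1.0)        # s_ss = 1
assert np.allclose(S, S.T)                 # s_st = s_ts
print(S.round(2))
```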
► similarity matrix [GEOM] → similarity index

► similarity transformation [ALGE]
Multiplication by a nonsingular matrix Z such that for two matrices X and Y the following equations hold:

$XZ = ZY \qquad Y = Z^{-1} X Z$

The matrices X and Y are called similar and their eigenvalues are the same.

► simple matching coefficient [GEOM] → distance (○ binary data)

► simplex [OPTIM]
Geometric figure formed by a set of p + 1 points in p-dimensional space. In two dimensions the simplex is a triangle, in three dimensions it is a tetrahedron, etc. When the points are equidistant the simplex is called a regular simplex. A simplex is used, for example, in simplex optimization to find the minimum or maximum of a function.
► simplex centroid design [EXDE] → design

► simplex lattice design [EXDE] → design

► simplex optimization [OPTIM]
Direct search optimization based on comparing the values of a function at the p + 1 vertices of a simplex and moving the simplex towards the optimum during an iterative procedure. The technique as originally proposed uses a regular simplex; however, allowing the simplex to be nonregular (points not equidistant) increases the power and efficiency of the optimization. The simplex is moved toward the optimum via three basic operations: reflection, contraction, and expansion. The process starts with a set of p + 1 initial parameter vectors and evaluates the function at each vertex. In minimization the simplex is moved away from the vertex with the largest function value ($p_C$) by reflecting this vertex in the opposite face of the simplex. The new reflected vertex $p_R$ lies on the line joining the centroid of all other points $p_P$ and the vertex to be eliminated $p_C$:
$p_R = p_P + \alpha\,(p_P - p_C)$

where $\alpha$ is the reflection coefficient. Expansion ($p_{R+}$) and contraction ($p_{R-}$) help to move the simplex in a valley, where the function value could be the same at the old and the new vertices.
The convergence criterion generally used in simplex optimization requires the standard deviation of the function values at the p + 1 vertices to be less than a prespecified small value.
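A bare-bones sketch of the reflection step and convergence criterion described above, for minimizing a function of p parameters; expansion is omitted and a crude contraction is used instead, and the test function and starting simplex are invented for the example.

```python
import numpy as np

def simplex_minimize(f, simplex, alpha=1.0, tol=1e-8, max_iter=500):
    """Move a (p+1)-vertex simplex towards a minimum of f by repeated reflection."""
    simplex = np.asarray(simplex, dtype=float)
    for _ in range(max_iter):
        values = np.array([f(v) for v in simplex])
        if values.std() < tol:                    # convergence criterion from the text
            break
        worst = values.argmax()                   # vertex p_C with the largest value
        centroid = simplex[np.arange(len(simplex)) != worst].mean(axis=0)   # p_P
        reflected = centroid + alpha * (centroid - simplex[worst])          # p_R
        if f(reflected) < values[worst]:
            simplex[worst] = reflected
        else:                                     # crude contraction towards the centroid
            simplex[worst] = 0.5 * (simplex[worst] + centroid)
    return simplex[values.argmin()]

f = lambda x: (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2    # toy quadratic, minimum at (1, -2)
start = np.array([[0.0, 0.0], [1.5, 0.0], [0.0, 1.5]]) # p + 1 = 3 vertices for p = 2
print(simplex_minimize(f, start))
```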
► simulated annealing (SA) [OPTIM]
Direct search optimization searching for the most probable configuration of parameters on the basis of simulating the evolution of a substance to thermal equilibrium. The distribution of configurations s is expressed by the Boltzmann distribution:

$P(s) = \exp[-C(s)/c] \Big/ \sum_{w} \exp[-C(w)/c]$

where C(s) is the function to be minimized, s and w are configurations and c is a control parameter. Starting with an initial configuration s of the parameters, another configuration r in the neighborhood of s is produced by modifying one randomly selected parameter. The new configuration is accepted with probability 1 if $\Delta C(r, s) \le 0$, otherwise with probability

$P = \exp[-\Delta C(r, s)/c]$

This probability is compared to a random number from the uniform distribution [0, 1] and the new configuration is accepted if P is larger than the random number. The iteration continues until convergence is reached. Then the control parameter c is lowered and the optimization continues with the new parameter. Generalized simulated annealing (GSA) and variable step size generalized simulated annealing (VSGSA) are improvements on SA.

► simulation [PROB] (: Monte Carlo simulation)
Technique for imitating the random process of drawing samples from a predefined population in order to obtain estimates of the population parameters. Given a mathematical formula that cannot easily be evaluated by analytical reduction or by standard procedures of numerical analysis, it is often possible to find a stochastic process for generating statistical variables with frequency distributions that can be related to the mathematical formula. The simulation actually generates a sample, determines its empirical distribution and uses it in a numerical evaluation of the formula. A random series is a series of numbers drawn randomly from a distribution; it has an important role in simulation studies. Simulation is used to evaluate the behavior of a statistical method, to compare several similar statistical methods, or to solve mathematical problems arising from stochastic processes. The advantage of using simulation instead of real data sets is that in the former case the distribution of the underlying population is known.
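A short Monte Carlo sketch in the spirit of the simulation entry: samples are drawn repeatedly from a known population to study the sampling behaviour of a statistic (here a trimmed mean); the population, sample size, and number of replicates are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def trimmed_mean(x, frac=0.1):
    """Mean after discarding the lowest and highest `frac` of the values."""
    x = np.sort(x)
    k = int(len(x) * frac)
    return x[k:len(x) - k].mean()

# known population: a long-tailed (lognormal) distribution
n, n_replicates = 50, 2000
estimates = np.array([
    trimmed_mean(rng.lognormal(mean=0.0, sigma=1.0, size=n))
    for _ in range(n_replicates)
])

# empirical behaviour of the estimator over repeated sampling
print("mean of estimates:", round(estimates.mean(), 3))
print("std. error of estimates:", round(estimates.std(ddof=1), 3))
```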
► single linkage [CLUS] → hierarchical clustering (○ agglomerative clustering)

► single regression model [REGR] → regression model

► single sampling [QUAL] → acceptance sampling

► single tail test [TEST] → hypothesis testing

► singular matrix [ALGE] → matrix

► singular value [ALGE] → matrix decomposition (○ singular value decomposition)

► singular value decomposition (SVD) [ALGE] → matrix decomposition

► singular vector [ALGE] → matrix decomposition (○ singular value decomposition)

► size difference coefficient [GEOM] → distance (○ quantitative data)

► size of a test [TEST] → hypothesis testing
► skewness [DESC]
Measure of asymmetry of a distribution around its location. For a symmetric distribution the mean, median and mode are equal, i.e. there is no skewness. If there is a longer tail on the right, i.e. the mean is larger than the median and the median is larger than the mode, there is positive skewness. If there is a longer tail on the left, i.e. the mean is smaller than the median and the median is smaller than the mode, there is negative skewness.
[Figure: density curves with positive skewness (longer right tail) and negative skewness (longer left tail).]
A list of the most common skewness measures follows.
Bonferroni index
A very sensitive measure, defined in terms of the deviations from the mean weighted by the frequency distribution, $s_i = f_i (x_{ij} - \bar{x}_j)$, $i = 1, n$. B = 0 if the distribution is perfectly symmetric, and approaches 1 as the distribution becomes increasingly asymmetric. This index is based on the following relationship, which holds for symmetric ranked data:

$x_i + x_{n-i+1} = \text{constant} = 2\bar{x}_j$
coefficient of skewness
Scaled difference between the arithmetic mean and the mode:

$b_{1j} = \frac{\bar{x}_j - \text{mode}[x_j]}{s_j}$

where $s_j$ is the standard deviation. In a p-dimensional space, the multivariate measure of skewness is defined as

$b_{1,p} = \frac{1}{n^2} \sum_{s} \sum_{t} d_{st}^{3} \qquad s, t = 1, n$

where $d_{st}$ is the square root of the Mahalanobis distance between objects s and t:

$d_{st} = (\mathbf{x}_s - \bar{\mathbf{x}})^T \mathbf{S}^{-1} (\mathbf{x}_t - \bar{\mathbf{x}})$

where $\mathbf{S}^{-1}$ is the inverse covariance matrix. For p = 1, $b_{1,p} = b_1$.
Pearson's first index
Scaled sum of the cubed differences from the arithmetic mean:

$\gamma_{1j} = \frac{\sum_i (x_{ij} - \bar{x}_j)^3}{n\, s_j^3}$

$\gamma_1 = 0$ for a symmetric distribution, $\gamma_1 > 0$ for a right-tailed distribution and $\gamma_1 < 0$ for a left-tailed distribution.

Pearson's second index
Scaled difference between the arithmetic mean and the median:

$\gamma_{2j} = \frac{3(\bar{x}_j - \text{median}[x_j])}{s_j}$

quartile coefficient of skewness
Measure based on the three quartiles:
$\frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1}$

where $Q_3$ is the upper quartile, $Q_1$ is the lower quartile and $Q_2$ is the median.
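A small NumPy sketch of three of the univariate measures above (moment-based skewness, Pearson's second index, and the quartile coefficient); the sample is invented, and the formulas follow the reconstructions given in this entry.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(scale=2.0, size=500)     # right-tailed sample -> positive skewness

mean, median, s = x.mean(), np.median(x), x.std(ddof=1)
q1, q2, q3 = np.percentile(x, [25, 50, 75])

moment_skewness = ((x - mean) ** 3).sum() / (len(x) * s ** 3)   # Pearson's first index
pearson_second = 3 * (mean - median) / s                        # mean-median based index
quartile_coeff = ((q3 - q2) - (q2 - q1)) / (q3 - q1)            # quartile coefficient

print(round(moment_skewness, 3), round(pearson_second, 3), round(quartile_coeff, 3))
```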
► skip-lot sampling [QUAL] → acceptance sampling

► slicing [GRAPH] → interactive computer graphics (○ animation)

► smooth multiple additive regression technique (SMART) [REGR] (○ projection pursuit regression)
Nonparametric multiple response nonlinear regression model that describes each response (usually) as a different linear combination of the predictor functions $f_m$. Each predictor function is taken as a smooth but otherwise unrestricted function (usually) of a different linear combination of the predictor variables. The model is:
$y_k = \sum_{m=1}^{M} q_{mk}\, f_m\!\left( \sum_{j} w_{mj}\, x_j \right) + e_k$

where $q_{mk}$ and $w_{mj}$ are linear coefficients for the predictor functions and for the predictor variables, respectively. The least squares solution is obtained by simultaneously estimating, in each component m = 1, M, the linear coefficients q and w and the nonlinear function f. The coefficients $q_{mk}$ are estimated by univariate least squares regression, the coefficients $w_{mj}$ by a Gauss-Newton step, and the functions $f_m$ by a smoother. The optimal number of components M is estimated by cross-validation.

► smoother [REGR]
Function estimator which calculates the conditional expectation
$f(x) = E[y \mid x]$
There are two basic kinds of smoothers: the kernel smoother and the window smoother. The kernel smoother estimates the above conditional expectation at $x_i$ by assigning weights to the points, fitting a weighted polynomial to all the points and taking the fitted response value at $x_i$. The largest weight is put at $x_i$ and the rest of the weights decrease symmetrically as the points lie further from $x_i$. A popular robust kernel smoother is the locally weighted scatter plot smoother. The window smoother can be considered as a special case of the kernel smoother in which all points within a certain interval (window) $N_i$ around $x_i$ have weight 1 and all points outside the interval have weight 0. According to the degree of the polynomial the smoother can be local averaging (degree zero), local linear fit (degree one), local quadratic fit (degree two), etc. The local averaging window smoother calculates the $\hat{f}_i$ value as the average of the y values for those points with x values in an interval $N_i$ around $x_i$:
$\hat{f}_i = \text{ave}\{\, y_j : x_j \in N_i \,\}$

Local averaging, although a commonly used technique, has some serious shortcomings. It does not reproduce a straight line if the x values are not equispaced and it behaves badly at the boundaries. The local linear fit alleviates both problems. It calculates the smooth $\hat{f}_i$ value by fitting a straight line (usually by least squares) to the points in the interval $N_i$ and taking the fitted response value at $x_i$. Higher degree polynomials can be fitted in a similar fashion.
[Figure: window-smoothed scatter plots of y against x; the first panel uses span = 0.1.]
In window smoothers the key point is how to select the size of the interval, also called the span parameter. This parameter controls the trade-off between bias and variance of the estimated smooth. Increasing the span value, i.e. the size of the interval, increases the bias and decreases the variance. A larger span value makes the smooth less wiggly. Ideally, the optimal span value should be estimated via cross-validation. In contrast to the fixed span smoother, the adaptive smoother uses a span parameter that varies over the range of x. This smoother is preferable if the error variance or the second derivative of the underlying function changes over the range of x.
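A compact sketch of the local averaging window smoother described above, with a symmetric window of fixed span; the sine-plus-noise data and span value are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 6, size=120))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

def window_smooth(x, y, span=0.5):
    """Local averaging: f_i is the mean of y over points with |x_j - x_i| <= span/2."""
    fitted = np.empty_like(y)
    for i, xi in enumerate(x):
        in_window = np.abs(x - xi) <= span / 2
        fitted[i] = y[in_window].mean()
    return fitted

smooth = window_smooth(x, y, span=0.8)
print(np.round(smooth[:5], 2))
```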
► soft independent modeling of class analogy (SIMCA) [CLAS]
Parametric classification method designed to deal with a low object-variable ratio. Each class is represented by a principal component model, usually of fewer components than the original dimensionality, and the classification rule is based on the distances of objects from these class models. These object-class distances are calculated as squared residuals:

$e_{ig}^2 = \sum_{j=1}^{p}\left( x_{ij} - c_{gj} - \sum_{m=1}^{M_g} t_{img}\, l_{mgj} \right)^{2} \qquad i = 1, n \quad g = 1, G$

where i is the object index, g the class index and m the component index; $c_g$ is the class centroid, $t_{img}$ denotes the principal component scores and $l_{mg}$ the corresponding loadings in the mth component of the gth class. The optimal number of components, M, is determined for each class separately by double cross-validation. This procedure results in principal component models which are optimal for describing the within-class similarities but not necessarily optimal for discriminating among classes.

[Figure: class models shown as boxes in the space of the original variables (x1, x2, x3).]

These class models define unbounded M-dimensional subspaces in the p-dimensional pattern space. In order to delimit the models, i.e. to create class boxes, normal ranges are defined using the class residual standard deviations $s_g$. SIMCA calculates both modeling and classification power for each variable based on the residuals. Similar to RDA and DASCO, SIMCA can be viewed as a modification of quadratic discriminant analysis in which the class covariance matrices are estimated by a truncated principal component representation.
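A rough sketch of the SIMCA idea (one principal component model per class, objects assigned by their squared residual distance from each class model); the number of components is fixed here rather than chosen by cross-validation, and the toy data are invented.

```python
import numpy as np

rng = np.random.default_rng(6)

def fit_class_model(Xg, n_components):
    """PCA model of one class: centroid plus the leading loading vectors."""
    centroid = Xg.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xg - centroid, full_matrices=False)
    return centroid, Vt[:n_components]        # loadings, one row per component

def residual_distance(x, model):
    """Squared residual of object x from a class PCA model."""
    centroid, loadings = model
    r = x - centroid
    r = r - loadings.T @ (loadings @ r)       # remove the part explained by the model
    return float(r @ r)

# two toy classes described by five variables
X_a = rng.normal(loc=0.0, size=(40, 5))
X_b = rng.normal(loc=3.0, size=(40, 5))
models = {"A": fit_class_model(X_a, 2), "B": fit_class_model(X_b, 2)}

x_new = rng.normal(loc=3.0, size=5)           # unknown object
distances = {g: residual_distance(x_new, m) for g, m in models.items()}
print(min(distances, key=distances.get), distances)
```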
► soft model [MODEL] → model
► Sokal-Michener coefficient [GEOM] → distance (○ binary data)

► Sokal-Sneath coefficient [GEOM] → distance (○ binary data)

► Sorenson coefficient [GEOM] → distance (○ binary data)

► spanning tree [MISC] → graph theory

► Spearman's ρ coefficient [DESC] → correlation

► specific factor [FACT] : unique factor

► specific variance [FACT] → factor analysis

► specification limits [QUAL] → lot

► specificity of classification [CLAS] → classification

► spectral decomposition [ALGE] → matrix decomposition
► spectral density function [TIME] (: spectrum)
Function of a stationary time series x(t), t = 1, n defined as:

$f(\omega) = \gamma(0) + 2 \sum_{t} \gamma(t) \cos(t\,\omega)$

or in normalized form:

$f(\omega)/\gamma(0) = 1 + 2 \sum_{t} \rho(t) \cos(t\,\omega)$

where $\gamma(t)$ is the autocovariance function and $\rho(t)$ is the autocorrelation function, and $0 \le \omega \le \pi$. Its integrated form is called the spectral function:

$F(\omega) = \int_{0}^{\omega} f(\theta)\, d\theta$
► spectral function [TIME] → spectral density function

► spectral map analysis (SMA) [MULT]
Dimension reduction and display technique related to the biplot and correspondence factor analysis. It was developed for the graphical analysis of drug contrasts. Contrast is defined here as the logarithm of an activity ratio (specificity) in proportion to its mean. The word spectra here refers to the activity spectra of drugs, i.e. the logarithm of activities in various tests. The map of compounds described by their activity spectra is obtained after special scaling.

► spectrum [ALGE] → eigenanalysis

► spectrum [TIME] : spectral density function

► spherical data [PREP] → data

► spline [REGR]
Function estimate obtained by fitting piecewise polynomials. The x range is split into intervals separated by so-called knot locations. A polynomial is fitted in each interval, with the constraint that the function be continuous at the knot locations. The integral and derivative of a spline are also splines, of one degree higher or lower, often also with a continuity constraint. The degree of a spline can range from zero to very high; however, first-, second-, and third-degree splines are of most practical use.
[Figure: a spline fitted to a scatter plot of y against x, with three knot locations marked along the x axis.]
A spline is defined by its degree, by the number of knot locations, by the position of the knots and by the coefficients of the polynomial fitted in each interval. A spline of degree m with N knot locations ($t_k$, k = 1, N) can be written in a general form as:

$y_i = \sum_{j=0}^{m} b_{0j}\, x_i^j + \sum_{k=1}^{N} \sum_{j=0}^{m} b_{kj}\, (x_i - t_k)_+^j + e_i$

where $x_i^j$ and $(x_i - t_k)_+^j$ are called basis functions. The notation $(\cdot)_+$ means the positive part, i.e.

$(x_i - t_k)_+^j = (x_i - t_k)^j$ if $x_i > t_k$, and $(x_i - t_k)_+^j = 0$ if $x_i \le t_k$

This representation casts the spline as an ordinary regression equation. The coefficients $b_{0j}$ and $b_{kj}$ are estimated by minimizing the least squares criterion. Depending on the continuity requirements at the various knot locations, not all of the above basis functions are present in a spline, i.e. some of the $b_{kj}$ coefficients are zero. A frequently used spline of degree m with N knots and with continuity constraints on the function and on its derivatives up to degree m − 1 has the form:

$y_i = \sum_{j=0}^{m} b_j\, x_i^j + \sum_{k=1}^{N} b_k\, (x_i - t_k)_+^m + e_i$
The number of coefficients in such a spline is m + N + 1. There are several equivalent basis function representations of the same spline. Another form of the above spline is:

$y_i = b_0 + \sum_{k=1}^{N} b_k\, [\,s_k (x_i - t_k)\,]_+^m + e_i$

where $s_k$ is either +1 or −1. In fitting a spline one must select the degree m and the number and location of the knots. The degree of the spline is sometimes fixed a priori. The number and location of the knots are either fixed or variable. Splines with variable knot locations are called adaptive splines; they offer a more flexible function approximation than splines with fixed knot locations. The bias-variance trade-off is controlled by the degree of the fitted polynomial and the number of knots. Increasing the degree and the number of knots increases the variance and decreases the bias of the spline. Ideally, one should estimate the optimal degree m, the optimal number of knots N and the optimal knot locations by cross-validation to obtain the best predictive spline.
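A sketch of the truncated power basis representation above, fitting a first-degree spline with fixed knots by ordinary least squares; the data and knot positions are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(0, 6, size=100))
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

def spline_basis(x, knots, degree=1):
    """Truncated power basis: 1, x, ..., x^m and (x - t_k)_+^m for each knot t_k."""
    cols = [x ** j for j in range(degree + 1)]
    cols += [np.clip(x - t, 0, None) ** degree for t in knots]
    return np.column_stack(cols)

knots = [1.5, 3.0, 4.5]                        # N = 3 fixed knot locations
B = spline_basis(x, knots, degree=1)           # m + N + 1 = 5 coefficients
coef, *_ = np.linalg.lstsq(B, y, rcond=None)   # least squares estimate of the coefficients
fitted = B @ coef
print(coef.round(3))
```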
► spline partial least squares regression (SPLS) [REGR] → partial least squares regression

► split-plot design [EXDE] → design

► spurious correlation [DESC] → correlation

► square matrix [ALGE] → matrix

► square root transformation [PREP] → transformation

► square transformation [PREP] → transformation

► stability of clustering [CLUS] → assessment of clustering

► stacking [GRAPH] → dot plot

► stagewise regression [REGR] → variable subset selection
► standard addition method (SAM) [REGR]
Calibration procedure used in chemistry to correct for matrix effects. The chemical sample is divided into several equal-volume aliquots and increasing amounts of standard are added to all but one aliquot. Each aliquot is diluted to the same volume, a response $y_i$ is measured and plotted as a function of $x_i$, the amount of standard added. The regression model is:

$y_i = b_1(\theta + x_i) + e_i \qquad i = 1, n$

where $\theta$ denotes the unknown amount of the analyte. The intercept is the response for the aliquot without standard addition,

$b_0 = b_1 \theta$

therefore the unknown amount of analyte is given by $b_0/b_1$. The key assumption is that the linearity of the model holds over the range of the calibration, including zero response. SAM cannot be used when spectral interferences are present. The generalized standard addition method (GSAM) is the multivariate extension of SAM used to correct for spectral interferences and matrix effects simultaneously. The key equations are

$\mathbf{r}_0 = \mathbf{K}^T \mathbf{c}_0 \qquad \Delta\mathbf{R} = \Delta\mathbf{C}\,\mathbf{K}$

where $\mathbf{r}_0$ is the response vector of p sensors, $\mathbf{c}_0$ is the concentration vector of n analytes, K is the n × p calibration matrix, and ΔR and ΔC are the changes in response and concentration, respectively, for m standard additions.
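A small sketch of the univariate standard addition calculation: fit a straight line of response against added standard and estimate the unknown amount as b0/b1; the measurements are invented.

```python
import numpy as np

# amount of standard added to each aliquot, and measured responses (hypothetical)
x_added = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
response = np.array([2.1, 3.9, 6.2, 8.0, 10.1])

# ordinary least squares fit: y = b0 + b1 * x
b1, b0 = np.polyfit(x_added, response, deg=1)

analyte_amount = b0 / b1      # unknown amount of analyte, in the same units as x_added
print(round(analyte_amount, 3))
```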
► standard deviation [DESC] → dispersion

► standard deviation chart [QUAL] → control chart (○ variable control chart)

► standard deviation of error of calculation (SDEC) [MODEL] → goodness of fit

► standard deviation of error of prediction (SDEP) [MODEL] → goodness of prediction

► standard error [MODEL] → goodness of fit

► standard error of estimate [ESTIM]
Standard deviation of an estimated value. For example, the standard error of the mean (SEM) calculated from n observations is $s/\sqrt{n}$, where s is the standard deviation of the n observations. The standard errors of the estimated regression coefficients $\hat{b}$ and of the estimated response $\hat{y}_i$ in OLS are obtained from the matrix $(\mathbf{X}^T\mathbf{X})^{-1}$; for example

$se(\hat{y}_i) = s\,\sqrt{\mathbf{x}_i^T (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{x}_i}$

where s is the residual standard deviation.

► standard error of the mean (SEM) [ESTIM] → standard error of estimate

► standard order of runs [EXDE] : Yates order

► standard score [PREP] (○ autoscaling) : standardization
► standardization [PREP]
Simple transformation of the elements of a data matrix. It can be performed columnwise (called variable standardization), rowwise (called object standardization), both ways (called double standardization), or elementwise (called global standardization). Variable standardization results in variables which are independent of the unit of measurement. Scale variant estimators are greatly influenced by the previously performed standardization. Object standardization often results in closed data. The most common standardization procedures follow.

autoscaling
One of the most common column standardizations, composed of a column centering and a column scaling:

$x'_{ij} = (x_{ij} - \bar{x}_j)/s_j$

The mean of an autoscaled variable is 0 and the variance is 1. An autoscaled variable is often simply called a standardized variable; its value is called the z-score or standard score.
centering
Scale shift (translation) by subtracting a constant (the mean), resulting in zero mean of the standardized elements. Centering can be:
- row centering: $x'_{ij} = x_{ij} - \bar{x}_i$, with $\bar{x}_i = \sum_j x_{ij} / p$
- column centering: $x'_{ij} = x_{ij} - \bar{x}_j$, with $\bar{x}_j = \sum_i x_{ij} / n$
- double centering: $x'_{ij} = x_{ij} - \bar{x}_i - \bar{x}_j + \bar{x}$
- global centering: $x'_{ij} = x_{ij} - \bar{x}$, with $\bar{x} = \sum_i \sum_j x_{ij} / (np)$
logarithmic scaling
Scale shift based on a logarithmic transformation followed by column centering, to mitigate extreme differences between variances:

$x'_{ij} = \log(x_{ij}) - \frac{1}{n}\sum_i \log(x_{ij})$

maximum scaling
Column standardization where each value is divided by the maximum value of its column:

$x'_{ij} = x_{ij} / \max_i (x_{ij})$

All the values in a maximum scaled variable have an upper limit of one.
profile
Standardization that results in unit sum or unit sum of squares of the standardized elements. The profiles can be
- row profile: $x'_{ij} = x_{ij} / \sum_j x_{ij}$
- normalized row profile: $x'_{ij} = x_{ij} / \sqrt{\sum_j x_{ij}^2}$
- column profile: $x'_{ij} = x_{ij} / \sum_i x_{ij}$
- normalized column profile: $x'_{ij} = x_{ij} / \sqrt{\sum_i x_{ij}^2}$
- global profile: $x'_{ij} = x_{ij} / \sum_i \sum_j x_{ij}$
- normalized global profile: $x'_{ij} = x_{ij} / \sqrt{\sum_i \sum_j x_{ij}^2}$
range scaling
Column standardization where each value in the column is centered by the minimum value of the column $L_j$ and divided by the range of the column $U_j - L_j$:

$x'_{ij} = (x_{ij} - L_j)/(U_j - L_j)$

In a range-scaled variable all values lie between 0 and 1. Range scaling where the values of the variable are expanded or compressed between prespecified limits $A_j$ (lower) and $B_j$ (upper) is called generalized range scaling:

$x'_{ij} = A_j + (B_j - A_j)(x_{ij} - L_j)/(U_j - L_j)$
scaling
Scale expansion (contraction) by dividing by a constant (the standard deviation), resulting in unit variance of the standardized elements. Scaling can be
- row scaling: $x'_{ij} = x_{ij}/s_i$, with $s_i = \sqrt{\sum_j (x_{ij} - \bar{x}_i)^2 / p}$
- column scaling: $x'_{ij} = x_{ij}/s_j$, with $s_j = \sqrt{\sum_i (x_{ij} - \bar{x}_j)^2 / n}$
- global scaling: $x'_{ij} = x_{ij}/s$, with $s = \sqrt{\sum_i \sum_j (x_{ij} - \bar{x})^2 / (np)}$
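A brief NumPy sketch of two of the column standardizations above, autoscaling and range scaling; the toy matrix is invented.

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.uniform(low=[0, 100], high=[1, 500], size=(10, 2))   # two variables on very different scales

# autoscaling: column centering followed by column scaling (mean 0, variance 1)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# range scaling: values of each column mapped to [0, 1]
R = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(Z.mean(axis=0).round(6), Z.std(axis=0).round(6))
print(R.min(axis=0), R.max(axis=0))
```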
► standardized linear combination (SLC) [ALGE]
Linear equation in which the sum of the squared coefficients is equal to one, i.e. the coefficient vector has unit length. For example, principal components calculated from a correlation matrix are standardized linear combinations.

► standardized regression coefficient [REGR] → regression coefficient

► standardized residual [REGR] → residual

► standardized residual plot [GRAPH] → scatter plot (○ residual plot)

► standardized variable [PREP] → variable

► star point [EXDE] → design (○ composite design)

► star symbol [GRAPH] → graphical symbol
► state-space model [TIME] (: Bayesian forecast, dynamic linear model, Kalman filter)
Linear model in which the parameters are not constant, but change in time. The linear equation, called the observational equation, is:

$y(t) = \mathbf{b}(t)\,\mathbf{x}(t) + e(t)$

The response y(t) is a quantity observed in time, x(t) is the known predictor vector, and e(t), the error term, is a white noise process. The parameter vector b(t), called the state vector, is a time series described by the state equation:

$\mathbf{b}(t) = \mathbf{G}\,\mathbf{b}(t-1) + \mathbf{c}\,a(t)$
where a(t) is a white noise process independent of e(t), and G and c are coefficients.

► stationarity [TIME]
The phenomenon that the probabilistic structure of a time series x(t) does not change with time. In practice it implies that the mean and the variance of the series are independent of time, and the covariance depends only on the separation in time. Stationarity allows replication within a time series, thus making formal inference possible.

► stationary process [TIME] → stochastic process
► stationary time series [TIME] → stochastic process (○ stationary process)

► statistic [DESC]
Numerical summary of a set of observations; a particular value of an estimator. If the observations are regarded as a sample from a population, then the calculated statistic is taken to be an estimate of the population parameter. For example, the arithmetic mean of a set of observations can be used as an estimate of the population mean.

► statistical distribution [PROB] : distribution

► statistical process control (SPC) [QUAL] → control chart

► statistical quality control [QUAL] : quality control

► statistics [DESC]
A branch of mathematics concerned with collecting, organizing, analyzing and interpreting data. There are two major problems in statistics: estimation and hypothesis testing. In inference statistics the data set under consideration is a sample from a population, the calculated statistic is taken as an estimate of the population parameter, and the conclusions about the properties of the data set are translated to the underlying population. In contrast, the goal of descriptive statistics is simply to analyze, model or summarize the available data without further inference. A data set can be described by frequency, location, dispersion, skewness, kurtosis, and quantiles. The relationship between two variables can be described by association, correlation, or covariance. The scatter matrix, correlation matrix, covariance matrix, and multivariate dispersion are characteristics of multivariate data. Statistics may also be divided into theoretical statistics and data analysis.
► steepest ascent optimization [OPTIM] → steepest descent optimization

► steepest descent optimization [OPTIM]
Gradient optimization that minimizes a function f by estimating the optimal parameter values following the negative gradient direction. The iterative procedure starts with an initial guess for the p parameter values $\mathbf{p}_0$. In each iteration i one calculates the gradient, i.e. the partial first derivatives of the function with respect to the parameters:

$\mathbf{g}_i = (\partial f/\partial p_{1i},\ \partial f/\partial p_{2i},\ \ldots,\ \partial f/\partial p_{pi})$

and a new set of parameter values is obtained as:

$\mathbf{p}_{i+1} = \mathbf{p}_i - \delta_i\,\mathbf{A}_i\,\mathbf{g}_i$

where the step size $\delta_i$ is determined by a linear search optimization. Steepest descent is a gradient optimization method in which $\mathbf{A}_i$ is the identity matrix I. Moving along the negative gradient direction ensures that the function value decreases at the fastest rate. However, this is only a local property, so frequent changes of direction are often necessary, making convergence very slow; hence the optimization is quite inefficient. The method is sensitive to small perturbations in direction and step size, so these must be computed to high precision. The main problem with steepest descent is that the second derivatives, which describe the curvature of the function near the minimum, are not taken into account. Because of its drawbacks this optimization is seldom used nowadays. The opposite procedure, which maximizes a function by searching for the optimal parameter values along the positive gradient, is called steepest ascent optimization.

► stem-and-leaf diagram [GRAPH]
Part graphical, part numerical display of a univariate distribution. As in the histogram, the range of the data is partitioned into intervals. These intervals are established by first writing all possible leading digits in the range of the data to the left of a vertical line. Each object in the interval is then represented by its trailing digit, written to the right of the vertical line. The leading digits on the left form the stem and the trailing digits on the right are called the leaves.
DATA: 100 102 111 115 117 120 129 131 133 133 141 143 144 144 144 145 152 158 158 159 160 163 164 172 181 186 195

stem | leaf
 10  | 0 2
 11  | 1 5 7
 12  | 0 9
 13  | 1 3 3
 14  | 1 3 4 4 4 5
 15  | 2 8 8 9
 16  | 0 3 4
 17  | 2
 18  | 1 6
 19  | 5

This is a compact way to record the data, while also giving visual information about the shape of a distribution. The length of each row represents the density of objects in the corresponding interval. It is often necessary to change measurement units or to ignore some digits to the right.

► step direction [OPTIM] → optimization
► step size [OPTIM] → optimization

► stepwise linear discriminant analysis (SWLDA) [CLAS] → discriminant analysis

► stepwise regression (SWR) [REGR] → variable subset selection

► stochastic model [MODEL] → model
► stochastic process [TIME] (: random process)
Random phenomenon that can be described by at least one random variable x(t), where t is a parameter belonging to an index set T. Usually t is interpreted as time, but it can also refer to a distribution in space. The process can be either continuous or discontinuous. Random walk and white noise are special stochastic processes. A list of the most important stochastic processes follows.

counting process
Integer-valued, continuous stochastic process N(t) of a series of events, in which N(t) represents the total number of occurrences of the event in the time interval (0, t). If the time intervals between successive occurrences (interarrival times) are i.i.d. random variables, the process is called a renewal process. If these time intervals follow an exponential distribution, the process is called a Poisson process. If the series of occurrences are repeated trials with two outcomes (e.g. success or failure), the process is called a Bernoulli process.

ergodic process
Stochastic process in which the time average of a single record x(t) is approximately equal to the ensemble average. The ergodic property of a stochastic process is commonly assumed to be true in engineering and the physical sciences, therefore parameters may be estimated from the analysis of a single record.

independent increment process
Stochastic process in which the quantities x(t + 1) − x(t) are statistically independent.

Markov process
Stochastic process in which the conditional probability distribution at any point x(t) depends only on the immediate past value x(t − 1), but is independent of the history of the process prior to t − 1. A Markov process having discrete states is called a Markov chain, while a Markov process with continuous states is called a diffusion process.

narrow-band process
Stationary stochastic process, continuous in time and state:

$x(t) = A(t)\,\cos[c\,t + \phi(t)]$

where c is a constant, A(t) is the amplitude and φ(t) is the phase of the process. A stochastic process that does not satisfy this condition is called a wide-band process.

normal process
Stochastic process in which at any given time t the random variable x(t) is normally distributed.

shot noise process
Stochastic process induced by a sequence of impulses applied to a system at random time points $t_n$:

$x(t) = \sum_{n=1}^{N(t)} A_n\, w(t, t_n)$

where $w(t, t_n)$ is the response of the system at time t resulting from an impulse $A_n$ at time $t_n$, and N(t) is a counting process with interarrival times $t_n$.

stationary process
Stochastic process with stationarity. A process that does not satisfy stationarity is called an evolutionary process. A time series that is a stationary process is called a stationary time series.
Wiener-Levy process
Stationary independent increment process in which every independent increment is normally distributed, the average value of x(t) is zero, and x(0) = 0. The most common Wiener-Levy process is the Brownian motion process. It is also widely used in other fields such as quantum mechanics and electric circuits.

► stochastic variable [PREP] → variable

► strategy [MISC] → game theory

► stratified sampling [PROB] → sampling

► Studentized residual [REGR] → residual

► Studentized residual plot [GRAPH] → scatter plot (○ residual plot)

► Student's t distribution [PROB] → distribution

► subdiagonal element [ALGE] → matrix

► submatrix [ALGE] → matrix operation (○ partitioning of a matrix)

► subsampling [PROB] → sampling (○ cluster sampling)

► subtraction of matrices [ALGE] → matrix operation

► sufficient estimator [ESTIM] → estimator

► sum of squares in ANOVA (SS) [ANOVA]
Column in the analysis of variance table containing the squared deviations of the observations from the grand mean or from an effect mean, summed over the observations. It is customary to indicate summation over an index by replacing that index with a dot. For example, in a one-way ANOVA model the effect of
treatment $A_i$ is calculated as:

$A_i = \bar{y}_{i\cdot} = Y_{i\cdot}/K = \sum_k y_{ik}/K$

where K is the number of observations at level i, and the sum of squares associated with the effect A is:

$SS_A = K \sum_i (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2$
► sum of squares linkage [CLUS] → hierarchical clustering (○ agglomerative clustering)

► superdiagonal element [ALGE] → matrix

► supervised learning [MULT] → pattern recognition

► supervised pattern recognition [MULT] → pattern recognition

► survival function [PROB] → random variable

► sweeping [ALGE] → Gaussian elimination

► symmetric design [EXDE] → design

► symmetric distribution [PROB] → random variable

► symmetric matrix [ALGE] → matrix

► symmetric test [TEST] → hypothesis testing

► synapse [MISC] → neural network

► systematic distortion [ESTIM] : bias

► systematic sampling [PROB] → sampling
T

► t distribution [PROB] → distribution

► t test [TEST] → hypothesis test

► Taguchi method [QUAL]
Quality control approach suggesting that statistical testing of a product should be carried out at the design level, called off-line quality control, in order to make the product robust against variations in manufacturing. This proposal is different from traditional on-line quality control, such as acceptance sampling and statistical process control. The Taguchi method is based on minimizing the variability of the product or process, either by keeping the quality on a target value or by optimizing the output. The quality is measured by statistical variability, for example mean squared error or standard deviation, rather than by percentage of defects or other criteria based on control limits. Taguchi makes a distinction between variables that can be controlled and noise variables, and suggests systematically including the noise variables in the parameter design. The variables in the parameter design can be classified into two groups: the ones that affect the mean response and the ones that affect the variability of the response.
► Tanimoto coefficient [GEOM] → distance (○ binary data)

► target rotation [FACT] → factor rotation

► target transformation factor analysis (TTFA) [FACT] → factor rotation
► taxi distance [GEOM] → distance (○ quantitative data)

► tensor [ALGE]
Mathematical object, a generalization of the vector relative to a local Euclidean space, that possesses a specified system of components for every coordinate system that changes under a transformation of coordinates. The simplest tensors are the building blocks of linear algebra: zero-order tensors are scalars, first-order tensors are vectors, and second-order tensors are matrices.

► term in ANOVA [ANOVA]
Categorical predictor variable in the analysis of variance model. There are two kinds of terms: a main effect term and an interaction term. A main effect term, also called an effect, is a measurable or observable quantity that affects the outcome of the observations. It is measured on a nominal scale, i.e. it assumes categorical values, called levels. The effects are commonly denoted by consecutive upper case letters (A, B, C, etc.), the levels are indicated by a lower case index, and the number of levels by the corresponding upper case letter:

$A_i \quad i = 1, I \qquad B_j \quad j = 1, J$

An interaction term consists of more than one effect, and takes on values corresponding to the possible combinations of the levels of the effects, also called a treatment. There are two ways to combine effects. In a crossed effect term all combinations of levels are possible. For example, if effect $A_i$ has two levels and effect $B_j$ has four levels, then the term $AB_{ij}$ has eight levels: (1,1) (1,2) (1,3) (1,4) (2,1) (2,2) (2,3) (2,4). In a nested effect term the level combinations are restricted. For example, the same B effect nested in A results in the term $B(A)_{j(i)}$ with four levels: (1,1) (1,2) (2,3) (2,4). An effect that has a fixed number of levels I is called a fixed effect. The corresponding main effect term in the ANOVA model is not considered to be a random variable. An interaction term that contains only fixed effects is also fixed, i.e. it is not random. A model containing only fixed effects is called a fixed effect model, or a first kind model or model I. In this model the treatment effects $A_i$, $B_j$, $AB_{ij}$, etc. are defined as deviations from the grand mean, therefore:

$\sum_i A_i = 0 \qquad \sum_j B_j = 0 \qquad \sum_i AB_{ij} = \sum_j AB_{ij} = 0$
In this model conclusions from testing $H_0$: $A_i = 0$ apply only to the I levels included in the model. An effect that has a large number of possible levels, from which I levels have been randomly selected, is called a random effect. The corresponding main effect term in the ANOVA model is a random variable. All interaction terms that contain random effects are also considered to be random. A model containing only random effects is called a random effect model, or a second kind model or model II. The variance of a random effect term (and of the error term) is called a variance component or component of variance, denoted as $\sigma_A^2$, $\sigma_B^2$, $\sigma_{AB}^2$, etc. In a random effect model conclusions from testing $H_0$: $\sigma_A^2 = 0$ apply beyond the effect levels in the model, i.e. an inference is drawn about the entire population of effect levels. A model containing both fixed effects and random effects is called a mixed effect model.

► terminal node [MISC] → graph theory (○ digraph)
► test [TEST] : hypothesis test

► test set [PREP] → data set

► test statistic [TEST] → hypothesis testing

► theoretical distribution [PROB] → random variable

► theoretical variable [PREP] → variable

► three-dimensional motion graphics [GRAPH] → interactive computer graphics

► tied rank [DESC] → rank
► time series [TIME]
Set of observations x(t), ordered in time, where t indicates the time when x(t) was taken. The observations are often equally spaced in time. In the case of multivariate observations the scalar x(t) is replaced by a vector x(t). It is assumed that the observations are realizations of a stochastic process, x(t) is a random variable, and the observations made at different time points are statistically dependent. The multivariate joint distribution is described by the mean function, autocovariance function, cross-covariance function, and spectral density function. A time series can be written as a sum of four components: trend, fluctuation about the trend, seasonal component, and random component. Transforming one time series x(t) into another time series y(t) is called filtering. The simplest is the linear filter:

$y(t) = a\,x(t)$

► time series analysis (TSA) [TIME]
Analysis of time series, i.e. of series of data collected as a function of time. A time series model is a mathematical description of such data. The goal of the analysis is twofold: modeling the stochastic mechanism that gives rise to an observed series and forecasting (predicting) future values of the series on the basis of its history. Another common objective is to monitor the series and to detect changes in a trend.
► time series model [TIME]
Mathematical description of a time series. It is composed of two parts: one contains past values of the time series,

$x(t - j) \qquad j = 0, p$

and the other contains terms of a white noise process,

$a(t - i) \qquad i = 0, q$

The parameters p and q define the complexity of the model; they are indicated in parentheses after the model name. The most commonly used models are the following.
autoregressive integrated moving average model (ARIMA)
Model that can be used when a time series is nonstationary, i.e. μ(t) is not constant in time. The ARIMA(p, 1, q) model written in difference equation form is:

$x(t) - x(t-1) = \sum_j b_j\,[x(t-j) - x(t-j-1)] + a(t) - \sum_i c_i\, a(t-i)$

The simplest ARIMA models are the IMA(1,1) model:

$x(t) - x(t-1) = a(t) - c\,a(t-1)$

and the ARI(1,1) model:

$x(t) - x(t-1) = b\,[x(t-1) - x(t-2)] + a(t)$

The ARIMA model can also be written for differences between other than neighboring points, denoted ARIMA(p, d, q). For example, ARIMA(p, 2, q) is a model for x(t) − x(t − 2). The dth difference of ARIMA(p, d, q) is a stationary ARMA(p, q) model.
autoregressive model (AR)
Model in which each point is represented as a linear combination of the p most recent past values of itself, plus a white noise term which is independent of all x(t − j) values:

$x(t) = \sum_{j=1}^{p} b_j\,x(t-j) + a(t)$

The simplest AR(p) is AR(1), a first-order model:

$x(t) = b\,x(t-1) + a(t)$

with autocovariance and autocorrelation functions

$\gamma(0) = \sigma_a^2/(1 - b^2) \qquad \gamma(t) = b\,\gamma(t-1) \qquad \rho(t) = b^t$

For a higher-order model, assuming stationarity and zero means, the autocorrelation and autocovariance functions are defined by the Yule-Walker equations.

autoregressive moving average model (ARMA)
The ARMA(p, q) model is a mixture of the AR(p) and MA(q) models:

$x(t) = \sum_{j=1}^{p} b_j\,x(t-j) + a(t) - \sum_{i=1}^{q} c_i\,a(t-i)$

Each point is represented as a linear combination of past values of itself, and of past and present terms of a white noise process. The simplest ARMA model is the ARMA(1,1):

$x(t) = b\,x(t-1) + a(t) - c\,a(t-1)$

with autocovariance and autocorrelation functions

$\gamma(0) = (1 - 2bc + c^2)\,\sigma_a^2/(1 - b^2) \qquad \gamma(1) = b\,\gamma(0) - c\,\sigma_a^2$
$\gamma(t) = b\,\gamma(t-1) \qquad \rho(t) = (1 - bc)(b - c)\,b^{t-1}/(1 - 2bc + c^2)$
Box-Jenkins model : autoregressive moving average model

moving average model (MA)
Model, based on moving averages, in which each point is represented as a weighted linear combination of present and past terms of a white noise process:

$x(t) = a(t) - \sum_{i=1}^{q} c_i\,a(t-i)$

The simplest MA(q) is MA(1), a first-order model:

$x(t) = a(t) - c\,a(t-1)$

with autocovariance and autocorrelation functions

$\gamma(0) = \sigma_a^2(1 + c^2) \qquad \gamma(1) = -c\,\sigma_a^2 \qquad \rho(1) = -c/(1 + c^2) \qquad \gamma(t) = \rho(t) = 0 \ \text{for}\ t \ge 2$
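A short sketch simulating an AR(1) series and recovering its coefficient from the lag-one autocorrelation (using the relation ρ(1) = b given above); the coefficient, noise level, and series length are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(9)
b, n = 0.7, 2000                      # AR(1) coefficient and series length

# simulate x(t) = b * x(t-1) + a(t), a(t) white noise
x = np.zeros(n)
for t in range(1, n):
    x[t] = b * x[t - 1] + rng.normal()

# moment estimate: the lag-one autocorrelation approximates b for an AR(1) process
x0 = x - x.mean()
rho1 = (x0[1:] * x0[:-1]).sum() / (x0 ** 2).sum()
print(round(rho1, 3))                 # should be close to 0.7
```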
► time series plot [GRAPH] → scatter plot

► tolerance limits [QUAL] → lot

► top-down induction decision tree method (TDIDT) [CLAS] : classification tree method

► total calibration [REGR] → calibration
► total sum of squares (TSS) [MODEL]
Sum of squared differences between the observed values and their mean:

$TSS = \sum_i (y_i - \bar{y})^2$

TSS can be partitioned into two components: the model sum of squares, MSS, and the residual sum of squares, RSS:

$TSS = MSS + RSS = \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i (y_i - \hat{y}_i)^2$

TSS is an estimate of the variability of $y_i$ without any model.
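A tiny numerical check of the decomposition above for an ordinary least squares fit (for OLS with an intercept the partition TSS = MSS + RSS holds exactly); the data are invented.

```python
import numpy as np

rng = np.random.default_rng(10)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)

b1, b0 = np.polyfit(x, y, deg=1)      # OLS straight-line fit
y_hat = b0 + b1 * x

tss = ((y - y.mean()) ** 2).sum()
mss = ((y_hat - y.mean()) ** 2).sum()
rss = ((y - y_hat) ** 2).sum()
print(round(tss, 4), round(mss + rss, 4))   # the two numbers agree
```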
► total variation [DESC] → multivariate dispersion

► trace of a matrix [ALGE] → matrix operation

► training - evaluation set split [MODEL] → model validation

► training set [PREP] → data set

► transfer function [MISC] → neural network

► transformation [PREP]
Mathematical function for transforming the original values of a variable x to new values x':
$x' = f(x)$

Transformations are used to:
- stabilize the variance;
- linearize the relationship among variables;
- normalize the distribution;
- represent results on a more convenient scale;
- mitigate the influence of outliers.

Transformation can be considered as a correction for undesired characteristics of a variable, such as heteroscedasticity, nonnormality, nonadditivity, or a nonlinear relationship with other variables. The following are the most common transformations.
angular-linear transformation
Transformation that converts an ordinary (linear) variable x into an angular variable t:

$t = \frac{360°\,x}{k}$

where k is the number of units in a full cycle. For example, the time 06:00 can be transformed to 90° (one-fourth of a cycle) with k = 24.

linear transformation
Transformation that can be interpreted as a roto-translation of the original variable:

$x' = a + b\,x$

where a is the translation term (or intercept) and b is the rotation term (or slope). There are two subcases: pure rotation (a = 0) and simple translation of the origin (b = 1).
logarithmic transformation
Transformation that changes multiplicative behavior into additive behavior. For example, in regression a nonlinear relationship can be changed into a linear one:

$x' = \log_a(k + x)$

where the logarithmic base a is usually 10 or e, and k is an additive constant (often 1) to remove zero values. The standard deviation of the new values is proportional to their mean (i.e. the coefficient of variation is constant).

logit transformation
Transformation of percentages, proportions or count ratios (variables with values between zero and 1) in order to obtain a logit scale, where values range between about −3 (for P ≈ 0.05) and +3 (for P ≈ 0.95):

$p' = \ln\!\left(\frac{P}{1 - P}\right)$

metameric transformation
Transformation of the values of a dose or response variable into a dimensionless scale −1, 0 and +1. The transformed values are called metameters. It is often used to simplify the analysis of the dose-response relationship.

probit transformation
Transformation (abbreviation of probability unit) to make negative values very rare in a standard normally distributed variable by adding 5 to each original value:

$x' = x + 5.0$

rankit transformation
Transformation for rank ordering of a quantitative variable:

$x' = \text{rank}(x)$

reciprocal transformation

$x' = 1/x$

square root transformation
Transformation especially applied to variables from the Poisson distribution, such as count data. In this case the variance is proportional to the mean and can be stabilized by using:

$x' = \sqrt{x + 0.5}$ or $x' = \sqrt{x + 3/8}$
square transformation
Transformation particularly useful for correcting a skewed distribution:

$x' = x^2$

trigonometric transformation
Transformations using trigonometric functions:

$x' = \sin(x) \qquad x' = \cos(x) \qquad x' = \tan(x)$
$x' = \arcsin(x) \qquad x' = \arccos(x) \qquad x' = \arctan(x)$

The arcsine transformation is often used to stabilize the variance, i.e. to make it close to constant for different populations from which x has been drawn. It is particularly used on percentages or proportions p to change their (usually) binomial distribution into a nearly normal distribution:

$p' = \arcsin(\sqrt{p})$

The behavior of this transformation is not optimal at the extreme values; this can be improved by using a modified form of the transformation.
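A few of the transformations above expressed as one-liners in NumPy; the count and proportion data are invented, and the constants follow the formulas given in this entry.

```python
import numpy as np

counts = np.array([0, 3, 8, 15, 30])          # Poisson-like count data
p = np.array([0.05, 0.2, 0.5, 0.8, 0.95])     # proportions between 0 and 1

log_t = np.log10(counts + 1)                  # logarithmic transformation, k = 1
sqrt_t = np.sqrt(counts + 0.5)                # square root (variance stabilizing)
logit_t = np.log(p / (1 - p))                 # logit transformation
arcsin_t = np.arcsin(np.sqrt(p))              # arcsine transformation of proportions

print(logit_t.round(2))                       # roughly -2.9 ... +2.9
```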
► transformation matrix [ALGE] : rotation matrix

► transpose of a matrix [ALGE] → matrix operation

► treatment [EXDE] → factor

► treatment structure [EXDE] → design

► tree [MISC] → graph theory

► tree symbol [GRAPH] → graphical symbol
► trend [TIME]
Long-term movement in a time series, a smooth, relatively slowly changing component. It is a nonrandom function of the time series: μ(t) = E[x(t)]. The trend can be various functions of time, for example:
- linear: $\mu(t) = b_0 + b_1 t$
- quadratic: $\mu(t) = b_0 + b_1 t + b_2 t^2$
- cyclical: $\mu(t) = \mu(t + 12)$
- cosine: $\mu(t) = b_0 + b_1 \cos(2\pi f t) + b_2 \sin(2\pi f t)$
► triangle inequality [GEOM] → distance

► triangular diagram [GRAPH]
Ternary coordinate system, in the form of an equilateral triangle, in which the values of the three variables sum to one. The quantities represented on the three coordinates are proportions rather than absolute values. This diagram is particularly useful for studying mixtures of three components.

[Figure: triangular (ternary) diagram with vertices corresponding to the three components x1, x2, x3.]

► triangular distribution [PROB] → distribution
F triangular factorization [ALGE] + matrix decomposition
triangular kernel [ESTIM] + kernel b
1
two-stage nested anarysis of variance [ANOVA] b
triangular matrix [ALGE]
+ matrix
tridiagonal matrix [ALGE] + matrix b
b
tridiagonalization [ALGE] matrix decomposition
-+
trigonometric transformation [PREP] → transformation
trimmed estimator [ESTIM] → estimator
trimmed mean [DESC] → location
trimmed variance [DESC] → dispersion
true error rate [CLAS] → classification (◊ error rate)
truncation error [OPTIM] → numerical error
Tukey's quick test [TEST] → hypothesis test
Tukey's test [TEST] → hypothesis test
two-level factorial design [EXDE] → design
two-norm [ALGE] → norm (◊ matrix norm)
two-sided test [TEST] → hypothesis testing
two-stage nested analysis of variance [ANOVA] → analysis of variance
two-stage sampling [PROB] → sampling (◊ cluster sampling)
two-way analysis of variance [ANOVA] → analysis of variance
two-way clustering [CLUS] → cluster analysis
type I error [TEST] → hypothesis testing
type II error [TEST] → hypothesis testing

U

u-chart [QUAL] → control chart (◊ attribute control chart)
U-shaped distribution [PROB] → random variable
ultrametric distance [GEOM] → distance
ultrametric inequality [GEOM] → distance
unbalanced factorial design [EXDE] → design
unbiased estimator [ESTIM] → estimator
unconditional error rate [CLAS] → classification (◊ error rate)
uncorrelated vectors [ALGE] → vector
underdetermined system [PREP] → object
underfitting [MODEL] → model fitting
undirected graph [MISC] → graph theory (◊ digraph)
unequal covariance matrix classification (UNEQ) [CLAS]  Parametric classification method that is a variation of quadratic discriminant analysis. Each class g is represented by its centroid c_g and its covariance matrix S_g. Object x_i is classified according to its Mahalanobis distance from the class centroids, where the metric is the inverse of the corresponding class covariance matrix S_g:

d²_ig = (x_i − c_g)ᵀ S_g⁻¹ (x_i − c_g)

As this distance metric follows the chi-squared distribution, the probability of an object belonging to a class can be calculated from that distribution. Like SIMCA, UNEQ simplifies the QDA classification rule by omitting the logarithm of the determinant of the class covariance matrices, which means that the class density functions are not properly scaled. In the presence of significant scale differences this usually causes inferior performance.
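A minimal sketch of the UNEQ classification rule just described, assuming the class centroids and covariance matrices have already been estimated from training data; all names are illustrative and no chi-squared probability scaling is attempted:

```python
import numpy as np

def uneq_classify(x, centroids, covariances):
    """Assign x to the class with the smallest Mahalanobis distance,
    computed in the metric of that class's own covariance matrix."""
    d2 = []
    for c, S in zip(centroids, covariances):
        diff = x - c
        d2.append(diff @ np.linalg.solve(S, diff))   # (x - c)^T S^-1 (x - c)
    return int(np.argmin(d2)), d2

# Two classes estimated from simulated training data
rng = np.random.default_rng(1)
class_a = rng.normal([0, 0], [1.0, 1.0], (50, 2))
class_b = rng.normal([3, 3], [0.5, 2.0], (50, 2))
centroids = [class_a.mean(axis=0), class_b.mean(axis=0)]
covariances = [np.cov(class_a, rowvar=False), np.cov(class_b, rowvar=False)]

label, distances = uneq_classify(np.array([2.5, 2.0]), centroids, covariances)
print("assigned class:", label, "squared distances:", np.round(distances, 2))
```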
uniform distribution [PROB] → distribution
uniform shell design [EXDE] → design

unique factor [FACT]  (: specific factor)  Term in the factor analysis model that accounts for the variance not described by the common factors. Its role is similar to that of the error term in a regression model. There is one unique factor for each variable. The unique factors are uncorrelated, both among themselves and with the common factors. Principal component analysis assumes zero unique factors.

unique variance [FACT] → factor analysis

uniqueness [FACT] → communality
unit [PREP]  : object
unit matrix [ALGE] → matrix
univariate [MULT] → multivariate
univariate calibration [REGR] → calibration
univariate data [PREP] → data
univariate distribution [PROB] → random variable
univariate regression model [REGR] → regression model
unsupervised learning [MULT] → pattern recognition
unsupervised pattern recognition [MULT] → pattern recognition
unweighted average linkage [CLUS] → hierarchical clustering (◊ agglomerative clustering)
upper control limit (UCL) [QUAL] → control chart
upper quartile [DESC] → quantile
utility function [QUAL] → multicriteria decision making

V

V mask [QUAL] → control chart (◊ cusum chart)
validation [MODEL]  : model validation
variable [PREP]  (: attribute, characteristic, descriptor, feature)  Characteristic of an object that may take on any value from a specified set. Data for n objects described by p variables are collected in a data matrix X(n, p), where the element x_ij is the jth measurement taken on the ith object. There are several types of variables, listed below, depending on their scale of measurement, on the set of values they can take, and on their relationships with other variables.
angular variable (: circular variable)  Variable that takes on values expressed in terms of angles.

autoscaled variable  : standardized variable

binary variable  Dichotomous variable that takes values of 0 or 1. For example, quantal and dummy variables are binary variables.

blocking variable  Categorical variable in the design matrix that groups runs into blocks. The levels of the blocking variable are defined by the blocking generator.

categorical variable  : qualitative variable

cause variable  Variable that is the cause of change in other (effect) variables.

circular variable  : angular variable

concomitant variable  : covariate

conditionally-present variable  Variable that exists or is meaningful only for some of the objects. The values taken by such a variable may exist (or be meaningful) for an object depending on the value of a dichotomous variable of the present/absent type.

continuous variable  Variable that can take on any numerical value. Between any two values there is another value that such a variable can assume.

control variable  Variable measured in monitoring a process. Its values are collected at fixed time intervals, often recorded on control charts or registered automatically.
covariate (: concomitant variable)  Predictor variable in an ANCOVA model, measured on a ratio scale. Its effect on the response cannot be controlled, only observed.

cyclical variable  Variable in a time series that takes on values that depend on the cycle (period) during which it is measured.

dependent variable (: response variable)  Variable exhibiting statistical dependence on one or more other variables, called independent variables.

dichotomous variable  Discrete variable that can take on only two values. A binary variable is a special dichotomous variable that takes on values of 0 or 1.

discrete variable  In contrast to a continuous variable, a variable that takes on only a finite number of values. These values are usually integer numbers, but some discrete variables also take on ratios of integer numbers. For example, variables measured on a frequency count scale are discrete variables.

dummy variable (: indicator variable)  Binary variable created by converting a qualitative variable into binary ones. Each level of the qualitative variable is represented by one dummy variable set to 1 or 0, indicating whether or not the qualitative variable assumed the corresponding level.

effect variable  Variable in which change is caused by other (cause) variables.

endogenous variable  Variable, mainly used in econometrics, that is measured within a system and affected both by variables in the system and by variables outside the system (exogenous variables). A variable can be endogenous in one system and exogenous in another.

exogenous variable  Variable, mainly used in econometrics, that is measured outside the system. Exogenous variables can affect the behavior of the system described by the endogenous variables, but are not affected by fluctuations in the system.

experimental variable  Variable measured or set during an experiment, in contrast to a theoretical variable, which is calculated from a mathematical model. In experimental design the experimental variables are more commonly called factors.
explanatory variable  : predictor variable

inadmissible variable  Variable that must not be included in a model because it is constant or perfectly correlated with other variables. A variable containing a large number of missing values, measured with too much noise, or highly correlated with other variables is often also considered inadmissible.

independent variable  : predictor variable

indicator variable  : dummy variable

latent variable  Non-observable and non-measurable hypothetical variable, a crucial element of a latent variable model. Part of its effect is manifested in measurable manifest variables. Mainly used in sociology, economics and psychology. An example is the common factor in factor analysis.

lurking variable  Variable that affects the response, but may not be measured or may not even be known to exist. The effect of such variables is gathered in the error term. They may cause significant correlation between two measured variables without providing evidence that those two variables are causally related.

manifest variable  Observable or measurable variable, as opposed to a latent variable, which is not observable or measurable.

multinomial variable  : qualitative variable

multistate variable  : qualitative variable

noise variable  Variable that cannot be controlled during the experiment or the process.

predictor variable (: explanatory variable, independent variable, regressor)  Variable in a regression model as a function of which the response variable is modeled.

process variable  Variable controlled during the experiment or process.
qualitative variable (: categorical variable, multinomial variable, multistate variable)  Variable in which differences between values cannot be interpreted in a quantitative sense and for which only non-arithmetic operations are valid. It can be measured on nominal or ordinal scales. Examples are: label, color, type.

quantal variable  Binary response variable measuring the presence or absence of response to a stimulus.

quantitative variable  Variable, measured on an interval or ratio scale, for which arithmetic operations are valid.

random variable (: variate, stochastic variable)  Variable that may take on values of a specified set with a defined frequency or probability; a variable whose values are associated with an element of chance or probability.

ranking variable  Variable defined by the ranks of the values of another variable. Rank order statistics are calculated from ranking variables that replace the ranked variable.

reduced variable  : standardized variable

regressor  : predictor variable

response variable  : dependent variable

standardized variable (: autoscaled variable, reduced variable)  Variable standardized by autoscaling, i.e. by subtracting its mean and dividing by its standard deviation. Such a variable has zero mean and unit variance.

stochastic variable  : random variable

theoretical variable  Variable taking values according to a mathematical model, in contrast to an experimental variable, which is measured or set in an experiment.

variate  : random variable
variable control chart [QUAL] → control chart
variable metric optimization [OPTIM]  (: quasi-Newton optimization)  Gradient optimization that tries to overcome a problem of the Newton-Raphson optimization, namely that the Hessian matrix H may become negative definite. It finds a search direction of the form H⁻¹(p_i) g(p_i), where g is the gradient vector and H is a positive definite symmetric matrix, updated at each iteration, that converges to the Hessian matrix. The best known variable metric optimization is the Davidon-Fletcher-Powell optimization (DFP). It begins as steepest descent and changes over to the Newton-Raphson optimization during the iterations by continuously updating an approximation to the inverse of the matrix of second derivatives.
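A minimal sketch of a Davidon-Fletcher-Powell iteration as described above, written with a fixed step length for brevity (a practical implementation would add a line search); all names and the test function are illustrative:

```python
import numpy as np

def dfp_minimize(grad, x0, alpha=0.1, iters=200):
    """Davidon-Fletcher-Powell quasi-Newton minimization (fixed step length)."""
    x = np.asarray(x0, float)
    H = np.eye(x.size)                # approximation to the inverse Hessian
    g = grad(x)
    for _ in range(iters):
        d = -H @ g                    # quasi-Newton direction: -(approx. inverse Hessian) @ gradient
        x_new = x + alpha * d
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        sy = s @ y
        Hy = H @ y
        if sy > 1e-12:                # DFP update, keeping H positive definite
            H = H + np.outer(s, s) / sy - np.outer(Hy, Hy) / (y @ Hy)
        x, g = x_new, g_new
    return x

# Quadratic test function f(x) = x1^2 + 10*x2^2, with its gradient
grad = lambda x: np.array([2 * x[0], 20 * x[1]])
print(dfp_minimize(grad, [5.0, -3.0]))   # converges toward the minimum at (0, 0)
```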
variable reduction [MULT] → data reduction
variable sampling [QUAL] → acceptance sampling
variable step size generalized simulated annealing (VSGSA) [OPTIM] → simulated annealing
variable subset selection [REGR]  (: best subset regression)  Collection of regression methods that model the response variable as a function of a selected subset of the predictor variables only. These techniques are biased regression methods in which the biasing is based on the assumption that not all of the predictor variables are relevant to the regression problem. There are various strategies for finding the optimal subset of predictors.

The forward selection procedure starts with an initial model containing only a constant term and inserts one predictor at a time until a prespecified goodness of prediction criterion is satisfied. The order of insertion is determined by the partial correlation coefficients between the response and the predictors not yet in the model; in other words, at each step the predictor that produces the largest increase in R² is included in the model. With k − 1 predictors already in the model, the partial correlation of a candidate predictor x_j is calculated as the correlation between the residuals from the models y = f(x_1, x_2, ..., x_{k−1}) and x_j = f(x_1, x_2, ..., x_{k−1}), where x_j is not among x_1, ..., x_{k−1}. The contribution of predictor k to the variance described by the regression model is assessed by the partial F value, i.e. the increase in the regression sum of squares due to that predictor divided by the residual mean square of the enlarged model.

The forward selection procedure stops when the partial F value does not exceed a preselected F_in threshold, i.e. when the contribution of the selected predictor to the variance described by the model is no longer significant. Although this procedure improves the regression model at each step, it does not consider the effect of the newly inserted variable on the role of the predictors already in the model.
Backward elimination is the opposite strategy; it starts with the full model containing all the predictors. At each step the predictor with the smallest partial correlation with the response (or, equivalently, the smallest partial F value), i.e. the one whose removal results in the smallest decrease in R², is eliminated from the model. The elimination procedure stops when the smallest partial F value is greater than a preselected F_out threshold, i.e. when all predictors remaining in the model contribute significantly to the variance described by the model. This procedure, however, cannot be used when the full predictor matrix is underdetermined.

The stepwise regression (SWR) method is a combination of the above two strategies: both variable selection and variable elimination are attempted at each step. The procedure starts with only the constant term in the model. At each step the predictor with the largest partial F value is inserted if F > F_in, and the predictor with the smallest partial F value is eliminated if F < F_out. The procedure stops when all predictors in the model have F > F_out and all candidate predictors not in the model have F < F_in. This is the most frequently recommended method, although its success depends on the preselected F_in and F_out.

The above procedures, called sequential variable selection, were developed with computational efficiency in mind, so that only a relatively small number of subsets are actually calculated and compared. As computation becomes cheaper, it is no longer prohibitive to calculate the all possible subsets regression. This examines all 2^p possible combinations of the predictors and chooses the best model on the basis of a goodness of prediction criterion. This method proves to be superior, especially in the case of collinearity.

The above methods calculate a least squares estimate for the variables included in the model; the candidate predictors are decorrelated from the predictors already in the model. Stagewise regression, in contrast, considers predictors on the basis of their correlation with the response rather than their partial correlation, i.e. the candidate predictors are not decorrelated from the model. At each step the correlations between the residual from the existing model and the potential predictors are calculated, and the variable with the largest correlation is inserted into the model. This method does not give the least squares regression coefficients for the variables in the final model and yields a larger mean square error than that of the least squares estimate. Its advantage, however, is that highly correlated variables are allowed to enter the model if they are also highly correlated with the response.
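A minimal sketch of the forward selection strategy described above, using partial F values computed from the increase in R²; the threshold F_in, the function names and the simulated data are illustrative assumptions, and practical details such as ties are ignored:

```python
import numpy as np

def r_squared(X, y, subset):
    """R^2 of the least squares fit of y on an intercept plus the columns in subset."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in subset])
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

def forward_selection(X, y, f_in=4.0):
    n, p = X.shape
    selected, r2 = [], 0.0
    while len(selected) < p:
        best = None
        for j in set(range(p)) - set(selected):
            r2_new = r_squared(X, y, selected + [j])
            dof = n - len(selected) - 2                  # residual d.f. of the enlarged model
            f = (r2_new - r2) * dof / (1.0 - r2_new)     # partial F for candidate j
            if best is None or f > best[0]:
                best = (f, j, r2_new)
        if best[0] < f_in:                               # no significant candidate left
            break
        r2, selected = best[2], selected + [best[1]]
    return selected

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 6))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(0, 0.5, 80)   # only columns 0 and 3 matter
print(forward_selection(X, y))                           # columns 0 and 3 should enter first
```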
variance [DESC] → dispersion
variance analysis [ANOVA]  : analysis of variance
variance component [ANOVA] → term in ANOVA
variance-covariance matrix [DESC]
: covariance matrix

variance inflation factor (VIF) [REGR]  Measure of the effect of an ill-conditioned predictor matrix X on the estimated regression coefficients:

VIF_j = 1 / (1 − R_j²)

where R_j is the multiple correlation coefficient obtained by regressing predictor x_j on all the other predictors x_k, k ≠ j. VIF is a simple measure for detecting collinearity. In the ideal case, when x_j is totally uncorrelated with the other predictors, VIF_j = 1. As R_j tends to 1, i.e. as collinearity increases, VIF_j tends to infinity.
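A minimal sketch of computing variance inflation factors exactly as defined above, by regressing each predictor on all the others; the function name and the simulated data are illustrative:

```python
import numpy as np

def vif(X):
    """Variance inflation factors VIF_j = 1 / (1 - R_j^2) for each column of X."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        xj = X[:, j]
        resid = xj - others @ np.linalg.lstsq(others, xj, rcond=None)[0]
        r2 = 1.0 - resid @ resid / ((xj - xj.mean()) @ (xj - xj.mean()))
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(0, 0.1, 100)          # nearly collinear with x1
x3 = rng.normal(size=100)                  # independent of the others
print(np.round(vif(np.column_stack([x1, x2, x3])), 1))   # large, large, about 1
```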
variance ratio distribution [PROB] → distribution
variance ratio test [TEST] → hypothesis test
variate [PREP] → variable
variate [PROB] → random variable
variation [DESC]  : dispersion
varimax rotation [FACT] → factor rotation
variogram [TIME] → autocovariance function
vector [ALGE]  Row or column of numbers. In contrast to a scalar, a vector has both size and direction. The size of a vector is measured by its norm. Conventionally a vector x is assumed to be a column vector; a row vector is indicated as xᵀ (the transpose of x). For example, multivariate measurements on one object are usually represented by a row vector, while measurements of the same quantity on various individuals or material samples are often collected in a column vector.
Vectors x and y are orthogonal vectors if

xᵀy = 0

A set of p mutually orthogonal vectors of unit norm forms an orthonormal basis of the p-dimensional space. Vectors x and y are uncorrelated vectors if their mean-centered versions are orthogonal, i.e.

Σ_i (x_i − x̄)(y_i − ȳ) = 0

Vectors x and y are linearly independent vectors if

b_x x + b_y y = 0

holds only when b_x = 0 and b_y = 0. Vectors x and y are conjugate vectors if

xᵀ Z y = 0

where Z is a symmetric positive definite matrix.
vector norm [ALGE] → norm
W

Wald-Wolfowitz's test [TEST] → hypothesis test
walk [MISC] → graph theory
Walsh's test [TEST] → hypothesis test
Ward linkage [CLUS] → hierarchical clustering (◊ agglomerative clustering)
warning limit [QUAL] → control chart
Watson's test [TEST] → hypothesis test
Weibull distribution [PROB] → distribution
Weibull growth model [REGR] → regression model
weight [PREP]  Numerical coefficient associated with objects, variables or classes indicating their relative importance in a model. Most statistical models can incorporate observation weights, e.g. the weighted mean, weighted variance and weighted least squares. Observation weights play an important role in regression, for example in generalized least squares regression, robust regression, the biweight and smoothers.
weighted average linkage [CLUS] → hierarchical clustering (◊ agglomerative clustering)
weighted centroid linkage [CLUS] → hierarchical clustering (◊ agglomerative clustering)
weighted graph [MISC] → graph theory
weighted least squares regression (WLS) [REGR] → generalized least squares regression
weighted mean [DESC] → location
weighted nearest means classification (WNMC) [CLAS] → centroid classification
weighted variance [DESC] → dispersion
well-conditioned matrix [ALGE] → matrix condition
well-determined system [PREP] → object
well-structured admissibility [CLUS] → assessment of clustering (◊ admissibility properties)
Westenberg's test [TEST] → hypothesis test
Westlake design [EXDE] → design
white noise [TIME]  Stationary stochastic process defined as a sequence of i.i.d. random variables a(t). This process is stationary with mean function

μ(t) = E[a(t)]

and with autocovariance and autocorrelation functions

γ(t, s) = var[a(t)]   and   ρ(t, s) = 1   if t = s;

otherwise both are zero.

wide-band process [TIME] → stochastic process (◊ narrow-band process)
Wiener-Levy process [TIME] → stochastic process
Wilcoxon-Mann-Whitney's test [TEST] → hypothesis test
Wilcoxon's test [TEST] → hypothesis test
Wilks' Λ test [FACT] → rank analysis
Wilk's test [TEST] → hypothesis test
Williams-Lambert clustering [CLUS] → hierarchical clustering (◊ divisive clustering)
Williams plot [GRAPH] → scatter plot (◊ residual plot)
window smoother [REGR] → smoother
Winsorized estimator [ESTIM] → estimator
within-group covariance matrix [DESC] → covariance matrix
X

x̄-chart [QUAL] → control chart (◊ variable control chart)
Y

Yates algorithm [EXDE]  Algorithm for calculating estimates of the effects of factors and of their interactions in a two-level factorial design. The outcomes of the experimental runs of a K-factor factorial design are first written in a column in Yates order. Another column is then calculated by adding together the consecutive pairs of numbers from the first column and, below these sums, subtracting the top number from the bottom number of each pair. With the same technique a total of K columns are generated; the entries of each new column are sums and differences of pairs of numbers from the previous column. The last column contains the effect totals (contrasts): the first value corresponds to the mean and the remaining values to the factors and their interactions, in standard order. The estimate of the mean is obtained by dividing the first value by 2^K; the factor and interaction effects are estimated by dividing the corresponding totals by 2^(K−1).
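A minimal sketch of the column-building scheme just described, for responses given in Yates (standard) order; the 2² example values and the function name are illustrative:

```python
import numpy as np

def yates(responses):
    """Effect estimates for a 2^K factorial with responses in Yates (standard) order."""
    y = np.asarray(responses, float)
    n = y.size
    k = int(np.log2(n))
    assert 2 ** k == n, "length must be a power of two"
    col = y.copy()
    for _ in range(k):
        pairs = col.reshape(-1, 2)
        col = np.concatenate([pairs.sum(axis=1),            # sums of consecutive pairs
                              pairs[:, 1] - pairs[:, 0]])    # bottom minus top of each pair
    mean = col[0] / n                                        # grand total divided by 2^K
    effects = col[1:] / (n / 2)                              # contrasts divided by 2^(K-1)
    return mean, effects

# 2^2 example in standard order: (1), a, b, ab
mean, effects = yates([28.0, 36.0, 18.0, 31.0])
print("mean:", mean)                     # 28.25
print("effects [A, B, AB]:", effects)    # [10.5, -7.5, 2.5]
```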
Yates chi-squared coefficient [GEOM] → distance (◊ binary data)
Yates order [EXDE]  (: standard order of runs)  The most often used order of runs in a two-level factorial design. The first column of the design matrix consists of successive minus and plus signs, the second column of successive pairs of minus and plus signs, the third column of four minus signs
followed by four plus signs, and so forth. In general, the kth column consists of 2^(k−1) minus signs followed by 2^(k−1) plus signs, repeated. For example, the design matrix for a 2³ design is:

x1  x2  x3
 −   −   −
 +   −   −
 −   +   −
 +   +   −
 −   −   +
 +   −   +
 −   +   +
 +   +   +
Youden square design [EXDE] → design
Yule coefficient [GEOM] → distance (◊ binary data)
Yule-Walker equations [TIME] → time series model (◊ autoregressive model)
Z

z-chart [GRAPH]  Plot of time series data that contains three lines forming a z shape. The lower line is the plot of the original time series, the center line is a cumulative total, and the upper line is a moving total.

z-score [PREP] → standardization (◊ autoscaling)
zero matrix [ALGE] → matrix
zero-order regression model [REGR] → regression model
zooming [GRAPH] → interactive computer graphics
References
[ALGE] LINEAR ALGEBRA
G.H. Golub and C.F. Van Loan, Matrix Computations. Johns Hopkins University Press, Baltimore, MD (USA), 1983
A.S. Householder, The Theory of Matrices in Numerical Analysis. Dover Publications, New York, NY (USA), 1974
A. Jennings, Matrix Computation for Engineers and Scientists. Wiley, New York, NY (USA), 1977
B.N. Parlett, The Symmetric Eigenvalue Problem. Prentice-Hall, Englewood Cliffs, NJ (USA), 1980
[ANOVA] ANALYSIS OF VARIANCE
O.J. Dunn and V.A. Clark, Applied Statistics: Analysis of Variance and Regression. Wiley, New York, NY (USA), 1974
L. Fisher, Fixed Effects Analysis of Variance. Academic Press, New York, NY (USA), 1978
D.J. Hand, Multivariate Analysis of Variance and Repeated Measures. Chapman and Hall, London (UK), 1987
A. Huitson, The Analysis of Variance. Griffin, London (UK), 1966
S.R. Searle, Variance Components. Wiley, New York, NY (USA), 1992
G.O. Wesolowsky, Multiple Regression and Analysis of Variance. Wiley, New York, NY (USA), 1976
[CLAS] CLASSIFICATION
H.H. Bock, Automatische Klassifikation. Vandenhoeck & Rupprecht, Gottingen (GER), 1974 w I. Bratko and N. Lavrac (Eds.), Progress in Machine Learning. Sigma Press, Wilmslow (UK), 1987 L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, Classification and Regression 'kees. Wadsworth, Belmont, CA (USA), 1984 w H.T. Clifford and W. Stephenson, An Introduction to Numerical Classification. Academic Press, New York, NY (USA), 1975 D. Coomans, D.L. Massart, I. Broeckaert, and A. 'hssin, Potential methods in pattern recognition. Anal. Chim. Acta, 133,215 (1981) D. Coomans, D.L. Massart, and I. Broeckaert, Potential methods in pattern recognition. Part 4 A combination of ALLOC and statistical linear discriminant analysis. Anal. Chim. Acta, 133. 215 (1981) TM. Cover and P.E. Hart, Nearest neighbor pattern classification. IEEE 'Itans., 21 (1967) o M.P. Derde and D.L. Massart, UNEQ:a disjoint modelling technique for pattern recognition based on normal distribution. Anal. Chim. Acta, 184,33 (1986)
M.P. Derde and D.L. Massart, Comparison of the performance of the class modelling techniques UNEQ, SIMCA and PRIMA. Chemolab, 4, 65 (1988) R.A. Eisenbeis, Discriminant Analysis and Classification Procedures: Theory and Applications. Lexington, 1972 M. Forina, C. Armanino, R. Leardi, and G. Drava, A class-modelling technique based on potential functions. J. Chemometrics, 5, 435 (1991) I.E. Frank, DASCO -A new classification method. Chemolab, 4,215 (1988) I.E. Frank and J.H.Friedman, Classification: oldtimers and newcomers. J. Chemometrics, 3, 463 (1989) I.E. Frank and S. Lanteri, Classification models: discriminant analysis, SIMCA, CART. Chemolab, 5 247 (1989) J.H. Friedman, Regularized Discriminant Analysis. J. Am. Statist. Assoc., 165 (1989) M. Goldstein, Discrete Discriminant Analysis. Wiley, New York, NY (USA), 1978 L. Gordon and R.A. Olsen, Asymptotically efficient solutions to the classification problem. Ann. Statist., 6, 515 (1978) D.J. Hand, Kernel discriminant analysis. Research Studies Press, Letchworth (UK), 1982 D.J. Hand, Discrimination and Classification. Wiley, Chichester (UK), 1981 M. James, Classification Algorithms. Collins, London (UK), 1985 I. Juricskay and G.E. Veress, PRIMA: a new pattern recognition method. Anal. Chim. Acta, 171.61 (1985) W.R. Klecka, Discriminant Analysis. Sage Publications, Beverly Hills, CA (USA), 1980 B.R. Kowalski and C.F. Bender, The k-nearest neighbour classification rule (pattern recognition) applied to nuclear magnetic resonance spectral interpretation. Anal. Chem., 4 1405 (1972) PA. Lachenbruch, Discriminant Analysis. Hafner Press, New York, NY (USA), 1975 G.J. McLachlan, Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York, NY (USA), 1992 N.J. Nilson, Learning Machines. McGraw-Hill, New York, NY (USA), 1965 R. Todeschini and E. Marengo, Linear Discriminant Classification Wee (LDCT): a user-driven multicriteria classification method. Chemolab, 16,25 (1992) G.T. Toussaint, Bibliography on estimation of misclassification. Information Theory, 472 (1974) H. van der Voet and D.A. Doornbos, The improvement of SIMCA classification by using kernel density estimation. Part 1. A new probabilistic approach classification technique and how to evaluate such a technique. Anal. Chim. Acta, 161,115 (1984) H. van der Voet and D.A. Doornbos, The improvement of SIMCA classification by using kernel density estimation. Part 2. Practical evaluation of SIMCA, ALLOC and CLASSY on three data sets. Anal. Chim. Acta, 161,125 (1984) H.van der Voet, P.M.J. Coenegracht, and J.B.Hemel, The evaluation of probabilistic classification methods. Part 1.A Monte Carlo study with ALLOC. Anal. Chim. Acta, 191,47 (1986) H. van der Voet, J.B. Hemel, and P.M.J. Coenegracht, New probabilstic version of the SIMCA and CLASSY classification methods. Part 2. Practical evaluation. Anal. Chim. Acta, 191,63 (1986) S . Wold, Pattern recognition by means of disjoint principal components models. Pattern Recognition, S, 127 (1976) S. Wold, The analysis of multivariate chemical data using SIMCA and MACUP. Kern. Kemi, 3 401 (1982) S . Wold and M. Sjstrom, Comments on a recent evaluation of the SIMCA method. J. Chemometrics, L 243 (1987)
[CLUS] CLUSTER ANALYSIS rn L.A. Abbott, EA. Bisby, and D.J. Rogers, lsxonomic Analysis in Biology. Columbia Univ. Press, New York, NY (USA), 1985 m M.S. Aldenderfer and R.K. Blashfield, Cluster Analysis. Sage Publications, Beverly Hills, CA (USA), 1984 rn M.R. Anderberg, Cluster Analysis for Applications. Academic Press, New York, NY (USA), 1973 0 G.H. Ball and D.J. Hall, A clustering technique for summarizing multivariate data. Behav. Sci., 2, 153 (1967) rn E.J. Bijnen, Cluster Analysis. Tilburg University Press, 1973 rn A.J. Cole, Numerical lsxonomy. Academic Press, New York, NY (USA), 1969 0 D. Coomans and D.L. Massart, Potential methods In pattern recognition. Part 2. CLUPOT, an unsupervised pattern recognition technique. Anal. Chim. Acta, 133,225 (1981) rn B.S. Duran, Cluster Analysis. Springer-Verlag, Berlin (GER), 1974 A.W.F. Edwards and L.L. Cavalli-Sforza, A method for cluster analysis. Biometrics, 21,362 (1965) B.S. Everitt, Cluster Analysis. Heineman Educational Books, London (UK), 1980 0 E.B. Fowlkes, R. Gnanadesikan, and J.R. Kettenring, Variable selection In clustering. J. Classif., 5 205 (1988) 0 H.P. Friedman and J. Rubin, On some invariant criteria for grouping data. J. Am. Statist. Assoc., @, 1159 (1967) A.D. Gordon, Classification Methods for the Exploratory Analysis of Multivariate Data. Chapman & Hall, London (UK), 1981 0 J.C. Gower, Maximal predictive classification. Biometrics, 643 (1974) 0 P. Hansen and M. Delattre, Complete-link analysis by graph coloring. J. Am. Statist. Assoc., -3 7 397 (1978) rn J. Hartigan, ClusteringAlgorithms. Wiley, New York, NY (USA), 1975 rn A.K. Jain and R.C. Dubes, Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, NJ (USA), 1988 rn M. Jambu, Cluster Analysis and Data Analysis. North-Holland, Amsterdam (The Netherlands), 1983 rn N. Jardine and R. Sibson, Mathematical lsxonomy. Wiley, London (UK), 1971 0 R.A. Jarvis and E.A. Patrick, Clustering using a similarity measure based on shared nearest neighbours. IEEE 'Itans. Comput., 1025 (1973) rn L. Kaufman and P.J. Rousseeuw, Finding Groups in Data An Introduction to Cluster Analysis. Wiley, New York, NY (USA), 1990 R.G. Lawson and P.C. Jurs, New index for clustering tendency and its application to chemical problems. J. Chem. Inf. Comput. Sci., -03 36 (1990) 0 E. Marengo and R. Todeschini, Linear discriminant hierarchical clustering: a modeling and crossvalidable clustering method. Chemolab, 19,43 (1993) 0 F.H.C. Marriott, Optimization methods of cluster analysis. Biometrika, 417 (1982) rn D.L. Massart and L. Kaufman, The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis. Wiley, New York, NY (USA), 1983 D.L. Massart, F. Plastria, and L. Kaufman, Non-hierarchical clustering with MASLOC. Pattern Recognition, &, 507 (1983) rn P.M. Mather, Cluster Analysis. Computer Applications, Nottingham (UK), 1969 0 G.W. Milligan and P.D. Isaac, The validation of four ultrametric clustering algorithms. Pattern Recognition, -02 41 (1980) 0 W.M. Rand, Objective criteria for the evaluation of clustering methods. J. Am. Statist. Assoc., 66, 846 (1971)
w H.C. Romesburg, Cluster Analysis for Researchers. Lifetime Learning Publications, Belmont, CA (USA), 1984 A.J. Scott and H.J. Symons, Clustering methods based on likelihood ratio criteria. Biometrics, 27, 387 (1971) w P.H.A. Sneath and R.R. Sokal, Numerical Taxonomy. Freeman, San Francisco, CA (USA), 1973 w H. Spath, Cluster Analysis Algorithms. Wiley, New York, NY (USA), 1980 M.J. Symons, Clustering criteria and multivariate normal mixtures. Biometrics, 37, 35 (1981) w R.C. w o n , Cluster Analysis. McGraw-Hill, New York, NY (USA), 1970 J.W. Van Ness, Admissible clustering procedures. Biometrika, @, 422 (1973) w J. Van Ryzin (Ed.), Classification and Clustering. Academic Press, New York, NY (USA), 1977 W. Vogt, D. Nagel, and H. Sator, Cluster Analysis in Clinical Chemistry: A Model. Wiley, Chichester (UK), 1987 J.H. Ward, Hierarchical grouping to optimize an objective function. J. Am. Statist. Assoc., 3, 236 (1963) P. Willett, Clustering tendency in chemical classification. J. Chem. Inf. Comput. Sci., 25,78 (1985) P. Willett, Similarity and Clustering in Chemical Information Systems. Research Studies Press, Letchworth (UK), 1987 W.T Williams and J.M. Lambert, Multivariate methods in plant eehology. 2. The use of an electronic 8 4 689 (1960) computer for association analysis. J. Echology, 0 D. Wishart, Mode Analysis: a generalization of nearest neighbour which reduces chaining effect. in Numerical Taxonomy, Ed. A.J. Cole, Academic Press, New York, NY, 1969, p. 282. J. Zupan, A new approach to binary tree-based heuristics. Anal. Chim. Acta, 122. 337 (1980) 0 J. Zupan, Hierarchical clustering of infrared spectra. Anal. Chim. Acta, 139.143 (1982) w J. Zupan, Clustering of Large Data Sets. Research Studies Press, Chichester (UK), 1982
[ESTIM] ESTIMATION
Y. Bard, Nonlinear Parameter Estimation. Academic Press, New York, NY (USA), 1974
P.J. Huber, Robust Statistics. Wiley, New York, NY (USA), 1981
J.S. Maritz, Distribution-free Statistical Methods. Chapman & Hall, London (UK), 1981
O. Richter, Parameter Estimation in Ecology. VCH Publishers, Weinheim (GER), 1990
P.J. Rousseeuw, Tutorial to robust statistics. J. Chemometrics, 5, 1 (1991)
B.W. Silverman, Density Estimation for Statistics and Data Analysis. Research Studies Press, Letchworth (UK), 1986
[EXDE] EXPERIMENTAL DESIGN
KM. Abdelbasit and R.L. Plackett, Experimental Design for Binary Data. J. Am. Statist. Assoc.,
28, 90 (1983) D.F. Andrews and A.M. Herzberg, The Robustness and Optimality of response Surface Designs. J. Statist. Plan. Infer., 2,249 (1979) w TB. Barker, Quality by Experimental Design. Dekker, New York, NY (USA), 1985 G.E.P. Box and D.W. Behnken, Some new three level designs for the study of quantitative variables. Technometrics, 2,445 (1960) w G.E.P. Box and N.R. Draper, Evolutionary Operation. Wiley, New York, NY (USA), 1969 G.E.P. Box and N.R. Draper, Empirical Model-Building and Response Surfaces. Wiley, New York, NY (USA), 1987
rn G.E.P. Box, W.G. Hunter, and J.S. Hunter, Statistics for Experimenters. An Introduction to Design, Data Analysis, and Model Building. Wiley, New York, NY (USA), 1978 N. Bratchell, Multivariate response surface modelling by principal components analysis. J. Chemometrics, 3, 579 (1989) rn R. Carlson, Design and optimization in organic synthesis. Elsevier, Amsterdam (NL), 1992 S. Clementi, G. Cruciani, C. Curti and B. Skakerberg, PLS response surface optimization: the CARS0 procedure. J. Chemometrics, 3, 499 (1989) rn J.A. Cornell, Experiments with Mixtures. Wiley, New York, NY (USA), 1990 (2nd ed.) J.A. Cornell, Experiments with Mixtures: A Review. Technometrics, & 437 (1973) o J.A. Cornell, Experiments with Mixtures: An Update and Bibliography. Rchnometrics, -l2 95 (1979) S.N. Deming and S.L.Morgan, Experimental Design: A Chemometric Approach. Elsevier, Amsterdam (NL), 1987 C.A.A. Duinveld, A.K. Smilde and D.A. Doornbos, Comparison of experimental designs combining process and mixture variables. Part 1. Design construction and theoretical evaluation. Chemolab, 19, 295 (1993) o C.A.A. Duinveld, A.K. Smilde and D.A. Doornbos, Comparison of experimental deslgns combining process and mixture variables. Part 11. Design evaluation on measured data. Chemolab, 19,309 (1993) rn V.V. Fedorov, Theory of Optimal Experiments. Academic Press, New York, NY (USA), 1972 rn P.D. Haaland, Experimental Design in Biotechnology. Marcel Dekker, New York, NY (USA), 1989 J.S. Hunter, Statistical Design Applied to Product Design. J. Qual. Control, l7,210 (1985) W.G. Hunter and J.R. Kittrell, Evolutionary Operation: A Review. Technometrics, 389 (1966) A.I. Khuri and J.A. Cornell, Response Surfaces Designs and Analysis. Marcel Dekker, New York, NY (USA), 1987 R.E. Kirk, Experimental Design. Wadsworth, Belmont, CA (USA), 1982 o E. Marengo and R. Todeschini, A new algorithm for optimal, distance based, experimental design. Chemolab, 2,117 (1991) rn R.L. Mason, R.F. Gunst, and J.L. Hess, Statistical Design and Analysis of Experiments with Applications to Engineering and Science. Wiley, New York, NY (USA), 1989 rn R. Mead, The Design of Experiments. Cambridge Univ. Press, Cambridge (UK), 1988 R. Mead and D.J. Pike, A review of response surface methodology from a hiometric viewpoint. Biometrics, 3l, 803 (1975) rn D.C. Montgomery, Design and Analysis oPExperiments. Wiley, New York, NY (USA), 1984 rn E. Morgan, Chemometrics: Experimental Design. Wiley, Chichester (UK), 1991 o R.L. Plackett and J.P. Burman, The Design of Optimum Multifactorial Experiments. Biometrika, 33, 305 (1946) o D.M. Steinberg and W.G. Hunter, Experimental Design: Review and Comment. Rchnometrics, 4 441 (1984) rn G. Tagughi and Y. Wu, Introduction to Off-Line Quality Control. Central Japan Quality Control Association (JPN), 1979
[FACT] FACTOR ANALYSIS
rn J.F! Benzecri, L'Analyse des Correspondences. Dunod, Paris (FR), 2 vols., 1980 (3rd ed.) 0 M. Feinberg, The utility of correspondence factor analysis for making decisions from chemical data. Anal. Chim. Acta, 191,75 (1986) rn B. Flury, Common Principal Components and Related Multivariate Models. Wiley, New York, NY (USA), 1988
H. Gampp, M. Maeder, C.J. Meyer, and A.D. Zuberbuehler, Evolving Factor Analysis. Comments Inorg. Chem., G, 41 (1987) R.L. Gorsuch, Factor Analysis. Saunders, Philadelphia PA, 1974 M.J. Greenacre, Theory and Applications of Correspondence Analysis. Academic Press, London (UK), 1984 H.H. Harman, Modern Factor Analysis. Chicago Univerisity Press, Chicago, IL (USA), 1976 P.K. Hopke, Target transformation factor analysis. Chemolab, 6 7 (1989) J.E. Jackson, A User's Guide to Principal Components. Wiley, New York, NY (USA), 1991 1.T Joliffe, Principal Component Analysis. Springer-Verlag, New York, NY, 1986 H.R. Keller and D.L. Massart, Evolving factor analysis. Chemolab, 2, 209 (1992) J. Kim and C.W. Mueller, Factor Analysis. Sage Publications, Beverly Hills, CA (USA), 1978 D.N. Lawley and A.E. Maxwell, Factor Analysis as Statistical Method. Macmilian, New York, NY (USA) / Butterworths, London (UK), 1971 (2nd ed.) M. Maeder, Evolving factor analysis for the resolution of overlapping chromatographic peaks. Anal. Chem., 527 (1987) M. Maeder and A. Zilian, Evolving Factor Analysis, a New Whnique in Chromatography. Chemolab, 2, 205 (1988) E.R. Malinowski and D.G. Howery, Factor Analysis in Chemistry. Wiley, New York, NY (USA), 1980-1991 (2nd ed.) w S.A. Mulaik, The Foundations of Factor Analysis. McGraw-Hill, New York, NY (USA), 1972 R.J. Rummel, Applied Factor Analysis. Northwestern University Press, Evanston, IL (USA), 1970 C.J.E Ter Braak, Canonical correspondence analysis: a new elgenvector technique for multivariate direct gradient analysis. Ecology, 7 6 1167 (1986) L.L. Thurstone, Multiple Factor-Analysis. A development and expansion of the vectors of mind. Chicago University Press, Chicago, IL (USA), 1947
[GEOM] GEOMETRICAL CONCEPTS
C.M. Cuadras, Distancias Estadisticas. Estadistica Espafi ola,-03 295 (1988) J.C. Gower, A general coefficient of similarity and some of its properties. Biometrics, 22,857 (1971)
[GRAPH] GRAPHICAL DATA ANALYSIS
D.E Andrews, Plots of high-dimensional data. Biometrics, 28, 125 (1972) H.P. Andrews, R.D. Snee, and M.H. Sarner, Graphical display of means. The Am. Statist., 195 (1980) F.J. Anscombe, Graphs in statistical analysis. The Am. Statist., 22, 17 (1973) w J.M. Chambers, W.S. Cleveland, B. Kleiner, and P.A. lbkey, Graphical Methods for Data Analysis. Wadsworth & Brooks, Pacific Grove, CA (USA), 1983 H. Chernoff, The use of faces to represent points in k-dimensional space graphically. J. Am. Statist. Assoc., 68- 361 (1973) B. Everitt, Graphical Techniques for Multivariate Data. Heinemann Educational Books, London (UK), 1978 S.E. Fienberg, Graphical methods in statistics. The Am. Statist., 2,165 (1979) 0 J.H. Friedman and L.C. Rafsky, Graphics for the multivariate two-sample problem. J. Am. Statist. 7 277 (1981) Assoc., -6 0 K.R. Gabriel, The biplot graphic display of matrices with applications to principal components anal453 (1971) ysis. Biometrika,
B. Kleiner and J.A. Hartigan, Representing points in many dimensions by trees and castles. J. Am. Statist. Assoc., 76, 260 (1981) R. Leardi, E. Marengo, and R. Todeschini, A new procedure for the visual inspection of multivariate data ofdifferent geographic origins. Chemolab, 12, 181 (1991) R. McGill, J.W. Tbkey, and W.A. Larsen, Variation of box plots. The Am. Statist., 12 (1978) D.W. Scott, On optimal and data-based histograms. Biometrika, -6 605 (1979) H. Wainer and D. Thissen, Graphical Data Analysis. Ann.Rev.Psychol., 22,191 (1981) K. Wakimoto and M. 'Eiguri, Constellation graphical method for representing multidimensional data. Ann. Inst. Statist. Mathem., 97 (1978) w P.C.C. Wang (Ed.), Graphical Representation of Multivariate Data. Academic Press, New York, NY (USA), 1978
[MISC] MISCELLANEOUS
w D. Bonchev, Information Theoretic Indices for Characterization of Chemical Structure. Research Studies Press, Letchworth (UK), 1983 w L. Davis, Handbook of Genetic Algorithms. Van Nostrand Reinhold, New York, NY (USA), 1991 w K. Eckschlager and V. Stepanek, Analytical Measurement and Information: Advances in the Information Theoretic Approach to Chemical Analysis. Research Studies Press, Letchworth (UK), 1985 m D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA (USA), 1989 w R.W. Hamming, Coding and Information Theory. Prentice-Hall, Englewood Cliffs NJ (USA), 19801986 (2nd ed.) w J. Hertz, A. Krogh and R.G. Palmer, Introduction to Theory of Neural Computation. Addison-Wesley, New York, NY (USA), 1991 D.B. Hibbert, Genetic algorithms in chemistry. Chemolab, fi 277 (1993) w Z. Hippe, Artificial Intelligence in Chemistry. Elsevier, Warszawa (POL), 1991 m J.H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, MI (USA), 1975 G.J. Klir and TA. Folger, Fuzzy Sets, Uncertainty, and Information. Prentice-Hall, Englewood Cliffs, NJ (USA), 1988 R. Leardi, R. Boggia and M. Terrile, Genetic algorithms as a strategy for feature selection. J. Chemometrics, 5, 267 (1992) o C.B. Lucasius and G. Kateman, Understanding and using genetic algorithms. Part 1. Concepts, properties and context. Chemolab, 19, 1 (1993) w N.J. Nilsson, Principles of Artificial Intelligence. Springer-Verlag, Berlin (GER), 1982 o A.P. de Weijer, C.B. Lucasius, L. Buydens, G. Kateman, and H.M. Heuvel, Using genetic Algorithms 45 (1993) for an artificial neural network model inversion. Chemolab, B.J. Wythoff, Backpropagation neural networks. A tutorial. Chemoiab, 18,115 (1993) J. Zupan and J. Gasteiger, Neural networks: A new method for solving Chemical problems or just a passing phase?. Anal. Chim. Acta, 248. 1 (1991)
[MODEL] MODELING
B. Efron and R. Tibshirani, An Introduction to the Bootstrap. Chapman & Hall, New York, NY (USA), 1993 B. Efron and R. Tibshirani, Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy. Statistical Science, 1,54 (1986) B. Efron, Estimating the error rate of a prediction rule: improvements on cross-validation. J. Am. 316 (1983) Statist. Assoc., G.H. Golub, M. Heath, and G. Wahba, Generalized Cross-validation as a Method for Choosing a Good Ridge Parameter. Technometrics, 2 l , 215 (1979) S . Lanteri, Full validation for feature selection in classification and regression problems. Chemolab, 159 (1992) A.M. Law and W.D. Kelton, Simulation Modeling and Analysis. McGraw-Hill, New York, NY (USA), 1991 D.W. Osten, Selection of optimal regression models via cross-validation. J. Chemometrics, 2, 39 (1988) M. Stone, Cross-validatory Choice and Assessment of Statistical Predictions. Journal of Royal Statistical Society, Ser. B 36, 111 (1974)
[MULT] MULTIVARIATE ANALYSIS
A.A. Afifi and S.P. h e n , Statistical Analysis: A Computer Oriented Approach. Academic Press, New York, NY (USA), 1979 A.A. Afifi and V. Clark, Computer-Aided Multivariate Analysis. Wadsworth, Belmont, CA (USA), 1984 I.H. Bernstein, Applied Multivariate Analysis. Springer-Verlag, New York, NY (USA), 1988 H. Bozdogan and A.K. Gupta, Multivariate Statistical Modeling and Data Analysis. Reidel Publishers, Dordrecht (NL), 1987 R.G. Brereton, Multivariate pattern recognition in chemometrics, illustred by case studies. Elsevier, Amsterdam (NL), 1992 H. Bryant and W.R. Atchley, Multivariate Statistical Methods. Dowden, Hutchinson & Ross, Stroudsberg, PA (USA), 1975 N.B. Chapman and J. Shorter, Correlation Analysis in Chemistry. Plenum Press, New York, NY (USA), 1978 C. Chatfield and A.J. Collins, Introduction to Multivariate Analysis. Chapman & Hall, London (UK), 1986 W.W. Cooley and P.R. Lohnes, Multivariate Data Analysis. Wiley, New York, NY (USA), 1971 D. Coomans and I. Broeckaert, Potential Pattern Recognition in Chemical and Medical Decision Making. Research Studies Press, Letchworth (UK), 1986 M.L. Davison, Multidimensonal scaling. Wiley, New York, NY (USA), 1983 W.R. Dillon and M. Goldstein, Multivariate Analysis. Methods and Applications. Wiley, New York, NY (USA), 1984 R.O. Duda and PE.Hart, Pattern Classification and Scene Analysis. Wiley, New York, NY (USA), 1973 M.L. Eaton, Multivariate Statistics. Wiley, New York, NY (USA), 1983 K. Esbensen and P. Geladi, Strategy of multivariate image analysis (MIA). Chemolab, 67 (1989) B.S. Everitt, An Introduction to Latent Variable Models. Chapman & Hall, London (UK), 1984 R. Giffins, Canonical Analysis: A Review with Applications in Ecology. Biomathematics 12, SpringerVerlag, Berlin (GER), 1985 R. Gnanadesikan, Methods for Statistical Data Analysis of Multivariate Observations. Wiley, New
York, NY (USA), 1977 rn F! Green, Mathematical Tools for Applied Multivariate Analysis. Academic Press, San Diego, CA (USA), 1978 rn 1.1. Joffe, Application of Pattern Recognition to Catalytic Research. Research Studies Prees, Letchworth (UK), 1988 rn R.A. Johnson and D.W. Wichern, Applied Multivariate Statistical Analysis. Prentice-Hall, London (UK), 1982-1992 (3rd ed.) rn P.C. Jurs and TL. Isenhour, Chemical Applications of Pattern Recognition. Wiley-Interscience, New York, NY (USA), 1975 rn M. Kendall, Multivariate Analysis. Griffin, London (UK), 1980 rn P.R. Krishnaiah (Ed.), Multivariate Analysis. Academic Press, New York, NY (USA), 1966 W.J. Krzanowski, Principles of Multivariate Analysis. Oxford Science Publishers, Clarendon (UK), 1988 rn A.N. Kshirsagar, Multivariate Analysis. Dekker, New York, NY (USA), 1978 rn M.S. Levine, Canonical Analysis and Factor Comparisons. Sage Publications, Beverly Hills, CA (USA), 1977 rn P.J. Lewi, Multivariate Data Analysis in Industrial Practice. Research Studies Press, Letchworth (UK), 1982 B.F.J. Manly, Multivariate Statistical Methods. A Primer. Chapman & Hall, Bristol (UK), 1986 rn W.S. Meisel, Computer-oriented approach to pattern recognition. Academic Press, New York, NY (USA), 1972 rn K.V. Mardia, J.T. Kent, and J.M. Bibby, Multivariate Analysis. Academic Press, London (UK), 1979-1988 (6th ed.) rn D.E Morrison, Multivariate Statistical Methods. McGraw-Hill, New York, NY (USA), 1976 rn S.S.Schiffman, M.L. Reynolds and F.W.Young, Introduction to Multidimensional Scaling. Academic Press, Orlando, FL (USA), 1981 G.A.F. Seber, Multivariate Observations. Wiley, New York, NY (USA), 1984 rn M.S. Srivastana and E.M. Carter, An Introduction to Applied Multivariate Statistics. North-Holland, Amsterdam (NL), 1983 rn 0. Strouf, Chemical Pattern Recognition. Research Studies Press, Letchworth (UK), 1986 R.M. Thorndike, Correlational Procedures for Research. Gardner, New York, NY (USA), 1978 J.T.Tou and R.C. Gonzales, Pattern Recognition Principles. Addison-Wesley, Reading, MA (USA), 1974 rn J.P. Van der Geer, Introduction to Linear Multivariate Data Analysis. DSWO Press, Leiden (GER), 1986 rn E. Van der Burg, NonlineapCanonical Correlation and Some Related 'lkchniques. DSWO Press, Leiden (GER), 1988 K. Varmuza, Pattern Recognition in Chemistry. Springer-Verlag, Berlin (GER), 1980 rn D.D. Wolf and M.L. Pearson, Pattern Recognition Approach to Data Interpretation. Plenum Press, New York, NY (USA), 1983
[OPTIM] OPTIMIZATION
K.W.C. Burton and G. Nickless, Optimization via simplex Part 1. Background, definitions and simple applications. Chemolab, 1,135 (1987) rn B.S. Everitt, Introduction to Optimization Methods and their Application in statistics. Chapman & Hall, London (UK), 1987 rn R. Fletcher, Practical Methods of Optimization. Wiley, New York, NY (USA), 1987 ~1
J.H. Kalivas, Optimization using variations of simulated annealing. Chemolab, 5 l 1 (1992) rn L. Mackley (Ed.), Introduction to Optimization. Wiley, New York, NY (USA), 1988 0 J.A. Nelder and R. Mead, A simplex method for function minimization. Computer J., 2, 308 (1965) A.C. Norris, Computational Chemistry: An Introduction to Numerical Methods. Wiley, Chichester (UK), 1986 I C.H. Papadimitriou and K. Steiglitz, Combinatorial Optimization. Algorithms and Complexity. Prentice-Hall, Englewood Cliffs, NJ (USA), 1982 H P.J.M. van Laarhoven and E.H.L. Aarts, Simulated Annealing: Theory and Applications. Reidel, Dordrecht (GER), 1987
[PROB] PROBABILITY
M. Evans, N. Hastings, and B. Peacock, Statistical Distributions. Wiley, New York, NY (USA), 1993
M.A. Goldberg, An Introduction to Probability Theory with Statistical Applications. Plenum Press, New York, NY (USA), 1984
H.J. Larson, Introduction to Probability Theory and Statistical Inference. Wiley, New York, NY (USA), 1969
P.L. Meyer, Introductory Probability and Statistical Applications. Addison-Wesley, Reading, MA (USA), 1970
F. Mosteller, Probability with Statistical Applications. Addison-Wesley, Reading, MA (USA), 1970
[QUAL] QUALITY CONTROL
G.A. Barnard, G.E.P. Box, D. Cox, A.H. Seheult, and B.W. Silverman (Eds.), Industrial Quality and Productivity with Statistical Methods. The Royal Society, London (UK), 1989
D.H. Besterfield, Quality Control. Prentice-Hall, London (UK), 1979
M.M.W.B. Hendriks, J.H. de Boer, A.K. Smilde, and D.A. Doornbos, Multicriteria decision making. Chemolab, 175 (1992)
G. Kateman and F.W. Pijpers, Quality Control in Analytical Chemistry. Wiley, New York, NY (USA), 1981
D.C. Montgomery, Introduction to Statistical Quality Control. Wiley, New York, NY (USA), 1985
D.J. Wheeler and D.S. Chambers, Understanding Statistical Process Control. Addison-Wesley, Avon (UK), 1990
[REGR] REGRESSION ANALYSIS
F.J. Anscombe and J.W. Tukey, The examination and analysis of residuals. Technometrics, 141 (1963)
A.C. Atkinson, Plots, Transformations, and Regression. Oxford Univ. Press, Oxford (UK), 1985
D.M. Bates and D.G. Watts, Nonlinear Regression Analysis. Wiley, New York, NY (USA), 1988
D.A. Belsley, E. Kuh, and R.E. Welsch, Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley, New York, NY (USA), 1980
L. Breiman and J.H. Friedman, Estimating Optimal Transformations for Multiple Regression and Correlation. J. Am. Statist. Assoc., 80, 580 (1985)
S. Chatterjee and B. Price, Regression Analysis by Example. Wiley, New York, NY (USA), 1977
J. Cohen and P. Cohen, Applied Multiple Regression-Correlation Analysis for the Behavioral Sciences. Halsted, New York, NY (USA), 1975
rn R.D. Cook and S. Weisberg, Residuals and Influence in Regression. Chapman & Hall, New York, NY (USA), 1982 rn C. Daniel and F.S. Wood, Fitting Equations to Data. Wiley, New York, NY (USA), 1980 (2nd ed.) rn N. Draper and H. Smith, Applied Regression Analysis, Wiley, New York, NY (USA), 1966-1981 (2nd ed.) I.E.Frank, Intermediate least squares regression method. Chemolab, 1,233 (1987) I.E. Frank, A nonlinear PLS model. Chemolab, 8, 109 (1990) 0 I.E. Frank and J.H. Friedman, A statistical view of some chemometrics regression tools. Rchnometrics, 35, 109 (1993) J.H. Friedman, Multivariate adaptive regression splines. The Annals of Statistics, 19,1 (1991) 0 J.H. Friedman and W. Stuetzle, Projection pursuit regression. J. Am. Statist. Assoc., 76,817 (1981) T Gasser and M. Rosenblatt (Eds.), Smoothing Techniques for Curve Estimation. Springer-Verlag, Berlin (GER), 1979 rn M.H.J. Gruber, Regression Estimators: a comparative study. Academic Press, San Diego, CA (USA), 1990 rn R.E Gunst and R.L. Mason, Regression Analysis and Its Application: A Data-Oriented Approach. Marcel Dekker, New York, NY (USA), 1980 rn F.R. Hampel, E.M. Ronchetti, P.J. Rousseeuw, and W.A. Stahel, Robust Statlstics. The Approach based on Influence Functions. Wiley, New York, NY (USA), 1986 D.M. Hawkins, On the Investigation of Alternative Regression by Principal Components Analysis. Applied Statistics, 22,275 (1973) 0 R.R. Hocking, Developments in linear regression methodology: 1959-1982. Technometrics, -52 219 (1983) R.R. Hocking, The analysis and selection of variables in linear regresslon. Biometrics, -23 1 (1976) A.E. Hoerl and R.W. Kennard, Ridge Regression: Biased estlmation for non-orthogonal problems. Rchnometrics, l2, 55 (1970) 0 A. Hoskuldsson, PLS Regression Methods. J. Chemometrics, 2,211 (1988) m D.G. Kleinbaum and L.L. Kupper, Applied Regression Analysis and Other Multlvariable Methods. Duxbury Press, North Scituate, MA (USA), 1978 0 KG. Kowalski, On the predictive performance of biased regression methods and multiple linear regression. Chemolab, $ 177 (1990) 0 0. Kvalheim, The Latent Variable. Chemolab, 4l 1 (1992) 0. Kvalheim and TV. Karstang, Interpretation of latent-variable regression models. Chemolab, 2, 39 (1989) A. Lorber, L.E. Wangen, and B.R. Kowalski, A Theoretical Foundation for the PLS Algorithm. J. Chemometrics, 1,19 (1987) D.W. Marquardt, Generalized Inverses, Ridge Regression, Biased Linear Estimation, and Nonlinear Estimations. Rchnometrics, l2, 591 (1970) rn H. Martens and T Ns, Multivariate calibration. Wiley, New York, NY (USA), 1989 rn A.J. Miller, Subset Selection in Regression. Chapman & Hall, London (UK), 1990 rn F. Mosteller and J.W. Tbkey, Data Analysis and Regression. Addison-Wesley, Reading, MA (USA), 1977 rn R.H. Myers, Classical and Modem Regression with Applications. Duxbury Press, Boston, MA (USA), 1986 T Ns, C. Irgens, and H. Martens, Comparison of Linear Statistical Methods for Calibration of NIR Instruments. Applied Statistics, 35, 195 (1986) rn J. Neter and W. Wasserman, Applied Linear Statistical Models. Irwin, Homewood, IL (USA), 1974 rn C.R. Rao, Linear Statistical Inference and Its Applications. Wiley, New York, NY (USA), 1973 (2nd ed.)
P.J. Rousseeuw and A.M. Leroy, Robust Regression and Outliers Detection. Wiley, New York, NY (USA), 1987 H G.A.F. Seber, Linear Regression Analysis. Wiley, New York, NY (USA), 1977 S. Sekulic and B.R. Kowalski, MARS: a tutorial. J. Chemometrics, 6, 199 (1992) H J. Shorter, Correlation Analysis of Organic Reactivity: With Particular Reference to Multiple Regression. Research Studies Press, Chichester (UK), 1982 H G. Wahba, Spline Models for Observational Data. Society for Industrial and Applied Mathematics, Philadelphia, PA (USA), 1990 J.T. Webster, R.F. Gunst, and R.L. Mason, Latent Root Regression Analysis. Rchnometrics, & 513 (1974) H S. Weisberg, Applied Linear Regression. Wiley, New York, NY (USA), 1980 H G.B. Wetherill, Regression Analysis with Applications. Chapman & Hall, London (UK), 1986 S. Wold, N. Kettaneh-Wold and B. Skagerberg, Nonlinear PLS modeling. Chemolab, 2, 53 (1989) 0 S. Wold, P. Geladi, K. Esbensen, and J. Oehman, Multi-way Principal Components and PLS Analysis. J. Chemometrics, 1,41 (1987) o C. Yale and A.B. Forsythe, Winsorized regression. Technometrics, 18,291 (1976) H M.S. Younger, Handbook for Linear Regression. Duxbury Press, North Scituate, MA (USA), 1979 H
[TEST] HYPOTHESIS TESTING
J.V. Bradley, Distribution-free Statistical Tests. Prentice-Hall, Englewood Cliffs, NJ (USA), 1968
L.N.H. Bunt, Probability and Hypothesis Testing. Harrap, London (UK), 1968
E. Caulcott, Significance Tests. Routledge and Kegan Paul, London (UK), 1973
E.S. Edgington, Randomization Tests. Marcel Dekker, New York, NY (USA), 1980
K.R. Koch, Parameter Estimation and Hypothesis Testing in Linear Models. Springer-Verlag, Berlin (GER), 1988
E.L. Lehmann, Testing Statistical Hypotheses. Wiley, New York, NY (USA), 1986
[TIME] TIME SERIES
B.L. Bowerman, Forecasting and Time Series. Duxbury Press, Belmont, CA (USA), 1993
G.E.P. Box and G.M. Jenkins, Time Series Analysis. Holden-Day, San Francisco, CA (USA), 1976
C. Chatfield, The Analysis of Time Series: An Introduction. Chapman & Hall, London (UK), 1984
J.D. Cryer, Time Series Analysis. Duxbury Press, Boston, MA (USA), 1986
P.J. Diggle, Time Series. A Biostatistical Introduction. Clarendon Press, Oxford (UK), 1990
E.J. Hannan, Time Series Analysis. Methuen, London (UK), 1960
A.C. Harvey, Time Series Models. Wiley, New York, NY (USA), 1981
M. Kendall and J.K. Ord, Time Series. Edward Arnold, London (UK), 1990
D.C. Montgomery, L.A. Johnson, and J.S. Gardiner, Forecasting and Time Series Analysis. McGraw-Hill, New York, NY (USA), 1990
R.H. Shumway, Applied Statistical Time Series Analysis. Prentice Hall, Englewood Cliffs, NJ (USA), 1988
GENERAL STATISTICS AND CHEMOMETRICS
J. Aitchison, The Statistical Analysis of Compositional Data. Chapman & Hall, London (UK), 1986 J. Aitchison, Statistics for Geoscientists. Pergamon Press, Oxford (UK), 1987
S.E Arnold, Mathematical Statistics. Prentice-Hall, Englewood Cliffs, NJ (USA), 1990 V Barnett and T Lewis, Outliers in Statistical Data. Wiley, New York, NY (USA), 1978 H J.J. Breen and RE. Robinson, Environmental Applications of Chemometrics. ACS Symposium Series, vol. 292,Am. Chem. SOC.,Washington, D.C. (USA), 1985 H R.G. Brereton, Chemometrics. Applications of mathematics and statistics to laboratory systems. Ellis Honvood, Chichester (UK), 1990 H D.T Chapman and A.H. El-Shaarawi, Statistical Methods for the Assessment of Point Source Pollution. Khwer Academic Publishers, Dordrecht (NL), 1989 H D.J. Finney, Statistical Methods in Biological Assay. Griffin, Oxford (UK), 1978 H M. Forina, Introduzione alla Chimica Analitica con elementi di Chemiometria. ECIG, Genova (IT), 1993 H D.M. Hawkins, Identification of Outliers. Chapman & Hall, London (UK), 1980 m D.C. Hoaglin, F. Mosteller, and J.W. 'hkey, Understanding Robust and Exploratory Data Analysis. Wiley, New York, NY (USA), 1983 H W.J. Kennedy and J.E. Gentle, Statistical Computing. Marcel Dekker, New York, NY (USA), 1980 H B.R. Kowalski (Ed.), Chemometrics: Theory and Application. ACS Symposium Series, vol. 52, Am. Chem. SOC.,Washington, D.C. (USA), 1977 H B.R.Kowalski (Ed.), Chemometrics, Mathematics, and Statistics in Chemistry. Proceedings of the NATO AS1 - Cosenza 1983, Reidel Publishers, Dordrecht (NL), 1984 H D.L. Massart, R.G. Brereton, R.E. Dessy, P.K. Hopke, C.H. Spiegelman, and W. Wegscheider (Eds.), Chemometrics 'htorial. Collected from Chemolab., Vol. 1-5, Elsevier, Amsterdam (NL), 1990 H D.L. Massart, A. Dijkstra, and L. Kaufman, Evaluation and Optimization of Laboratoty Methods and Analytical Procedures. Elsevier, Amsterdam (NL), 1978-1984 (3rd ed.) H D.L. Massart, B.G.M. Vandeginste, S.N. Deming, Y. Michotte, and L. Kaufman, Chemometrics: A 'kxtbook. Elsevier, Amsterdam (NL), 1988 H M. Meloun, J. Militky and M. Forina, Chemometrics for Analytical chemistry. Volume 1: PC-Aided Statistical Data Analysis. Ellis Honvood, New York, NY (USA), 1992 H M.A. Sharaf, D.A. Illman, and B.R. Kowalski, Chemometrics. Wiley, New York, NY (USA), 1986 H J.W. 'hkey, Exploratory Data Analysis. Addison-Wesley, Reading, MA (USA), 1977 H J.H. Zar, Biostatistical Analysis. Prentice-Hall, Englewood Cliffs, NJ (USA), 1984 (2nd ed.) H H